Development of Speech Command Control Based TinyML System for Post-Stroke Dysarthria Therapy Device

Abstract—Post-stroke dysarthria (PSD) is a widespread outcome of a stroke. To support objective evaluation of dysarthria, the development of pathological voice recognition technology has received considerable attention. Soft robotic therapy devices have been introduced as an alternative for rehabilitation and hand-grasp assistance to improve activities of daily living (ADL). Despite significant progress in this field, most soft robotic therapy devices are complex and bulky, lack a pathological voice recognition model, demand large computational power, and rely on a stationary controller. This study aims to develop a portable, wireless multi-controller with simulated dysarthric vowel speech in Bahasa Indonesia and non-dysarthric micro-speech recognition, using a tiny machine learning (TinyML) system for hardware efficiency. The speech interface uses an INMP441 microphone, computes with a lightweight deep convolutional neural network (DCNN) design, and is embedded into an ESP32. Features are extracted with the Short-Time Fourier Transform (STFT) and fed into the CNN. This method has proven useful for micro-speech recognition with low computational power in both speech scenarios, with accuracy above 90%. Real-time inference on the ESP32 using a hand prosthetic, under three levels of household noise intensity (24 dB, 42 dB, and 62 dB), resulted in 95%, 85%, and 50% accuracy, respectively. Wireless connectivity latency between the two controllers is around 0.2-0.5 ms.


INTRODUCTION
Strokes are a type of illness in which parts of the brain are damaged by blood clots, resulting in various after-effects depending on the affected area and, in many cases, leading to death [1,2,3]. Research has revealed that strokes impact approximately 16 million individuals worldwide each year, and about 57% to 69% of stroke patients experience speech disorders, including dysarthria, aphasia, or both [4]. Post-stroke dysarthria (PSD) is a common and persistent aftermath of a stroke, affecting most acute stroke cases [5,6,7,8], but this group has received limited research attention, a gap particularly pronounced for Indonesian speakers. Non-invasive brain stimulation (NIBS), Lee Silverman Voice Treatment (LSVT), and other traditional dysarthria assessments primarily depend on invasive testing, with different speech-impairment treatments according to subject severity class, leading to significant variations in clinical improvement and efficacy [9][10][11][12][13]. To provide an objective basis for diagnosis, treatment, and assistance, pathological voice recognition technology has emerged as a helpful tool for distinguishing transdisciplinary programs for dysarthria treatment. This has led to increased interest and attention from both society and industry in clinical rehabilitation, aided by integrated robotic devices such as Amadeo™, HandMentor, HomeRehab, ReJoyce, Robotherapist 2D, and others [14,15,16,17], enriching patients' motivation to meet their basic needs in activities of daily living (ADL) and increasing the sufferer's level of independence.
Researchers have developed integrated robotic control-based rehabilitation that makes it easier for medical personnel to carry out renewable methods accompanying clinical therapy sessions, so patients can perform them independently and ergonomically [18,19]. Exo-glove poly device development has become popular among researchers striving for ergonomics, safety, portability, and reliability for users. The choice of materials [20,21,22,23], different mechanical approaches, and a variety of controls also significantly affect the user experience.
The underactuated motor tendon-driven mechanism concept uses a servo motor system with fewer cable transmissions. Control and navigation systems have switched from pneu-net or combustion-driven actuation, which is heavy and bulky, to an underactuated servo control mechanism [22][23][24][25][26][27][28][29][30] that leaves much more room for portability. To provide a novel, enhanced control experience, this study develops a more comfortable way to control such therapy devices, like exo-gloves, with an edge or TinyML system, using a simulated dysarthric speech interface with a keyword spotting (KWS) system for a better and more reliable user experience.
Researchers use deep neural networks (DNNs) to conduct further analysis of pathological voices, owing to their exceptional capability in learning features for identifying or operating a device [31][32][33][34][35][36][37]. However, most existing methods have limitations in detecting pathological voice with hardware efficiency. This is because they rely on classical features such as Mel-frequency cepstral coefficients (MFCC), which cannot fully capture the deep characteristics of pathological sounds and require a significant amount of computational power and RAM to run.
Bambang Riyanta, Development of Speech Command Control Based TinyML System for Post-Stroke Dysarthria Therapy Device

David et al. [31] researched ASR for dysarthria with KWS using a CNN model, developing a software ecosystem for full recognition of dysarthric speech for specific patient-assistance needs, running on a personal computer. The implementation accuracy of the model, averaged over all target classes, was 86%.
Y.-Y. Lin et al. [32] compared speech feature models for dysarthric speech preprocessing and proposed a convolutional neural network with phonetic posteriorgram (CNN-PPG) model. They also compared this model with a CNN using Mel-frequency cepstral coefficients (CNN-MFCC) and an ASR-based system. The experimental results showed that the CNN-PPG system had 93.49% accuracy, better than both the CNN-MFCC (65.67%) and the ASR-based system (89.59%). Furthermore, the CNN-PPG model was smaller, only about 54% of the baseline's size.
Yakoub et al. [33] developed dysarthric speech recognition using an EMDH-CNN approach, combining empirical mode decomposition and Hurst-based mode selection (EMDH) with a CNN to enhance dysarthric speech recognition. The EMDH technique acts as a preprocessing step to improve speech quality, followed by MFCC feature extraction from the processed speech, achieving around 62.7% overall accuracy.
Vachhani et al. [34] conducted a study on the use of data augmentation through temporal and speed modifications of healthy speech to simulate dysarthric speech. They evaluated the results with both DNN-HMM-based automatic speech recognition (ASR) and Random Forest classification. The synthetic dysarthric speech was classified by severity level using a Random Forest classifier trained on real dysarthric speech samples. The results showed improvements of 4.24% and 2% in ASR performance when using tempo-based and speed-based data augmentation, respectively, compared with training on healthy speech alone. Shahamiri et al. [35] presented Speech Vision (SV), which recognizes word shapes using visual-feature representations. Additionally, SV deals with limited dysarthric speech samples by applying data augmentation, generating synthetic speech, and utilizing transfer learning. Evaluations showed SV achieving average accuracy rates of 61.11% and 64.71%. Joshy et al. [36] presented a comparative study on the classification of dysarthria severity levels using a pretrained DeepSpeech-1 ASR model. Trained on 1000 hours of data, it achieved results comparable to a simple MFCC-DNN on an unbalanced small dataset, with an average accuracy of 70.52%.
Jiang et al. [37] present a hybrid recognition model combining 1DCNN and double-LSTM (DLSTM) networks based on MFCC features of pathological voices. To build the model, a database of syllable pronunciations was created from Mandarin-speaking participants, including normal adults and patients with PSD. The 1DCNN network was then used to process the MFCC features of the syllable pronunciations and extract deeper hidden features. The experiments showed that further processing of MFCC features by the 1DCNN significantly improved the performance of the DLSTM-based recognition model, with an accuracy of 82.1% at the syllable level and 97.4% at the speaker level.
Other varieties of interfacing have been pursued to support servo control for therapy devices, such as studies focusing on bio-electric sensors using electromyography (EMG), electroencephalography (EEG), and electrocardiography (ECG), and physical sensors, e.g., flex sensors, Leap Motion, inertial measurement units (IMU), and even multisensory approaches and inference models on hand rehabilitation devices [24,25,38,39]. Such sensors achieve high accuracy only with multichannel designs that require sensitive, relatively expensive, larger acquisition devices and high computational power. Hence, the speech command recognition (SCR) approach with a lightweight CNN is used in this study for ergonomic control and a smaller footprint, applying the micro-speech method or KWS model [40][41][42][43][44][45][46][47], which uses the STFT to generate spectrogram images as CNN modeling features. This research contributes a detailed development process for dysarthric and non-dysarthric SCR based on a TinyML system; a proposed KWS algorithm model for a limited dataset, with low-cost augmentation using ambient noise and normalized voice data; a trained model converted and inferenced with low computational power on a microcontroller, which wirelessly navigates an underactuated servo motor for ADL activity on a therapy device simulated with a hand prosthetic; and a pre-built Bahasa Indonesia corpus for a dysarthric speech dataset. This paper is organized into the following sections in sequence. The hardware uses an ESP32 microcontroller [48][49][50][51][52][53][54][55] and a single-channel INMP441 microphone module. The model is generated and converted to tiny machine learning (TinyML) requirements on the ESP32 microcontroller using a Jupyter notebook and the Python language, along with supporting libraries and modules. The button control design uses two buttons with internal pull-ups on the ESP32, configured through C/C++ programming, with a normally closed button and a loop system to control the servo per increment continuously. The inference model and final firmware for the ESP32 microcontroller are also compiled in C/C++. The compiled firmware was developed using a transfer-learning method from an open framework and the pre-trained micro_speech firmware from TensorFlow [42].

A. Preparation of the Dataset
This study's SCR approach to dysarthric speech, shown in Fig. 2 and following Refs. [16,17], used Indonesian vowel sound recordings, namely ''eee'' and ''iii'', from two healthy subjects mimicking a dysarthric way of speech. The vowels ''eee'' and ''iii'' were selected to reduce false negatives and false positives between vowel phonemes whose production probability or resonant voice characteristics are close to those of other phonemic sounds, such as ''eee'' and ''aaa'' or ''eee'' and ''ooo'', based on linguistic research on the Vowel Space Area (VSA) conducted by a previous researcher [71]. The recording was performed with parameters based on the speech_commands v2 dataset [72], as shown in Table I. Vowel recordings were taken 100 times each, for a total of 500 recordings of dysarthric vowel-mimicking speech: ''aaa'', ''iii'', ''uuu'', ''eee'', and ''ooo''. The audio format was *.wav, 16-bit, mono, with a 16 kHz sampling rate, 1 s duration, at around 35 dB of noise, recorded using a smartphone. The final dataset was grouped into one folder with an 80:10:10% ratio for training, validation, and testing, and extracted into spectrograms as image transformations of the sound waves for CNN modeling features. The data comprise 200 recordings of the vowels ''eee'' and ''iii'' from the TORGO dataset and a single subject mimicking dysarthric speech; two types of speech from speech_commands v2, totaling 3,221 recordings, namely ''forward'' and ''backward''; six types of background noise in addition to data augmentation; and 33 other types of speech, totaling 102,608 recordings from the speech_commands v2 folders, forming the invalid class. The total dataset used for model training is 106,035 recordings with 5 target classes, namely the vowels ''eee'' and ''iii'', ''forward'', ''backward'', and invalid; overall information on the dataset variables is shown in Table II.

B. Short-Time Fourier Transform (STFT)
The Short-Time Fourier Transform (STFT) builds on the Fourier transform, a mathematical transformation that decomposes a function in the time domain into its constituent frequencies [73,74,75,76]. This transformation is widely used, especially in signal processing [77]. Defined in equation (1), the Fourier transform maps a function in the time domain, f(t), into another function in the frequency domain, F(ω).
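Equation (1) itself was lost in extraction; the standard continuous Fourier transform consistent with the description above is:

```latex
F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-j\omega t}\, dt \qquad (1)
```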
Other transformations related to the Fourier transform are also used to convert from the time to the frequency domain for both continuous and discrete data [78]. The STFT is a discrete Fourier transform (DFT) used to determine the sinusoidal frequency in a local part of the signal as the signal changes over time. In other words, the STFT is a Fourier transform of a windowed signal, as shown in Fig. 3.
The STFT provides localized (in time) information about the frequency components, in contrast to the standard Fourier transform, which aggregates frequency information over the whole time interval [74]. The STFT is formulated as the product of the signal x(t) with a weighting function called the window, where m is the spectro-temporal index, as in equation (2).
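Equation (2) is also missing from the extracted text; the usual discrete STFT matching the symbols defined below (signal x, window w, time index n, sampling rate N, spectro-temporal index m) is:

```latex
X(m, k) = \sum_{n=0}^{N-1} x(n)\, w(n - m)\, e^{-j 2\pi k n / N} \qquad (2)
```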
where w = Hamming window, n = time index, N = sampling rate. In application, equation (2) converts samples of the sound wave in the time domain into the joint time-frequency domain, decomposing the frequency bins into spectral magnitudes represented as a 2-dimensional color output. This spectrogram, treated as an ordinary image, is used to build the short-speech recognition system with the CNN model. In this study, the STFT computation was carried out using the NumPy and SciPy libraries in Python, with w(n) configured as a Hamming window, a 16 kHz sampling rate, a window size of 320 points (20 ms of amplitude per frame block), and 50% overlap per frame, giving a stride of 160 amplitude points, i.e., one frequency bin sampled every 10 ms. Each configuration windows the internal time of the voice signal over 1 second. The frequency bins are transformed by the STFT algorithm into spectral magnitudes in the form of a spectrogram. As in Fig. 4, downsampling of the logarithmically scaled spectrogram (log-spectrogram) was employed: average pooling and pixel reduction with a 1x6 kernel reduced the initial spectrogram to 99x43 pixels as the primary input to the CNN model. This preprocessing method follows previous research and has proven effective at fitting the spectrogram feature to a small form factor [79][80][81][82][83][84][85].
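The configuration described above (Hamming window, 320-point frames, 160-point stride at 16 kHz) can be sketched with SciPy; the synthetic 1 s tone is a stand-in for a recorded vowel, and the final average pooling to 99x43 pixels is omitted:

```python
import numpy as np
from scipy.signal import stft

FS = 16000                                 # 16 kHz sampling rate, as in the dataset
t = np.arange(FS) / FS                     # 1 s of samples
x = 0.5 * np.sin(2 * np.pi * 440 * t)      # synthetic tone standing in for a vowel

# 320-point Hamming window (20 ms) with 50% overlap -> 160-point (10 ms) stride
f, frames, Zxx = stft(x, fs=FS, window="hamming", nperseg=320, noverlap=160)

# log-scaled magnitude spectrogram, as fed (after pooling) to the CNN
log_spec = np.log(np.abs(Zxx) + 1e-6)
print(log_spec.shape)                      # (frequency bins, time frames)
```

With these parameters the raw spectrogram has 161 frequency bins and roughly one frame per 10 ms, which is then pooled down before entering the model.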
Bambang Riyanta, Development of Speech Command Control Based TinyML System for Post-Stroke Dysarthria Therapy Device

C. Convolutional Neural Network
A lightweight CNN architecture is a deep learning approach commonly used for image data processing in a small-form-factor neural net [86][87][88][89][90]. The architecture used in this research, shown in Fig. 5, consists of convolutions, neurons with weights and biases, backpropagation, and activation functions. The CNN method has two stages: feedforward, and a learning stage using the backpropagation neural network (BPNN). The CNN algorithm is similar to a multilayer perceptron (MLP), but each neuron in a CNN is presented in two-dimensional form, whereas each MLP neuron has only one dimension. In this study, the architecture model was developed in a Jupyter notebook with the TensorFlow and Keras libraries, inspired by a pre-trained model from TensorFlow [42]. Fig. 6 is the flowchart of the 2DCNN architecture, which uses a sequential method with several layers: 2 convolution layers with 4 filters and a 3x3 kernel, 2 max-pooling layers for downsampling the output shape with a 2x2 pool size and ReLU activation, then a fully connected (MLP) stage comprising a flatten layer, a dense layer of 80 nodes with a dropout of about 10% (0.1) of the neuron weights, a Softmax activation function, and finally the BPNN layer.
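A minimal Keras sketch of the layer sequence just described (layer sizes follow the text; the L2 regularizer strength is an assumed placeholder, and this is an illustrative reconstruction rather than the authors' exact code):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 5  # ''eee'', ''iii'', ''forward'', ''backward'', invalid

model = models.Sequential([
    layers.Input(shape=(99, 43, 1)),             # pooled log-spectrogram input
    layers.Conv2D(4, (3, 3), padding="same", activation="relu",
                  kernel_regularizer=tf.keras.regularizers.l2(1e-3)),
    layers.MaxPooling2D((2, 2)),                 # 99x43 -> 49x21
    layers.Conv2D(4, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),                 # 49x21 -> 24x10
    layers.Flatten(),
    layers.Dense(80, activation="relu"),         # 80-unit fully connected layer
    layers.Dropout(0.1),                         # drop ~10% of weights in training
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Training then proceeds with `model.fit(...)` over 10 epochs in batches of 30, as stated in the text.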
Dataset augmentation and configuration of the MLP use kernel tuning to determine and limit the neuron weight values in the convolution process (the weight initializer); one such measure is regulating the kernel weights or filters with an L2 regularizer, using a threshold with a 1x1 kernel size in the convolution process, in addition to preventing overfitting [91].
The spectrogram image in the final dataset, the feature for modeling, is 99 x 43 pixels. The model is then fit for 10 epochs in batches of 30. The following summarizes the algorithm sequence:
• Input feature: the input to the CNN architecture is a spectrogram image of [99 x 43] pixels. The input image represents the voice signal taken from the STFT operation with a total duration of 1 s and the previously described dataset parameters.
• First convolution layer: in this study, every convolution is performed single-channel in each layer to produce a feature map, following the mathematical model in equation (4). In the first convolution layer, the input image is convolved by a kernel of dimension [1 x 1], with 4 filters, stride [1 x 1], and "same" padding with a value of 0. With "same" padding, the system applies zero padding to the input matrix.
The value of the zero padding can be determined using equation (5).
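Equation (5) is not reproduced in the extracted text; for "same" padding with kernel size K and stride 1, the usual zero-padding value is:

```latex
P = \frac{K - 1}{2} \qquad (5)
```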
Given the filter size used with the L2 regularization, the convolution output shape follows equation (6) with a stride of [1 x 1]:

O = ((W − K + 2P) / S) + 1   (6)

where O is the output size, W the input size, K the kernel size, P the padding, and S the stride. The output shape of the first convolution layer is therefore the same as the input image, namely [99 x 43]. The activation function is a ReLU layer, which sets negative values in the convolution output to zero to introduce non-linearity.
• First max-pooling layer: the output shape of the first convolution is the input shape of the first max-pooling layer. In this layer the image is padded with 0; the padding value can be calculated with the same equation used in the first convolution layer. The output size of the max-pooling layer follows equations (6) and (7), giving [49 x 21].
• Second convolution layer: the pooling result, an image matrix of [49 x 21], is input with the same configuration as the first convolution, so the output shape remains [49 x 21].
• Second max-pooling layer: the output shape of the second convolution is the input shape; using the same configuration as the first max-pooling produces an output shape of [24 x 10].
• Finally, the fully connected layer calculates the final dot products, weights, and biases using the BPNN technique, with a cross-entropy loss and stochastic gradient descent (SGD) operations, as well as a weight transformation with the flatten layer, used at this stage to convert the pooling-layer output into a 1-dimensional vector. In the propagation and classification (image prediction) process, a dropout regulation technique randomly selects several neurons to be excluded during training; in other words, neurons are discarded at random. This reduces overfitting during training [91,92,93,94]. A further fully connected dense layer uses 80 nodes, and the last layer uses the Softmax activation function to calculate the probability of the input image against all possible target classes, determining the predicted class for the given input.
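The shape bookkeeping in the steps above can be checked with a few lines of arithmetic (the factor of 4 is the filter count from the architecture description):

```python
def conv_same(size):
    # "same"-padded convolution with stride 1 keeps the spatial size
    return size

def maxpool(size, pool=2):
    # non-overlapping 2x2 pooling floors the size
    return size // pool

h, w = 99, 43                       # input log-spectrogram
h, w = conv_same(h), conv_same(w)   # first convolution: 99 x 43
h, w = maxpool(h), maxpool(w)       # first max-pooling: 49 x 21
h, w = conv_same(h), conv_same(w)   # second convolution: 49 x 21
h, w = maxpool(h), maxpool(w)       # second max-pooling: 24 x 10
flat = h * w * 4                    # flatten with 4 filters
print(h, w, flat)
```

The flattened vector of 960 values is what feeds the 80-node dense layer.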

D. Model Conversion and Quantization
The conversion process is required to run inference with a pre-trained model on a compact embedded microcontroller. This study used the library and API interpreter provided by TensorFlow, which adjusts the data type to a C array on the ESP32 and stores the converted model in the ESP32 read-only memory (ROM), so that after quantization and transformation with the TensorFlow Lite converter for microcontrollers, only a tiny amount of memory is needed [42]. The model is quantized once to 8-bit according to the capacity of the ESP32. After conversion and quantization, the model weight values are hex code in a C array with a .cc format.
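A hedged sketch of the conversion and one-shot 8-bit quantization step: the tiny stand-in model and random representative data below are placeholders for the trained 2DCNN and its spectrogram dataset, not the authors' actual pipeline.

```python
import numpy as np
import tensorflow as tf

# stand-in for the trained 2DCNN; replace with the real model
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(99, 43, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(5, activation="softmax"),
])

def representative_data():
    # placeholder calibration spectrograms; real calibration uses training samples
    for _ in range(10):
        yield [np.random.rand(1, 99, 43, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8    # full-integer model for the MCU
converter.inference_output_type = tf.int8
tflite_model = converter.convert()

# the flatbuffer bytes are then dumped to a C array,
# e.g. `xxd -i model.tflite > model.cc`, for the ESP32 firmware
print(len(tflite_model), "bytes")
```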

E. Button Control
Button control uses two tactile (momentary) buttons with internal pull-ups on the ESP32, connected to 2 analog GPIOs on the ESP32 with an 8-bit unsigned char integer data type (0-255), a normally open gate, and a for-loop sequence for 0-180-degree servo increments. The button-sequence sub-program is combined with the final stage of the TinyML firmware.

F. Wireless Protocol with ESP-NOW
Wireless communication is provided by the ESP32 using Wi-Fi infrastructure, with an organization identifier marked by the MAC address of the microcontroller device, a protocol known as ESP-NOW; peer-to-peer communication therefore requires registering the microcontroller devices with each other, enabling an efficient pairing process. The control topology used in this study is unicast, as shown in Fig. 7.

G. Final Embedded TinyML Master-Slave
The final multi-control firmware uses 2 separate firmwares, master and slave, one for each of the 2 ESP32s. The ESP32-Master, as shown in Fig. 8, is installed on the remote control; its firmware combines button control and SCR into one framework, with subprograms, headers, and supporting libraries, namely the ESP-NOW master register-peer and ESP-NOW slave callback, FreeRTOS xTask with semaphore, pulse width modulation (PWM) servo, and the analog-to-digital (ADC) compiler. The ESP32-Slave, as shown in Fig. 9, is mounted on the servo controller, in this case substituting the flexion and extension movement patterns of the Exo-glove poly with a hand prosthetic and the servo mechanism from Proto-1 [95], using an underactuated pulley mechanism with a DC servo as an artificial MCP joint to articulate finger flexion and extension, similar to the Exo-glove poly servo mechanism [96,97,98,99]. The final firmware is built in C++ with a .cpp extension, like the pre-trained model design. The ESP32-Slave acts as a receiver that calls back the data threshold sent by the ESP32-Master, then drives HIGH or LOW the 4 GPIO output pins connected to the 4 servos running in series, based on the respective threshold-activation xTasks.

A. Spectrogram Image Result
Only the initial and final stages of the STFT mechanism are shown for the data in this study: the signals in the time domain and the results of the STFT signal transformation into the feature model over a total time of 1 second. As shown in the graphs below, the feature representation as a spectrogram was computed without downsampling, and can be identified by its spectral pattern or from the pixel image values for the ML model. The characteristics of each signal differ in the envelope amplitude values and frequency bins, so that the convolution layers, the BPNN learning in the MLP (fully connected) layers, and the output layer obtain a prediction of the neuron weights for each respective target class.
The extracted speech for the word "forward" is shown in Fig. 10. Viewed from the spectrogram image, the high-amplitude activity lies in the 50-2000 Hz frequency-bin interval at 200-550 ms. The second speech sample lies in the 50-2000 Hz frequency-bin interval at 200-850 ms. There is a slight difference in the magnitude of the amplitude with respect to time, but they still look identical. Meanwhile, other amplitude activity at different times also represents the characteristics of the word "forward" when the spectrogram is computed for modeling. Thus, comparing the 2 samples of "forward" in the plots of spectrogram-1 and spectrogram-2, they are visually different yet can be spotted even without ML modeling. The characteristics of the sound signal can be seen in the output shape. The first sample of the dysarthric vowel speech ''eee'', shown in Fig. 11, cannot be easily distinguished by pattern from the non-dysarthric vowel speech. In Fig. 11, the pattern looks constant because single phonemes are pronounced steadily or continuously at a particular frequency level, with a vocal intensity that can be aligned, depending on the speaker's condition. However, the amplitude activity in the first speech produces the highest frequency in the 10-256 Hz bin interval at 200-800 ms. The second speech sample lies in the 10-200 Hz bin interval at 20-80 ms. The slight difference in amplitude at a particular frequency indicates the speaker's intensity, but identical amplitude activity can be seen between frequencies of 256 Hz and 3 kHz.
For the dysarthric vowel speech ''iii'', shown in Fig. 12, the high-amplitude activity lies in the 50-400 Hz frequency-bin interval at 200-800 ms. The 2nd speech sample lies in the same 50-400 Hz interval at 200-800 ms. The pattern for non-dysarthric speech can be distinguished without scaling at a particular frequency, but the dysarthric vowel speech has a distinctive identity in each frequency bin when viewed with specific logarithmic scaling, as shown in Fig. 13.

Table III shows the accuracy and loss values, with "validation-loss" and "validation-accuracy" for the validation data, and indicates that the modeling with the architecture used reaches a highest accuracy of 94% with a smallest loss of 3.2%, meaning the model does not appear to overfit on the training results. The phenomenon is, however, declared overfitting in the testing process with identical new data: the model has memorized the speaker's subject too closely, a phenomenon often called audio fingerprinting. For the non-dysarthric speech utterances, with better quantitative and qualitative datasets taken directly from the TensorFlow library dataset (speech_commands v2), the accuracy of the prediction results is very consistent, in the 93-97% range with 1-7% spread.
The accuracy comparison in Table IV shows that the proposed architecture with a limited dataset performs reliably. Moreover, the proposed model accepts both dysarthric vowel speech and non-dysarthric command words in this study's implementation, and performs well according to the tables in the following section.

D. TinyML Controller Speech Test Results
The speech test was conducted with noise measured by a sound pressure level (SPL) meter application, the "Decibel X" app available on the Play Store, whose sensitivity and calibration performance is relatively representative for indoor scenarios, at levels of 24 dB, 42 dB, and 62 dB, as shown in Table V, Table VI, and Table VII. The distance and position in this test were around ± 25 cm from the sound source, with a speech intensity of 60-70 dB. Proto-1 performs the flexion and extension servo activities, substituting for the Exo-glove during a duty cycle of fully opening and closing the hand. In the first test, a noise level of about 24 dB in a closed-room scenario, the system performed well, with accuracy consistently around 85-100%. In the second scenario, about 42 dB of environmental noise around the house, such as motorbikes, birds chirping, kitchen activities, and other natural noises, speech recognition accuracy decreased by about 10 percentage points from the previous value. The 62 dB noise scenario, tested amid a crowd of conversations, showed a performance decline of up to 70 percentage points, so the SCR system designed in this research is, for now, better applied at noise levels ranging from 20-42 dB.

E. Button Test Results
Testing of the button controller used an on/off sequence that moves the Proto-1 servos by 5 increments (5 degrees) per press, observing its activity in a serial monitor along with the response of the data packet transmission, measured on each button press at a 1-meter range. The measured response time of the controller was about 45 ms per increment, and packet transmission activity was observed over a total period of 1000 ms, as shown in Table VIII.

F. Range of Motion (ROM) Proto-1
Fig. 16 shows the ROM activities in this study, carried out to indicate the projected performance of flexion and extension using the hand prosthetic and to validate the implementation of the multi-control system with servo performance for grip activities on selected objects according to the study [9].

V. CONCLUSION
The speech recognition system for dysarthric and non-dysarthric speech uses the CNN-STFT technique to represent the sound signal captured by the microphone sensor, processing it into a log-spectrogram image that is classified into image patterns by the CNN algorithm. The total dataset used was 102,124 recordings, combining the speech_commands v2 dataset with a small dataset of dysarthric speakers. Testing accuracy for dysarthric and non-dysarthric speech control is quite good: ''forward'' speech obtained 95% accuracy and ''backward'' 97%, while the dysarthric vowels ''eee'' and ''iii'' both showed an overfitting phenomenon at 100% (overfit) because of the limited quantity, quality, and augmentation of the small dataset, yet performed quite well in real-life testing. Implementing the neural-net model on the ESP32 with the INMP441 microphone module as voice-signal input was conducted under three noise-level scenarios of 24 dB, 42 dB, and 62 dB, each tested 20 times, producing average accuracies of 95%, 85%, and 50%, respectively. For the wireless data distribution response with ESP-NOW using button control, the on-off sequence on the remote took about 60 ms, while a full duty cycle took about 800 ms. For the hardware or electrical circuit design, there is a bouncing phenomenon in the momentary buttons; future work should use filters in the circuit to reduce it, add an H-bridge for efficient servo output current, and use PID or fuzzy logic control for the safety of the servo's open-loop mechanism. In feature or dataset engineering, overfitting occurs in modeling, especially with large dysarthric spoken-language datasets, so increasing the quantity and quality of datasets, such as using datasets from severe subjects or related open datasets, and performing preprocessing variations such as Mel spectrograms, log-Mel spectrograms, L2M spectrograms, Chroma, and others, can serve as comparisons for upcoming research.

Fig. 3 .
Fig. 3. Time-domain voice signal transformation with the STFT using a Hamming window

Fig. 14 .
Fig. 14. (a) The proposed model's training accuracy (accuracy vs epochs). (b) Model loss (loss vs epochs)

C. Testing 2DCNN Modelling Result
As mentioned in the dataset preparation subsection, the training, validation, and testing data were divided in an 80:10:10% ratio. However, in this research,

Fig. 15 .
Fig. 15. The confusion matrix of the testing model results. The y-axis represents the true label and the x-axis the predicted label for the SCR classes

Fig. 16 .
Fig. 16. Photographs of ROM activities, demonstrating Proto-1 grasping objects of various shapes: (a-d) grasping a Qur'an, a metal mug, a deodorant, and a comb

TABLE I .
COMMAND WORDS AND DATASET RATIO

TABLE III .
OBTAINED ACCURACY AND VALIDATION ACCURACY OF PROPOSED 2DCNN MODEL

TABLE IV .
TRAINING ACCURACY COMPARISON FROM PROPOSED METHOD WITH EXISTING LITERATURE

TABLE V .
SPEECH TEST WITH NOISE LEVEL ±24DB -ENCLOSED ROOM

TABLE VI .
SPEECH TEST WITH NOISE LEVEL ±42DB -HOUSEHOLD

TABLE VII .
SPEECH TEST WITH NOISE LEVEL ±62DB - CROWD IN THE ROOM (columns: Command Word, No. of tests, True, Percentage (%))

The button bounce phenomenon (Table IX) was recorded by observing packet-size activity per 1000 ms and showed no significant bounce. The number of button presses measured for a complete flexion or extension movement, a 100% duty cycle, was 12; the maximum bounce was around 20 increments, or 4 extra trigger-button sequences, but did not cause high spikes, peaking, or ripple in the servo voltage. Meanwhile, the data transmission parameter measured in this study was limited to 100-111 complete packets/second.

TABLE VIII .
WIRELESS BOUNCING BUTTON TEST -FLEXION SEQUENCE