Towards Controlling Mobile Robot Using Upper Human Body Gesture Based on Convolutional Neural Network

Abstract: Human-Robot Interaction (HRI) faces challenges in the investigation of nonverbal and natural interaction. This study contributes a gesture recognition system capable of recognizing the entire human upper body for HRI, which previous research has not addressed. Preprocessing, consisting of color segmentation, thresholding, and resizing, is applied to improve image quality, reduce noise, and highlight the important features of each image. The hue, saturation, value (HSV) color segmentation uses a blue backdrop and additional lighting to deal with illumination issues. Thresholding then produces a black-and-white image that separates the background from the foreground. Resizing adjusts the image to the size expected by the model. The preprocessed image is used as input for gesture recognition based on a Convolutional Neural Network (CNN). This study recorded five gestures from five research subjects of different gender and body posture, for a total of 450 images divided into 380 and 70 images for training and testing, respectively. Experiments performed in an indoor environment showed that the CNN achieved 92% accuracy in gesture recognition. This accuracy is lower than that of the AlexNet model, but training was about 9 seconds faster. The result was obtained by testing the system over various distances. The optimal distance between the camera and the user for interacting with the mobile robot by gesture was 2.5 m. In future research, the proposed method will be improved and implemented for mobile robot motion control.


I. INTRODUCTION
Human-Robot Interaction (HRI) is a field of research on how to develop systems that allow robots to interact with humans in human environments. HRI faces the challenge of developing nonverbal interactions with a natural approach that must be carried out in real time. Several previous studies have investigated HRI [1]. One of the problems to be solved for a mobile robot is how to control the position and orientation of its wheels [2]. Control technology has allowed humans to manage robot movement by using a joystick [3]-[14] or an Android-based smart device [15]-[32]. Another approach to controlling robot motion is to use computer vision technology [33]-[37].
Computer vision technology has emerged as a solution for developing motion control of a mobile robot based on perception feedback from a camera. Computer vision used in robotics, known as robot vision, has been explored in previous research to solve the position and orientation control of mobile robots. A Kinect sensor was investigated in [33] to control the motion of an omnidirectional three-wheeled vacuum cleaning mobile robot by interpreting distance and heading angle from RGB-D images. Another study on motion control of an omnidirectional mobile robot utilized landmarks in the environment [34]. The linear and angular velocity of the mobile robot was controlled through distance estimation and landmark recognition based on RGB and depth images from a Kinect sensor. Mobile robot orientation estimation using the progressive probabilistic Hough transform with the law of cosines, the quadrant principle, and a voting mechanism was explored in [35] to control the robot's turning movement. On the other hand, control of the translational motion of a mobile robot was investigated in [36] by exploiting RGB and depth image matching. The RGB-D images came from a depth camera placed on the side of the mobile robot.
Robot vision with a focus on gesture recognition has become a state-of-the-art topic in HRI research [38]-[120]. Applications of gesture recognition include virtual reality [121], augmented reality [122], biomedicine [123], and robotics [37]. Robot control based on gesture recognition has been investigated in [37], [124]-[126]. The movement of the SCORBOT-ER 9 Pro robot arm manipulator was controlled in [37] by detecting human body landmarks. The joints of the human arm were proposed as landmarks to enable the robot arm to imitate the user's action. A geometric algorithm in the form of the law of cosines was used to obtain the angle of each robot joint. Gestures were recognized by detecting the landmarks represented by the configuration of joints in the human arm.
In recent years, Convolutional Neural Networks (CNNs) have developed rapidly and made a positive impression on the fields of image classification [127] and robot vision [128]. In [129], a CNN was used for image recognition on the US postal handwritten digit dataset. Researchers carried out a series of processes to improve the performance of the Faster R-CNN algorithm in recognizing hand postures from the NUS dataset [130]. In that work, gestures were formed with one hand against a complex background with human noise. For controlling mobile robot motion based on human intention, it is desirable to investigate a natural way of interaction. The objective of this research is to investigate a gesture recognition system that identifies the entire human upper body, not just a one-hand posture as in previous research, in order to move mobile robots using a natural interaction signal.
Motivated by the challenge of simplifying human-robot interaction, this research aims to utilize nonverbal and natural interaction. The three contributions of this study are the development of an upper body gesture dataset consisting of 450 images recorded from five research subjects of different gender and body posture, preprocessing using hue, saturation, value (HSV) color segmentation, and a CNN architecture for gesture recognition.

II. METHOD
The research methodology of this study follows the flow shown in Fig. 1. It begins with the first contribution of this study, the development of a gesture dataset consisting of images of the human upper body.

A. Development of Gesture Dataset
The dataset contains 450 images, split 80% and 20% for training and testing, respectively. The gesture dataset is recorded from the upper body gestures of five people of different gender and body posture and is classified into 5 motions. The image recording process begins with taking pictures of body gestures using a webcam in an environment arranged as described in Fig. 2. Then, preprocessing based on HSV color segmentation is applied to the images of the human upper body gesture dataset.

B. Preprocessing
The sequential preprocessing steps aim to improve image quality by reducing noise and highlighting the important features of the image. This study proposes a series of processes consisting of color segmentation of the background, thresholding, and reducing the size of the image to match the model input, as depicted in Fig. 3. The RGB image recorded by the webcam is converted to HSV. The blue backdrop is used to facilitate color segmentation of the HSV image. An HSV trackbar is used to search for the background color. The range of minimum and maximum values for hue, saturation, and value is set according to the brightness of the light at the time of testing. The change of the image from RGB to binary during preprocessing is illustrated in a separate figure.
Labelling is carried out after the image data has been obtained. For the model to learn and identify patterns or relationships between input and output features appropriately, correct labels must be provided. More precisely, data labelling helps the model learn the rules and correlations existing in the data. Here, data labelling saves each preprocessed image into its class directory. The Python code for the data labelling implementation is shown in Fig. 5.
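The segmentation, thresholding, and resizing steps can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation (which would more likely rely on an image library such as OpenCV). The HSV bounds, the choice of white for the foreground, and the function names `is_backdrop`, `segment`, and `resize_nearest` are assumptions made for the example. The tiny 4×4 frame is resized to 2×2 only to keep the example readable, whereas the model described later expects 100×100.

```python
import colorsys

# Hypothetical HSV bounds for a blue backdrop (hue in degrees, S and V in 0..1).
# In the paper these ranges are tuned by hand with a trackbar; the numbers here
# are placeholders, not the authors' settings.
LOWER = (200.0, 0.4, 0.2)
UPPER = (260.0, 1.0, 1.0)


def is_backdrop(rgb):
    """Return True if an (R, G, B) pixel falls inside the backdrop HSV range."""
    r, g, b = (channel / 255.0 for channel in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    h *= 360.0  # colorsys returns hue in 0..1
    return (LOWER[0] <= h <= UPPER[0]
            and LOWER[1] <= s <= UPPER[1]
            and LOWER[2] <= v <= UPPER[2])


def segment(image):
    """Threshold to a binary image: 0 for backdrop, 255 for foreground (assumed)."""
    return [[0 if is_backdrop(px) else 255 for px in row] for row in image]


def resize_nearest(binary, out_h, out_w):
    """Nearest-neighbour resize of a 2D list to out_h x out_w."""
    in_h, in_w = len(binary), len(binary[0])
    return [[binary[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


B = (0, 0, 255)      # backdrop-like blue pixel
S = (200, 150, 120)  # skin-like foreground pixel
frame = [
    [B, B, S, B],
    [B, S, S, B],
    [B, S, S, B],
    [B, B, B, B],
]

mask = segment(frame)
small = resize_nearest(mask, 2, 2)
print(mask)
print(small)
```

Keeping the bounds as named constants mirrors the manual trackbar tuning described above: only `LOWER` and `UPPER` need to change when the lighting changes.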

C. The Architecture of CNN
This stage discusses the design of the CNN system for body gesture classification. A CNN is a type of deep learning architecture that is effective in processing image data, with high pattern recognition capability. The CNN system flowchart is shown in Fig. 6. Based on Fig. 6, the CNN process begins by reading the training data. Then the webcam captures image data of the operator's gestures, and a series of preprocessing steps is carried out to adapt the data to the CNN model created. The feature extraction process is carried out according to the previously trained model.
In creating a CNN architecture, several parameters need to be considered. The following are several parameters of the CNN architecture:

1. Filters
Filters in a CNN are also known as convolution filters. Convolution filters are multidimensional arrays used in convolution layers to help extract certain features from the input data. The features detected, such as edges, curves, and shapes, can be learned by the CNN. There are also various predefined filters, such as Gaussian blur and the Prewitt filter.

Padding
Padding is the process of adding zeros symmetrically around the input matrix. Without padding, the output of each filter is smaller than the input. Padding is therefore needed to keep the output dimensions the same as the input.

Strides
Stride is the number of pixels the filter moves at each step of the convolution. By default, the stride is 1: the filter shifts 1 pixel horizontally and then vertically. The smaller the stride, the more detailed the information obtained from the input, but the more computation is required compared to a large stride.
A simple CNN is formed from a series of layers, each converting one activation volume into another through a different function. The following are several layers in the CNN architecture:

Input Layers
The input layer of a CNN is image data represented by a three-dimensional matrix of height, width, and number of image channels. The values in this three-dimensional matrix are then processed by convolution.

Convolutional Layers
The convolutional layer is the first layer that transforms the input using the values in the filter. At this stage, a convolution is applied to the output of the previous layer. A patch of the image is combined with a filter according to the kernel size used, the filter is shifted by a predetermined stride, and the same operation is performed again. The process repeats until the filter has covered all of the image data. The output of the convolutional layer is the input to the next layer. The output is influenced by parameters such as stride and zero padding.
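To make the sliding-window operation concrete, the following sketch applies a single filter to a small 2D input with configurable stride and zero padding. It is a plain-Python illustration under stated assumptions, not the authors' code: the function name `conv2d` and the 3×3 edge-like kernel are hypothetical, and, as in most deep learning libraries, the kernel is applied without flipping (cross-correlation).

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide a square kernel over a zero-padded 2D image (lists of lists)."""
    k = len(kernel)
    h, w = len(image), len(image[0])

    # Surround the image with `padding` rows and columns of zeros.
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]

    out_h = (h - k + 2 * padding) // stride + 1
    out_w = (w - k + 2 * padding) // stride + 1

    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the current patch element-wise with the kernel and sum.
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    acc += padded[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(acc)
        output.append(row)
    return output


image = [[1, 2, 3, 0],
         [4, 5, 6, 0],
         [7, 8, 9, 0],
         [0, 0, 0, 0]]

# A hypothetical 3x3 vertical-edge kernel, chosen only for illustration.
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

same = conv2d(image, kernel, stride=1, padding=1)   # output stays 4x4
valid = conv2d(image, kernel, stride=1, padding=0)  # output shrinks to 2x2
print(valid)
```

With a padding of 1 the 4×4 input keeps its size, while without padding it shrinks to 2×2, which is the effect of padding described in the parameter list above.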

Pooling Layer
Pooling reduces the spatial size of the input, and hence the number of parameters, through down-sampling. The dimensions of each feature map are reduced by pooling while the important information is retained. Pooling layers can also speed up computation across the convolutional layers. The most common form of pooling layer uses a 2×2 filter with a stride of 2. There are two pooling methods, namely max pooling and average pooling.
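The difference between the two pooling methods is easiest to see on a small feature map. The sketch below is an illustrative plain-Python version with a 2×2 window and a stride of 2. The helper name `pool2d` and the sample values are assumptions for the example.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Down-sample a 2D feature map with max or average pooling."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, h - size + 1, stride):
        row = []
        for j in range(0, w - size + 1, stride):
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


feature_map = [[1, 3, 2, 0],
               [5, 7, 1, 1],
               [0, 2, 4, 8],
               [6, 2, 0, 4]]

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled)  # [[7, 2], [6, 8]]
print(avg_pooled)  # [[4.0, 1.0], [2.5, 4.0]]
```

Both variants halve each spatial dimension. Max pooling keeps the strongest response in each window, while average pooling smooths it.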

Flattening Layers
After the feature extraction stage is complete, the CNN proceeds to its final stage, classification. The first classification step is flattening, which creates a 1D vector that stores the output of the previous layer. Like other classifiers, the fully connected stage requires its input as a vector of individual features. The output of the feature extraction part of the CNN, originally a 3D matrix, must therefore be converted into a 1D vector so that it can be used in the fully connected stage.
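As a small illustration of flattening, the sketch below unrolls a height × width × channels volume into a single list. The function name `flatten` and the 2×2×3 sample volume are chosen for the example only, and the element ordering (row by row, then channel) is an assumption.

```python
def flatten(volume):
    """Flatten a height x width x channels nested list into a 1D list."""
    return [value for row in volume for pixel in row for value in pixel]


# A tiny 2x2 feature volume with 3 channels (values are illustrative only).
volume = [[[1, 2, 3], [4, 5, 6]],
          [[7, 8, 9], [10, 11, 12]]]

vector = flatten(volume)
print(len(vector), vector)
```

The length of the resulting vector is always height × width × channels, which fixes the number of inputs of the first fully connected layer.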

Fully Connected Layer
A fully connected layer is a basic structure in a neural network that connects every neuron in one layer to all neurons in the next layer. Fully connected layers involve weights, biases, and neurons. These layers are trained to classify images into different categories. This type of layer is usually placed at the end of the network. The output layer uses a softmax activation function for classification. The softmax activation function is suited to multiclass neural network learning and image classification problems with a set of pixels as input. The output of softmax represents a probability distribution over the classes. Softmax assigns each label a probability between 0 and 1, and the probabilities sum to 1.
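The two stated properties of softmax (values between 0 and 1, summing to 1) can be checked with a short sketch. The raw scores below are hypothetical, and subtracting the maximum before exponentiation is a common numerical-stability choice rather than something stated in the paper.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs of the last layer for five gesture classes.
scores = [2.0, 1.0, 0.1, -1.0, 0.5]
probabilities = softmax(scores)
print([round(p, 3) for p in probabilities])
print(sum(probabilities))
```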

Outputs
The output is the final layer of the CNN, carrying the information learned through the hidden layers and providing the final result.
The CNN model used consists of 2 convolution layers with the ReLU activation function, 2 max pooling layers, 1 hidden layer with 256 neurons, and an output layer with the softmax activation function. After the output layer, the probability of each class is known. The class with the largest probability is taken as the detected gesture. The program finds this class by sorting the probability values of the classes.
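Selecting the detected gesture by sorting class probabilities might look like the sketch below. The label order follows the class listing given in the Softmax subsection of the training stage (stop, forward, backward, turn right, turn left). The mapping of output indices to labels in the authors' program is an assumption, as are the function name `detect_gesture` and the example probabilities.

```python
# Class order as listed in the paper; the index mapping is an assumption.
GESTURES = ["stop", "forward", "backward", "turn right", "turn left"]


def detect_gesture(probabilities, labels=GESTURES):
    """Return the label with the largest probability, found by sorting."""
    ranked = sorted(zip(probabilities, labels), reverse=True)
    best_probability, best_label = ranked[0]
    return best_label, best_probability


# Hypothetical softmax output for one webcam frame.
label, probability = detect_gesture([0.05, 0.10, 0.02, 0.80, 0.03])
print(label, probability)
```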
The architecture of the CNN model created in this study is shown in Fig. 7. It consists of 2 repetitions of the convolution operation.
The first convolution operation uses 32 filters with a 3×3 kernel and a stride of 1. Then the Rectified Linear Unit (ReLU) function sets the negative outputs of the operation to 0. Next, the image data is reduced by down-sampling with pooling. The pooling layer reduces the spatial size of the data by taking the maximum value within a 2×2 kernel. This process is repeated to obtain better feature extraction.
Feature extraction in a CNN is optimized by designing and setting the architectural parameters so that the most relevant features can be extracted from the image. The goal of this optimization is to improve the performance and efficiency of the model in the intended image classification task. Feature extraction in a CNN generally occurs repeatedly by applying many filters. In this research, features of the image data have already been extracted through preprocessing. Further repetition of feature extraction is therefore unnecessary, because the distinguishing characteristics of each image have already been extracted.
Before gesture recognition can be carried out, the detection system must be trained; the result of this training is the CNN model. Training was carried out on 80% of the gesture dataset, which contains a total of 450 images from five participants, each demonstrating five gestures. Training was run on a computer with an AMD Ryzen 5 3500U quad-core 64-bit processor, a Radeon Vega 8 GPU, 8 GB of RAM, and a Realtek RTL8723DE 802.11b/g/n networking and Bluetooth® 4.2 combo.

D. Training Stage
A gesture is a movement composed of various poses. In this research, 5 gestures (forward, backward, stop, turn right, turn left) were chosen as case studies for gesture recognition. Before gesture recognition with the CNN can be carried out, a training process is needed, as shown in Fig. 8.
The CNN model designed requires a black-and-white input image of 100×100 pixels. Feature extraction then follows, with two convolution layers and two pooling layers. The type of pooling used is max pooling, which takes the largest value within the specified kernel window. A visualization of several steps in the CNN is given in Table I. The following are the feature extraction stages.

First Convolution
The image resulting from preprocessing is subjected to convolution. The values obtained from this process are used as input to the ReLU. The height and width of the output of the convolution can be determined using the following equation:
$$O = \frac{I - F + 2P}{S} + 1$$
where $O$ is the size of the output feature map, $I$ is the size of the input feature map, $P$ is the padding size, $F$ is the filter size, and $S$ is the stride size.
In the first convolution, the 100×100 input is processed with a 3×3 filter, a stride of 1, and a padding of 1. This results in feature maps measuring 100×100.
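As a check, the stated values can be substituted into the standard output-size relation for a convolution. The symbol names are chosen here for illustration: $I$ is the input size (100), $F$ the filter size (3), $P$ the padding (1), $S$ the stride (1), and $O$ the resulting output size.

```latex
O = \frac{I - F + 2P}{S} + 1
  = \frac{100 - 3 + 2 \cdot 1}{1} + 1
  = 100
```

The padding of 1 exactly compensates for the two border pixels a 3×3 filter would otherwise remove, so the feature map keeps its 100×100 size.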

ReLU
ReLU is a function that changes the negative values in the output of the first convolutional layer to 0.
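A minimal sketch of this step, applied element-wise to a small feature map, is shown below. The function name `relu` and the sample values are illustrative only.

```python
def relu(feature_map):
    """Replace every negative value in a 2D feature map with 0."""
    return [[max(0.0, value) for value in row] for row in feature_map]


# Hypothetical convolution output containing negative responses.
conv_output = [[-6.0, 15.0],
               [-4.0, 13.0]]

activated = relu(conv_output)
print(activated)  # [[0.0, 15.0], [0.0, 13.0]]
```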

First Pooling
Pooling reduces the size of the matrix. There are two types of pooling, namely max pooling and average pooling. Max pooling takes the largest value in the selected window, while average pooling takes the average value in the selected window. The pooling used is max pooling with a filter size of 2×2 and a stride of 2. In the first pooling, the 100×100 input is processed with a 2×2 filter, a stride of 2, and no padding. It produces feature maps measuring 50×50.

Second Convolution
The image from the first pooling is subjected to a second convolution. The resulting values are used as input to the ReLU. In the second convolution, the 50×50 input is processed with a 3×3 filter, a stride of 1, and a padding of 1. This results in feature maps measuring 50×50.

ReLU
ReLU is a function that changes the negative values in the output of the second convolution to 0.

Second Pooling
Pooling again reduces the size of the matrix. The pooling used is max pooling with a filter size of 2×2 and a stride of 2. In the second pooling, the 50×50 input is processed with a 2×2 filter, a stride of 2, and no padding. It produces feature maps measuring 25×25.
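The spatial sizes quoted for the four feature extraction stages can be traced with one small function. This is a sketch that applies the usual output-size relation to the filter, stride, and padding values stated in the text. The names `output_size` and `layers` are chosen for the example, and only the spatial size is traced because the number of filters in the second convolution is not given.

```python
def output_size(input_size, filter_size, stride, padding):
    """Spatial output size of a convolution or pooling window."""
    return (input_size - filter_size + 2 * padding) // stride + 1


# (name, filter size, stride, padding) for each stage, as stated in the text.
layers = [
    ("conv1", 3, 1, 1),
    ("pool1", 2, 2, 0),
    ("conv2", 3, 1, 1),
    ("pool2", 2, 2, 0),
]

size = 100  # the model expects a 100x100 input image
trace = []
for name, filter_size, stride, padding in layers:
    size = output_size(size, filter_size, stride, padding)
    trace.append((name, size))
    print(f"{name}: {size}x{size}")
```

The trace reproduces the sequence 100, 50, 50, 25 given in the stage descriptions.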

Flattening
Flattening changes the matrix resulting from feature extraction into a single vector, which then becomes the input to the classification part of the CNN.

Fully Connected Layer
The Fully Connected Layer (FCL), also known as a dense layer, is a type of layer generally used in CNN architectures to further process the features extracted by the preceding convolution and pooling layers. The FCL differs from convolution layers in that it has no convolution operations based on kernel matrices. It consists of neurons that are each connected to all neurons in the previous layer. The values obtained from the previous layer undergo further calculation, and an activation function such as ReLU can be applied.
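The calculation inside a fully connected layer is a weighted sum per neuron plus a bias, optionally followed by an activation. The sketch below uses three inputs and two neurons with made-up weights. The function name `dense` and all numeric values are assumptions for illustration, whereas the hidden layer in the paper has 256 neurons.

```python
def dense(inputs, weights, biases, activation=None):
    """Fully connected layer: each neuron sees every input value."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(weighted_sum)
    if activation == "relu":
        outputs = [max(0.0, value) for value in outputs]
    return outputs


# Hypothetical flattened features, weights (one row per neuron), and biases.
features = [1.0, 2.0, 3.0]
weights = [[0.5, -1.0, 0.0],
           [1.0, 1.0, 1.0]]
biases = [0.0, -1.0]

hidden = dense(features, weights, biases, activation="relu")
print(hidden)  # [0.0, 5.0]
```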

Softmax
Softmax is an activation function that distributes probability over the existing classes. There are five neurons representing the classes to be predicted, namely the stop, forward, backward, turn right, and turn left gestures, respectively.
After the training process is complete, training result files with the extensions .json and .h5 are created. In machine learning, these files are commonly used to store and manage CNN training results. The two formats have different purposes: the json file stores the structure, model configuration, and other information related to the architecture of the CNN model, while the h5 file stores the weights and biases of each layer produced during training. These files are used for real-time gesture detection with a webcam in the testing stage, as described in the following section.

E. Testing Stage
At the testing stage, gesture detection with the CNN begins by reading the training result files in json and h5 format. Frames of the human gesture are then captured via a webcam, and a prediction is made from the trained model and the webcam frame. The index with the highest prediction probability is selected as the detected gesture. The gesture detection results, as depicted in Fig. 9, will be used as control commands for the mobile robot.

III. RESULTS AND DISCUSSION
Experiments were carried out to determine how successfully the proposed method recognizes human upper body gestures for controlling the movement of a mobile robot. Gestures were demonstrated by a human operator in front of the webcam at a distance of 2.0 m, as illustrated in Fig. 10. The gestures used for testing are listed in Table II, which presents the overall test results of the gesture-based mobile robot movement control system. Each of the 5 participants demonstrated each of the 5 gestures 10 times. Of the 250 tests carried out, the gesture was successfully recognized in 227 trials and failed to be identified in 23 trials. The accuracy of the system in identifying gestures was therefore 90.8%. Fig. 11 shows the results of tests involving variation in gender and posture size. Small man, big man, small woman, and big woman are the four variations used in the experiments.
This research also includes experiments that examine the impact of the distance between the human user and the camera on the reliability of the proposed method. There are seven test scenarios, with distances between 1.0 m and 4.0 m. Table III summarizes the gesture recognition accuracy at the various distances. The first scenario, at a distance of 1.0 m, reported an accuracy of 4.4%. The second scenario, at 1.5 m, resulted in an average recognition success of 9.0%.
The third scenario, at 2.0 m, resulted in an average recognition success of 90%. The best accuracy was achieved in the fourth scenario, at 2.5 m, with an average recognition success of 92%. The average accuracy decreased to 88% in the fifth scenario, at 3.0 m, and to 78% in the sixth scenario, at 3.5 m. The lowest average score, 2.0%, occurred in the seventh scenario, at 4.0 m. Of the distances tested, the optimal distance between the webcam and the user for controlling the movement of a mobile robot is 2.5 m.
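The reported figures can be reproduced with simple arithmetic. The sketch below recomputes the overall accuracy from the 227 successes in 250 trials and picks the best distance from the per-scenario accuracies quoted in the text. The variable names are illustrative.

```python
# Overall test at 2.0 m: 5 participants x 5 gestures x 10 repetitions.
trials = 5 * 5 * 10
successes = 227
accuracy = 100 * successes / trials
print(f"{successes}/{trials} recognized -> {accuracy:.1f}%")

# Accuracy (%) per camera-to-user distance (m), as reported in the text.
accuracy_by_distance = {
    1.0: 4.4,
    1.5: 9.0,
    2.0: 90.0,
    2.5: 92.0,
    3.0: 88.0,
    3.5: 78.0,
    4.0: 2.0,
}

best_distance = max(accuracy_by_distance, key=accuracy_by_distance.get)
print(f"best distance: {best_distance} m")
```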
The Stop gesture is relatively easy to identify, while the other gestures, Move Forward, Move Backward, Turn Right, and Turn Left, are relatively difficult to recognize. This occurs because noise in the image gives rise to false positives. Removing noise during preprocessing with a Gaussian filter or a similar filter is expected to reduce the false positive rate and thus increase accuracy. Several CNN architectures for image denoising will also be investigated in future research. Adaptive HSV color segmentation, to be explored in future research, could also help overcome the influence of lighting and clothing color on gesture recognition.
These experiments include scenarios at the nearest distance of 1.0 m and at the furthest distance of 4.0 m from the capture device. In scenario 4, at a distance of 2.5 m and with the best accuracy in these experiments (92%), the data distribution value of 4.19 is the smallest among the scenarios with accuracy above 70%. This study achieves 92% accuracy with a computation time about 9 seconds shorter than that of AlexNet. Although its computation time is lower, its accuracy is lower than that of the AlexNet model, which scores 100%.
The experimental results show that the proposed approach still has several limitations: limited sensor range, dependence on a backdrop, manual HSV settings in the color segmentation step, and accuracy that is still low. In future research, these limitations will be addressed and the results will be implemented in the control of mobile robots in human work environments. From a qualitative perspective based on user experience, future research also needs to examine additional gestures, especially for more complex mobile robot movements.

IV. CONCLUSION
The CNN architecture created by the authors, supported by the developed gesture dataset and HSV-based color segmentation, succeeded in detecting gestures with an average accuracy of 92%. Testing performed best at a distance of 2.5 m in front of the webcam. In the detection process, the lighting and the color of the clothes worn have a large influence on the system. They can make color segmentation difficult, so that the feature map of each gesture image is not formed perfectly. A poorly formed gesture image is difficult for the CNN to detect because of the remaining noise, even though the CNN already applies filters at the convolution stage. Tests were carried out at different distances to determine the robustness of the system, as explained in the experiments above. In the comparison, the proposed model trained about 9 seconds faster with 92% accuracy, but was less accurate than the AlexNet model, which scored 100%.
In future research, several improvements are needed: overcoming the limited sensor range, making the system independent of cluttered backgrounds, adaptive color segmentation, higher accuracy, lower computation time, and implementation on a real mobile robot in human environments, gaming, and assistive technologies.

Fig. 2. Environment setting in image recording process

Fig. 5. Code for image data labelling

Muhammad Fuad, Towards Controlling Mobile Robot Using Upper Human Body Gesture Based on Convolutional Neural Network

3. Activation Function
The activation function is one of the important parameters determining the output, accuracy, and efficiency of the trained model. Using simple mathematical operations, the activation function decides whether a neuron should be activated, depending on whether its input is important to the prediction. Examples of activation functions include Sigmoid, Tanh, ReLU, Leaky ReLU, Parametrized ReLU, Exponential Linear Unit, Swish, and Softmax. The most commonly used activation function is ReLU. Using ReLU can speed up the training process by thresholding the pixel values of the input image at zero. The task of the ReLU activation function is to change pixel values that are negative to 0.

Fig. 6. Flowchart of gesture recognition

The computer described in the training setup, with its 802.11b/g/n and Bluetooth® 4.2 combo networking, was used to complete the training process. The training time was compared with that of the AlexNet architecture, and the proposed model was found to train about 9 seconds faster: the proposed architecture took 40.22 seconds, while the AlexNet model took 49.89 seconds.

Fig. 10. Experiment scenario for investigating the impact of distance

Fig. 11. Results of several experiments with variation in gender and posture size: (a) man with small posture, (b) man with big posture, (c) woman with small posture, (d) woman with big posture

TABLE I. STEPS IN CNN

Stop Gesture, Move Forward Gesture, Move Backward Gesture, Turn Right Gesture, Turn Left Gesture

TABLE II. RESULT OF UPPER HUMAN BODY GESTURE RECOGNITION

TABLE III. TESTING THE DISTANCE OF THE GESTURE DETECTION SYSTEM