Path Following and Obstacle Avoidance for a Mobile Robot in Dynamic Environments Using Reinforcement Learning

Abstract—Obstacle avoidance for a mobile robot traveling from a start location to a desired target is one of the most active research topics. However, few works to date address mobile robots operating in dynamic, continuously changing environments, so this issue remains a research challenge for mobile robots. Traditional obstacle-avoidance algorithms for dynamic, complex environments have many drawbacks. Q-learning, a type of reinforcement learning, has been applied successfully in computer games, yet it is still rarely used in real-world applications. This research presents an effective method for real-time dynamic obstacle avoidance based on Q-learning, demonstrated in the real world on a three-wheeled mobile robot. The positions of the obstacles, both static and dynamic, and of the mobile robot are recognized by a fixed camera installed above the workspace. The input to the robot is the 2D data from the camera; the output is an action for the robot (linear and angular velocities). First, the Q-learning algorithm is trained in simulation; the resulting Q-table is then transferred to the real mobile robot to perform the task in the real scene. The results are compared with an intelligent control method for both static and dynamic obstacle cases. The experiments show that, after training in dynamic environments and testing in a new environment, the mobile robot reaches the target position successfully and performs better than a fuzzy controller.


I. INTRODUCTION
In recent years, due to their agility, maneuverability, and ability to be deployed on many complex missions, mobile robots have attracted many researchers, particularly with respect to autonomous navigation in warehouses or restricted areas [1][2][3][4][5][6][7][8][9][10]. However, methods based on following a fixed line have many drawbacks:
• After a period of use, the line becomes blurred, making it difficult for the robot to detect.
• The line is susceptible to the impact of the surrounding environment, and changing the route is complicated.
• Line following is not flexible when encountering dynamic obstacles and can cause the robot to deviate widely from the planned path, which is dangerous when the robot leaves the safety zone of a factory or workshop.
Nowadays, there is a growing demand for complex applications whose working environments include humans, other moving robots, or suddenly appearing obstacles. Humans and robots are then movable obstacles that form a dynamic environment. Camera-based solutions to this navigation problem [11][12][13][14][15][16] have therefore attracted much interest, as they can overcome the limitations of traditional line detection and also provide the ability to optimize the path. Moreover, since the working area is usually a warehouse or other limited area, the robot must operate flexibly: it should follow a predetermined trajectory and avoid moving obstacles that may appear, without leaving the safely planned trajectory [17][18][19][20]. Numerous traditional approaches exist, such as RRT* [21][22], A* [23][24], Visibility Graph, and Fast Marching Tree, but they mainly focus on autonomous path planning for mobile robots in static, mapped environments where obstacle locations are assumed to be known in advance.
In addition, [25][26][27][28][29][30][31][32][33][34] applied sophisticated algorithms such as Adaptive Genetic, Bacterial Evolutionary, predictive-behavior, and Particle Swarm methods to avoid collisions. Those methods are time-consuming in building and updating a map of the dynamic environment, which lowers the accuracy of the prediction. Traditional obstacle-avoidance algorithms fail when the information about the obstacle is incomplete or completely unknown, and intelligent control needs data or design experience [51][52]. Reinforcement learning (RL), unlike other artificial-intelligence algorithms, is a learning method that does not require any predefined rules [53][54][55][56]. RL is a machine-learning method that takes the feedback of the environment as input and adapts to the environment. Q-learning is one of the most popular RL algorithms: a value-based method whose Q-value function is updated as the environment is explored [57][58][59][60]. Recently, combinations of intelligent control and Q-learning have also been applied [61][62][63][64][65]. However, those studies were carried out only in simulation, or in experiments with simple static objects, and the design of the controller that makes the robot follow the processed (virtual) path is not clearly discussed. Q-learning is based on trial-and-error learning, in which the agent goes through numerous failures before succeeding; this is expensive and time-consuming, so training is difficult to perform in a real environment and tends to be done mostly in simulation. From [61][62][63][64][65] it can be seen that Q-learning can be trained in virtual environments and afterward transferred to the real world in robot applications. To address these challenges, this research develops a virtual training environment for the RL agent.
The virtual environment depicts the actual scenario and enables the user to collect a large number of interactions in various environments. The agent interacts with the environment through its actions and can be trained with various user-defined rewards and goals. Afterward, the learned table is transferred to the real mobile robot for the real experiments.
The contributions of this study are:
• Unlike other traditional and intelligent algorithms, the obstacle-avoidance method for moving obstacles in this paper does not use any prior dataset for training or any design experience for the controller.
• The trained data are transferred to the robot and work in real time in a real application.
• In the experiments, RL proved to perform better than an intelligent control algorithm in terms of total time and error.
Fig. 1 shows the parameters of the two-wheel mobile robot, where (x_g, y_g) is the goal point, (x_c, y_c) is the control point of the robot, d is the radius of the wheels, b is the distance between the two wheels, and e_1, e_2, e_3 are the errors between the robot and the goal point.

II. ROBOT MODELING
From Fig. 1, the errors between the control point and the goal point are given in equations (1) and (2).
Differentiating the errors gives the error dynamics, whose two control inputs are the linear velocity v and the angular velocity ω; the objective of the controller is to drive the errors to zero, [e_1, e_2, e_3] = [0, 0, 0]. Lyapunov's stability method is used to design the controller.

Fig. 1. Robot modelling
Theorem: consider a system described by a state equation ẋ = f(x_1, …, x_n). If there exists a positive-definite function V of all the state variables such that its time derivative is a negative-definite function, then the system is stable [66].
Choose the positive-definite Lyapunov function shown in (3) and compute its derivative.
Substituting v and ω into the derivative of (3) makes it negative-definite. From Fig. 1, the velocity of each wheel is then calculated in equations (4) and (5).
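Since equations (1)–(5) are not reproduced above, the following sketch uses the standard differential-drive tracking-error formulation, which the paper's error definitions are assumed to follow; the gains k1, k2, k3, the reference velocities, and all function names are illustrative assumptions, not the paper's values.

```python
import math

def pose_errors(robot, goal):
    """Errors between the control point and the goal, expressed in the
    robot frame (the standard form equations (1)-(2) are assumed to take)."""
    xc, yc, theta = robot
    xg, yg, theta_g = goal
    dx, dy = xg - xc, yg - yc
    e1 = math.cos(theta) * dx + math.sin(theta) * dy   # along-track error
    e2 = -math.sin(theta) * dx + math.cos(theta) * dy  # cross-track error
    e3 = theta_g - theta                               # heading error
    return e1, e2, e3

def lyapunov_control(e1, e2, e3, v_ref=0.5, w_ref=0.0,
                     k1=1.0, k2=4.0, k3=2.0):
    """Kanayama-style control law: for positive gains the Lyapunov
    derivative is negative-definite, so [e1, e2, e3] -> [0, 0, 0]."""
    v = v_ref * math.cos(e3) + k1 * e1
    w = w_ref + v_ref * (k2 * e2 + k3 * math.sin(e3))
    return v, w

def wheel_speeds(v, w, d=0.05, b=0.30):
    """Map (v, w) to right/left wheel angular speeds; d is the wheel
    radius and b the distance between the wheels, as in Fig. 1."""
    return (2 * v + b * w) / (2 * d), (2 * v - b * w) / (2 * d)
```

With the robot at the origin facing a goal one meter ahead, the cross-track and heading errors vanish and both wheels receive the same speed, i.e. the robot drives straight toward the goal.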

III. CONTROL DESIGN
A. Q-learning
Q-learning is a model-free reinforcement learning algorithm [67][68]. The goal of Q-learning is to learn a policy: rules that tell the agent what action to take under what circumstances. The algorithm does not require a model of the environment [69][70][71].
Specifically, when the robot approaches an obstacle, it compares the relative position of the obstacle and then takes an action (turn left, turn right, or go straight). Every time the robot passes or hits an obstacle, it receives the corresponding reward. This process repeats until the robot finds the actions that maximize the received reward, in other words, the best rule for avoiding the obstacle.
The position of the obstacle relative to the robot is divided into 8 angular sectors G_i (i ∈ {1, …, 8}), corresponding to 8 states. Each sector G_i spans 45°, starting from the robot's current heading and proceeding in the positive trigonometric (counter-clockwise) direction, as shown in Fig. 2. The robot calculates the relative angle (the angle between the robot's current heading and the line connecting the control point to the center of the obstacle), determines which state it is in, and then issues the corresponding action for that state.
In the Q-learning algorithm, the action for each state is obtained from the Q-table. The rows of the table are the states reached by the robot during the learning process, and the columns are the predefined actions that the robot can perform in those states. Initially, the value of every cell in the table is set to 0; the values are then updated gradually during the learning process. The value in each cell is calculated by the formula in equation (6).
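The 45° discretization above can be sketched as follows; the function name and pose convention are assumptions, but the sector layout follows Fig. 2 as described (counter-clockwise from the robot's current heading).

```python
import math

def obstacle_state(robot_pose, obstacle_xy):
    """Map the obstacle's bearing to one of the 8 states G1..G8,
    each 45 degrees wide, measured counter-clockwise from the
    robot's current heading as in Fig. 2."""
    x, y, theta = robot_pose
    ox, oy = obstacle_xy
    bearing = math.atan2(oy - y, ox - x) - theta  # relative angle
    bearing %= 2 * math.pi                        # wrap into [0, 2*pi)
    return int(bearing // (math.pi / 4)) + 1      # sector index 1..8
```

For example, an obstacle dead ahead falls in G1, while one directly behind the robot falls in G5.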
where s_t and a_t are the state and action at time t, Q(s_t, a_t) is the value of state s_t and action a_t, r(s_t, a_t) is the reward received for performing action a_t in state s_t, max_a Q(s_{t+1}, a) is the largest value in the row of the Q-table corresponding to the next state s_{t+1}, and γ is the attenuation (discount) factor, with a value less than 1 to ensure that values further from the target are smaller.
When the robot is in state s_t, it finds the action with the maximum value for that state, performs it, and then updates the value of the cell Q(s_t, a_t). This continues until the learning process is over, at which point the robot has learned rules (through the Q-table) that allow it to avoid obstacles in many different cases.

B. Fuzzy Control
Fuzzy logic is a control approach that mimics human processing of ambiguous information and decision-making [72][73][74][75][76][77]. Specifically, when the robot approaches an obstacle, the input is not merely whether the distance is greater or less than the safe distance, but graded values such as slightly far, far, slightly close, and close, combined with predefined rules to produce the corresponding output. The inputs are the angle α, formed by the robot's current heading and the line connecting the control point to the center of the obstacle, and the variation of that angle, Δα. The output is the robot's angular speed, as shown in Fig. 3. Center-average defuzzification is used. The angle α is partitioned into three fuzzy sets over the range −60 to 60 degrees, and the change rate Δα is partitioned into three fuzzy sets over the range −8 to 8. The output has five fuzzy sets over the range −20 to 20 rad/s. All the parameters are shown in Fig. 4 and Fig. 5. With three subsets for α and three for Δα, there are nine rules, and the rule base is therefore represented as a 3×3 matrix, as shown in Table 1.
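A minimal sketch of this fuzzy controller follows. The exact membership breakpoints of Figs. 4–5 and the rule consequents of Table 1 are not reproduced in the text, so the triangular sets and the symmetric rule table below are assumptions; only the input/output ranges and the center-average defuzzification come from the description above.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet at a and c, peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Three fuzzy sets over the angle alpha (degrees) and its change rate;
# breakpoints are illustrative assumptions.
ANGLE_SETS = {"N": (-120, -60, 0), "Z": (-60, 0, 60), "P": (0, 60, 120)}
DELTA_SETS = {"N": (-16, -8, 0), "Z": (-8, 0, 8), "P": (0, 8, 16)}

# 3x3 rule base (an assumed symmetric stand-in for Table 1); each
# consequent is a singleton angular speed in [-20, 20] rad/s.
RULES = {("N", "N"): -20, ("N", "Z"): -10, ("N", "P"): 0,
         ("Z", "N"): -10, ("Z", "Z"): 0,  ("Z", "P"): 10,
         ("P", "N"): 0,   ("P", "Z"): 10, ("P", "P"): 20}

def fuzzy_omega(angle, d_angle):
    """Center-average defuzzification over the fired rules."""
    num = den = 0.0
    for (sa, sd), out in RULES.items():
        w = min(tri(angle, *ANGLE_SETS[sa]), tri(d_angle, *DELTA_SETS[sd]))
        num += w * out
        den += w
    return num / den if den else 0.0
```

With this symmetric rule base, a zero angle and zero change rate yield zero angular speed, and the extremes of the input ranges map to the extremes of the output range.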

IV. SIMULATION RESULTS
For designing the Q-learning controller, the simulation has two phases: training and testing. The simulation investigates the ability of the mobile robot to accomplish the task without hitting obstacles in different environments. The Q-learning and fuzzy controllers are applied in the training phase. In the testing phase, the mobile robot is placed in another environment, and the algorithms discussed above are performed and compared.
Training phase: the mobile robot is trained to reach the target in an environment with eight static obstacles (black circles) and six dynamic obstacles (blue circles), each 10 cm in diameter. The dynamic obstacles are generated randomly and move with different velocities and trajectories, as shown in Fig. 6. Six random start points are used for training to make sure the robot can operate in different environments. The velocity of the robot is assumed constant at 0.5 m/s. Using the controller discussed in Section III, the robot keeps tracking the shortest path (dotted line) to reach the target while avoiding the obstacles. After twenty-two epochs from different positions, the robot can reach the target without hitting any obstacle. The training performance and the learned table are shown in Fig. 7.
Testing phase: the robot reuses the map shown in Fig. 6, but the robot's start positions are changed to create different operating environments. The robot is placed at the new starting positions, and the fuzzy and Q-learning algorithms are run from those locations in turn. Their performance is illustrated in Fig. 8 and Fig. 9. The average error between the shortest path and the simulated path of each algorithm, and the average time for each simulation, are summarized in Table 2. From Fig. 8 and Fig. 9, it can be seen that under the same conditions the robot can complete the task with both algorithms. However, with the Q-learning algorithm the robot performs better when avoiding obstacles: it changes direction less often, which is very important when the robot carries objects in a warehouse or working area.
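The two-phase procedure above (train with exploration, then reuse the frozen table) can be sketched end to end on a toy problem. The environment, its reward values, and all parameter values below are illustrative assumptions standing in for the simulator described in the text; only the overall structure (epsilon-greedy training with the equation (6) update, then greedy testing on the unchanged table) mirrors the paper.

```python
import random

class ToyEnv:
    """Minimal stand-in for the simulator: states 0..4 on a line with
    the goal at state 4; action +1 moves toward the goal, -1 away.
    The rewards (+10 at the goal, -1 per step) are assumptions."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = max(0, min(4, self.s + a))
        done = (self.s == 4)
        return self.s, (10.0 if done else -1.0), done

def train(env, episodes=50, alpha=0.5, gamma=0.9, epsilon=0.2):
    """Training phase: epsilon-greedy exploration with the equation (6)
    update; returns the learned Q-table."""
    q = {(s, a): 0.0 for s in range(5) for a in (-1, 1)}
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice((-1, 1)) if random.random() < epsilon \
                else max((-1, 1), key=lambda a: q[(s, a)])
            s2, r, done = env.step(a)
            best = max(q[(s2, -1)], q[(s2, 1)])
            q[(s, a)] += alpha * (r + gamma * best - q[(s, a)])
            s = s2
    return q

def greedy_run(env, q, limit=20):
    """Testing phase: follow the frozen table greedily, since the
    trained table is reused without further updates."""
    s, steps, done = env.reset(), 0, False
    while not done and steps < limit:
        a = max((-1, 1), key=lambda a: q[(s, a)])
        s, _, done = env.step(a)
        steps += 1
    return done
```

After training, the greedy policy derived from the table reaches the goal without further exploration, which is the same table-transfer idea used when moving from simulation to the real robot.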
From Table 2 and Table 3 it can be seen that, at every position, the Q-learning algorithm has a smaller total time and a smaller absolute error between the shortest planned path (the line connecting the start and goal points) and the real path. The Q-learning algorithm helps the robot follow the planned path better, so the robot does not leave the safe zone around the planned path.

V. EXPERIMENT
The robot is brought to another operating environment, similar to the real experiment, to simulate and compare the results with the experiment. The Q-learning table obtained in simulation is reused in the real experiment. The robot, with an AR marker attached on top, operates on flat terrain; a camera fixed above the workspace, with a frame rate of 30 fps and fixed lighting conditions, creates the map shown in Fig. 10. The robot (marked with a red "R") moves from the starting point to the destination point; the robot's initial heading is 0°. The obstacles are fixed-size objects of the same height as the robot, each mounted with an AR marker (static obstacles are marked with a red "SO", moving obstacles with a red "MO"). There are two static obstacles and one dynamic obstacle. On the map, the starting point is marked in black and the destination point in green; static obstacles are black circles and dynamic obstacles are purple circles. The overall controller for the mobile robot is shown in Fig. 11. First, the AR marker on the robot gives the robot's position, and the pose errors [e_1, e_2, e_3] between the current location and the target pose are calculated. The wheel velocities are then computed to move the robot toward the target position. On the way to the destination, the mobile robot uses the Q-learning algorithm to avoid the obstacles. The results of simulation and experiment are shown in Fig. 12: as the robot moves to the destination, the moving obstacle (pink circle) moves to cut across the mobile robot's direction, and the robot must use the designed algorithms to avoid it.
The performance of the controllers is shown in Fig. 13 and Table 4. The blue line is the robot's trajectory, and the red points are the robot's positions while performing the obstacle-avoidance algorithm. From Fig. 12 it can be seen that the responses of the control algorithm in simulation and experiment are similar. In both simulation and experiment in the new environment, the Q-learning algorithm always gives better results, as its trajectory changes direction less.
From Table 4, it can be seen that the total time and error of Q-learning are smaller. This is because the trajectory of the Q-learning algorithm is smoother.

VI. CONCLUSION
In this paper, a path-following algorithm for a mobile robot, based on Lyapunov stability, and obstacle avoidance using Q-learning have been implemented. A series of simulations and experiments was conducted to verify that the obstacle-avoidance algorithm helps the robot avoid obstacles without leaving the planned path. These algorithms can be applied in factories or pre-planned areas with fixed overhead cameras. The simulations and experiments show that the Q-learning algorithm can be trained in a virtual environment before being applied in the real environment. Comparing the total time and error in Table 4, the obstacle-avoidance performance of the Q-learning controller is better than that of the fuzzy controller in terms of total time and error with respect to the planned path, because the trajectory of the Q-learning algorithm is smoother. For future improvement, the error between the simulation and the real experiment can be reduced by implementing a visual-servoing control algorithm.