Modified Q-Learning Algorithm for Mobile Robot Path Planning Variation using Motivation Model

Abstract— Path planning is an essential algorithm in autonomous mobile robots, including agricultural robots, to find the shortest path and to avoid collisions with obstacles. The Q-Learning algorithm is one of the reinforcement learning methods used for path planning. However, for multi-robot systems, this algorithm tends to produce the same path for each robot. This research modifies the Q-Learning algorithm to produce path variations by utilizing a motivation model, i.e. achievement motivation, in which different motivation parameters result in different optimum paths. The Motivated Q-Learning (MQL) algorithm proposed in this study was simulated in an area with three scenarios, i.e. without obstacles, with uniform obstacles, and with random obstacles. The results showed that, in the scenarios considered, MQL can produce 2 to 4 variations of the optimum path without any potential collision (Jaccard similarity = 0%), in contrast to the Q-Learning algorithm, which can only produce one optimum path variation. This result indicates that MQL can solve multi-robot path planning problems, especially when the number of robots is large, by reducing the possibility of collisions as well as decreasing the problem of queues. However, the average computational time of MQL is slightly longer than that of Q-Learning.


INTRODUCTION
Agricultural technology is rapidly advancing towards the Agriculture 4.0 paradigm. Agriculture 4.0, as described in [1, Chapter 2], refers to the use of artificial intelligence, big data, the Internet of Things (IoT), and robotics to increase the efficiency of agricultural production activities. Javaid et al. [2] mentioned the importance of implementing robotics in smart farming. However, the change from traditional technology to automated devices brings both opportunities and challenges [3], [4], including in the use of agricultural robots [5]-[11]. Oliveira et al. [5] showed the notable advances in mobile robotics and the advantages of investing in these technologies. The development of agricultural robotic systems will continue to increase their efficiency and robustness. Other research has also sought solutions to navigation problems for mobile robots in agriculture [12]-[15].
Autonomous navigation is an important aspect in the field of agricultural robots [12], [16], and covers four key requirements: mapping, localization, motion control, and path planning. Path planning is an essential issue in robotics. This task revolves around identifying rotational actions and a series of translations to move from the initial position to the goal while avoiding obstacles [17]. The exploration of robotic path planning is a critical area of investigation in the field of robotics, including the use of mobile robots in agricultural settings [6], [18].
The Q-Learning algorithm is one of the reinforcement learning algorithms currently employed in path planning. It is a classical reinforcement learning algorithm that has been implemented in several studies to produce optimum paths [68]-[75] and is frequently used for path planning on mobile robots [69]-[72], [75], [76]. In general, these studies indicate that the benefit of Q-Learning is that it always produces an optimum path. The drawback, however, is that when the Q-Learning algorithm is applied to several robots in the same area with the same task, the resulting paths tend to be nearly identical. Hence, these paths have the potential for collisions between robots when all robots move simultaneously.
The objective of this study is to modify the Q-Learning algorithm by utilizing a motivation model to generate diverse yet optimum path options for multiple mobile robots in the same area. Previous studies have used motivation models in algorithms to influence agents in making decisions [77]-[79]. In this study, the achievement motivation model [77] is incorporated in the Q-Learning algorithm to produce variations of the optimum path. We call this algorithm the Motivated Q-Learning (MQL) algorithm. By having more than one optimum path, in the case where communication between robots is not available, the possibility of collisions can be reduced. In addition, in the case where the robots can communicate with each other, the queuing problem can be decreased.

A. Reinforcement Learning (RL) and Q-Learning Algorithm
Reinforcement learning (RL) is a machine learning method in which actions are taken based on rewards [80]. The concepts of rewards and penalties are used to explore an environment. Five important terms are used in the Q-Learning algorithm, namely agent, state, action, reward, and penalty. In this case, the agent is a mobile robot, i.e. an object that moves in the environment. The position of the agent in the environment is represented by the state (S). The action (A) represents the movement of the agent from one state to another. Rewards are positive values given when the agent takes a correct action, while penalties are negative values given when the agent takes an incorrect action. Through exploration and exploitation, the agent gains experience. Exploration allows the agent to visit state-action pairs in the environment at random, without considering the current state, whereas exploitation uses the agent's acquired knowledge to select the action that maximizes the reward from the current state. One type of RL method is the Q-Learning algorithm [81].
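To make the exploration-exploitation trade-off concrete, the following minimal sketch shows a common epsilon-greedy selection rule in Python (the language used later for the simulations). The function name, the epsilon parameter, and the dictionary-based Q-table are illustrative assumptions, not part of the paper's implementation.

```python
import random

def epsilon_greedy_action(q_table, state, actions, epsilon=0.1):
    """Pick an action for `state`: explore with probability epsilon,
    otherwise exploit the action with the highest stored Q value."""
    if random.random() < epsilon:
        return random.choice(actions)              # exploration: random action
    # exploitation: action with the largest Q value recorded for this state
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```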
In the Q-Learning algorithm, the Q values are stored in a two-dimensional Q-table, one entry for each state and possible action. The algorithm chooses the action with the highest Q value. Equation (1) is the Q-Learning update rule by Watkins [80]:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \qquad (1)$$
The position of the agent, or state, at time t is represented by s_t. The agent's action in state s_t is represented by a_t. The reward r_{t+1} is the value received after the agent executes action a_t in state s_t. Q(s_t, a_t) is the value generated by action a_t in state s_t. The discount factor (γ) serves as a variable determining the significance of upcoming rewards. Its value ranges between 0 and 1. A value near 0 implies that the agent prioritizes immediate rewards, while a value near 1 signifies the agent's consideration of future rewards. The learning rate (α), ranging from 0 to 1, affects the pace of achieving convergence. When α is close to 0, convergence takes a long time. Conversely, higher values prompt the agent to make drastic adjustments to the Q value, hindering convergence due to fluctuating outcomes. The pseudocode for Q-Learning is presented in Algorithm 1.

Algorithm 1 Q-Learning Algorithm
1: Initialize Q(S, A) = 0 for all S ∈ S+, A ∈ A(S)
2: Loop for each episode:
3:   Initialize S
4:   Loop for each step in the episode:
5:     Select A from S by using the policy derived from Q
6:     Take action A, observe R, S′
7:     Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
8:     S ← S′
9:   Until S is the target

Initially, all values of Q(S, A) in the Q-table are set to zero. S and A refer to the state and action, respectively, where S is an element of the entire state space (S+) and A is an element of the set of possible actions for that state (A(S)). Then, the initial state S is determined. The Q value is updated in the looping section.
During the iterative procedure, an action (A) is chosen for execution in the current state (S) based on the policy derived from the Q values. The agent then takes the action and observes both the reward (R) and the subsequent state (S′). The Q value in the Q-table is updated using equation (1), and the current state (S) is set to the value of the next state (S′). This looping process persists until the current state matches the target state. Fig. 1 illustrates the process of Q-Learning. The state s_n is taken as the initial state, and the feasible actions (A) are obtained from Q using the expression max_a Q(s_{n+1}, a). The selected action transitions the agent to the subsequent state (s_{n+1}), acquiring a reward value (r_{n+1}) in the process. This sequence continues until convergence is achieved.
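As an illustration of Algorithm 1 and the update in equation (1), the sketch below trains a Q-table on a generic grid environment. The environment interface (`reset`, `step`, `actions`) and the hyperparameter values are assumptions made for the example; the paper's own simulation settings appear in Section III.

```python
from collections import defaultdict
import random

def q_learning(env, episodes=5000, alpha=0.9, gamma=0.9, epsilon=0.1):
    """Tabular Q-Learning following Algorithm 1 and equation (1).
    `env` is assumed to expose reset() -> state and
    step(state, action) -> (next_state, reward, done)."""
    Q = defaultdict(float)                        # Q(S, A), initialised to zero
    for _ in range(episodes):
        s = env.reset()                           # initialise S
        done = False
        while not done:                           # loop for each step in the episode
            if random.random() < epsilon:         # epsilon-greedy policy derived from Q
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(s, a)      # take action A, observe R, S'
            best_next = max(Q[(s_next, x)] for x in env.actions)
            # equation (1): Q(S,A) <- Q(S,A) + alpha [R + gamma max_a Q(S',a) - Q(S,A)]
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next                            # S <- S'
    return Q
```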

B. Achievement Motivation Model
The motivation model is a model that can be applied to agent intelligence to help identify, prioritize, choose, and adapt to targeted goals. The application of a motivation model in path planning algorithms can help mobile robots move according to the given motivation, resulting in different path variations for the same area and goal. One of the motivation models proposed by Merrick and Shafi is achievement motivation [77]. This motivation can be defined as the need for success or the achievement of excellence. According to [77], achievement motivation is based on the estimation of the probability of success and the difficulty of the task, which is modeled by equation (2).
This model has six parameters: P_s(g), M_ach^+, M_ach^-, ρ_ach^+, ρ_ach^-, and S_ach. P_s(g) is the subjective probability of successfully achieving the goal g. M_ach^+ is the sigmoid turning point for approach motivation, and M_ach^- is the sigmoid turning point for avoidance motivation. ρ_ach^+ is the gradient for approach and ρ_ach^- is the gradient for avoidance. Finally, S_ach is a measure of the relative strength of achievement motivation.
When the approach turning point is to the left of the avoidance turning point (i.e., M_ach^+ < M_ach^-), the resulting tendency represents individuals who are motivated to succeed, while M_ach^+ > M_ach^- represents individuals motivated to avoid failure. ρ_ach^+ > 0 represents the gradient of approaching success, while ρ_ach^- > 0 represents the gradient of avoiding failure. The resulting motivation value can be used in the development or modification of artificial intelligence algorithms to influence decision-making processes. Determining the values of these variables determines the expected motivational tendency. In this case, the achievement tendency is used to avoid collisions.
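Since equation (2) is not reproduced above, the following sketch only illustrates one common way the approach and avoidance sigmoids of Merrick and Shafi's model can be combined; the exact functional form, the parameter defaults, and the way S_ach scales the result are assumptions made for illustration, not the paper's equation (2).

```python
import math

def achievement_motivation(p_s, m_plus=0.7, m_minus=0.3,
                           rho_plus=2.0, rho_minus=2.0, s_ach=1.0):
    """Illustrative achievement-motivation tendency for a subjective
    probability of success p_s in [0, 1].  The combination of the two
    sigmoids and the role of s_ach are assumptions, not equation (2) itself."""
    approach = 1.0 / (1.0 + math.exp(rho_plus * (m_plus - p_s)))      # approach-success sigmoid
    avoidance = 1.0 / (1.0 + math.exp(rho_minus * (m_minus - p_s)))   # avoid-failure sigmoid
    return s_ach * (approach - avoidance)
```

Under this assumed combination, choosing M_ach^+ = 0.7 greater than M_ach^- = 0.3 (the settings used later in the simulations) yields a negative tendency over the whole probability range, which is consistent with the avoidance behaviour described above.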

C. The Proposed Method
The proposed modified algorithm is presented in Fig. 2. The reward achievement (r_ach) is used to affect the update of the Q value. The new reward value is obtained by adding the reward achievement (r_ach) to the initial reward. The r_ach value is influenced by the probability value (P), the M_ach value, the K value, and the S_ach value. Based on equation (3), P is proportional to r_ach; this means that the greater the value of P, the greater the value of r_ach. However, because M_ach is negative (to model obstacles), the larger P is, the more negative r_ach becomes. Likewise, the greater the K value, the more negative r_ach. The K value and the S_ach value are used to control the magnitude of r_ach. In practice, r_ach is used to update the rewards of the states used in the previous path. The more negative r_ach is, the more strongly the state is treated as an obstacle, i.e. a state that cannot be passed, so that the next agent is expected to find a new path as a path variation. Therefore, equation (4) shows how the Q value is updated in the marked states and in the normal (unmarked) states.
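To illustrate how r_ach is applied, the minimal sketch below marks the states of a previously found route by adding the (negative) r_ach to their rewards, leaving unmarked states untouched. The function and data-structure names are illustrative assumptions, and the exact form of equation (4) is not reproduced here.

```python
def mark_previous_route(rewards, route, r_ach):
    """Add the (negative) achievement reward r_ach to every state on a
    previously found route, so later searches treat those states like obstacles.
    `rewards` maps state -> reward value; `route` is the list of visited states."""
    updated = dict(rewards)
    for state in route:
        updated[state] = updated[state] + r_ach   # marked state: reward + r_ach
    return updated                                # unmarked states keep their reward
```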
Fig. 3 shows the flowchart of the MQL algorithm for finding path variations based on the motivation model. In the first route search, the initial reward value (step a) is used to update the Q value. After the process of updating the values in the Q-table (step b) is completed, the agent searches for a route from the starting point to the target point (step c) based on the values in the Q-table. In step d, if the first path (route 0) is found, the agent saves the path as route 0 (step e) and continues searching for the second path (route 1) while considering route 0; if route 0 is not found, the algorithm reports that the route was not found (step f). In searching for the second path, the reward values of the states on route 0 are updated (step g) using equation (4): each reward of a state on route 0 is added with the r_ach value. The algorithm then executes steps h, i, j, and k, analogous to steps b, c, d, and e, for route 1. If route 1 is not found, the algorithm goes to step f to report that the route was not found. Likewise, for the search of the next route variation, the reward values of the states (for example, on route 1) are updated by adding the r_ach value to the old reward (step l). The algorithm then executes steps m, n, o, and p, analogous to steps b, c, d, and e, for route 2. If the path search does not find the target point, the search is terminated with a path-not-found notification. The pseudocode of MQL is shown in Algorithm 2.

Algorithm 2 MQL Algorithm
...
13:     S ← S′
14:   Until S is the target
15:   Save the route
16:   Update R by adding r_ach to the reward of each state on the saved route
17: Until the route is not found

The MQL procedure is developed from the Q-Learning procedure. In the MQL procedure, we add the variables M_ach^+, M_ach^-, ρ_ach^+, ρ_ach^-, S_ach, P, and K to produce M_ach and r_ach. The r_ach value is added to the reward R of the states used in the previous path.
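A compact sketch of the overall MQL loop described by the flowchart and Algorithm 2 is given below: each pass re-trains the Q-table on the current rewards, extracts a route, and, if a route is found, penalizes its states with r_ach before the next search. The helpers `q_learning` and `mark_previous_route` refer to the sketches given earlier; the route-extraction helper, the `env.target`/`env.rewards` attributes, and the stopping condition are assumptions made for illustration.

```python
def extract_route(env, Q, max_steps=200):
    """Greedy walk over the learned Q-table; returns None if the target
    is not reached within max_steps (assumed route-extraction helper)."""
    s, route = env.reset(), []
    for _ in range(max_steps):
        route.append(s)
        if s == env.target:
            return route
        s, _, _ = env.step(s, max(env.actions, key=lambda a: Q[(s, a)]))
    return None

def motivated_q_learning(env, r_ach, max_routes=4):
    """Sketch of the MQL route-variation loop (Algorithm 2): repeat
    Q-Learning, save each route found, then add r_ach to the rewards of
    its states so that the next search is pushed toward a different path."""
    routes = []
    while len(routes) < max_routes:
        Q = q_learning(env)                                   # re-train on current rewards
        route = extract_route(env, Q)
        if route is None:                                     # no further route: stop
            break
        routes.append(route)                                  # save the route
        env.rewards = mark_previous_route(env.rewards, route, r_ach)
    return routes
```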
Simulations were conducted in three areas with different obstacle conditions to determine the performance of the proposed method. The measurements in this research are the number of path variations, the computation time, and the total reward of each path, as well as the similarity between paths. In addition, a comparison was made with the Q-Learning algorithm.

III. RESULTS AND DISCUSSION
A. The Change in the Value of r_ach Based on P, K, and S_ach

The changes in the values of P, K, S_ach, and M_ach have a significant impact on the value of r_ach. The given values of the variable P range from 0.1 to 1 with a step of 0.1, the value of K ranges from 5 to 50 with a step of 5, and the value of S_ach ranges from 1 to 3. Meanwhile, the value of M_ach is obtained from the variables of the motivation model and the value of P. The changes in the r_ach value based on P, K, and S_ach are shown in Table I, Table II, and Table III, respectively. The lowest r_ach is -319.43, while the largest values are -2.94 (at S_ach = 1), -0.29 (at S_ach = 2) and -0.03 (at S_ach = 3).
In addition, the graphs showing the changes in the value of r_ach, as influenced by changes in the values of P, K, S_ach, and M_ach, are displayed in Fig. 4, Fig. 5, and Fig. 6, respectively. The graphs show that as the values of S_ach, P, and K increase, the r_ach value decreases. At K = 5, the decrease in r_ach with respect to P is not significant. The lowest value of r_ach in this range occurs when K is increased up to K = 15, reaching -95.83. The changes in r_ach that are close to linear occur at S_ach values of 1 and 2. Meanwhile, when S_ach is 3, significant changes in r_ach occur starting from P = 0.5.
These results show that, in accordance with equation (3), the larger the value of K, the greater the influence of this constant on the probability of obtaining a larger reward. However, because the M_ach value is negative, increasing the value of K weakens the r_ach value. The greater the S_ach value, the greater the influence of the probability on the r_ach value; if the M_ach value is negative, increasing the S_ach value will also weaken the r_ach value. When M_ach is negative, the greater the divisor in the equation, the smaller the r_ach obtained. In its application, the value of r_ach is utilized to update the reward values of the states that have been used in the previous path. The more negative the r_ach value, the more strongly the state is treated as an obstacle, so that the possibility of a collision is avoided.

B. The MQL Simulation
The MQL algorithm was simulated on a computer with an Intel Core i5-3570 processor, a clock speed of 3.4 GHz, and 4 GB of RAM. The software used was Jupyter Notebook with the Python 3.9 programming language. The simulation was conducted in an 11x11 area (121 states) with several scenarios, i.e. an obstacle-free area (scenario 1) and areas with obstacles (scenario 2 and scenario 3). The values of the learning rate (α) and discount factor (γ) were both 0.9. The number of iterations used was 5000. The values of the achievement motivation model variables were M_ach^+ = 0.7, M_ach^- = 0.3, ρ_ach^+ = ρ_ach^- = 2, and S_ach = 1. These values give a tendency to avoid failure, which means the agent or robot will avoid obstacles or states that have already been used by another agent. The value of P was varied from 0.1 to 1. The testing was conducted by providing the values of P, K, and S_ach to produce the values of M_ach and r_ach. The initial reward value for each passable state was -1, the reward value for the target state was 999, and the reward value for an obstacle state was -100. The reward function is defined in equation (5):

$$R(s) = \begin{cases} 999, & \text{if } s \text{ is the target state} \\ -100, & \text{if } s \text{ is an obstacle state} \\ -1, & \text{otherwise} \end{cases} \qquad (5)$$

In the simulation, the initial state is marked in green, the target state in orange, and the obstacle states in black. Four paths are searched according to the four directions of agent movement, i.e. forward, backward, left, and right. In addition, the four routes produced by a simulation are shown in different colors (black = route 0, blue = route 1, brown = route 2, and red = route 3).
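As a concrete illustration of the reward assignment in equation (5), the sketch below builds a reward table for an 11x11 grid. The grid representation, the helper name, and the example start/target placement are assumptions; the reward values (-1 passable, 999 target, -100 obstacle) follow the text.

```python
import numpy as np

def build_reward_grid(size=11, target=(10, 10), obstacles=()):
    """Reward table following equation (5): -1 for passable states,
    999 for the target state, and -100 for obstacle states."""
    rewards = np.full((size, size), -1.0)      # passable states
    for obstacle in obstacles:
        rewards[obstacle] = -100.0             # obstacle states
    rewards[target] = 999.0                    # target state
    return rewards

# Example: a scenario-1-like obstacle-free area (hypothetical target position)
R = build_reward_grid(size=11, target=(10, 10))
```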

Furthermore, the similarity of the states on the path variations was measured by the Jaccard similarity [82], [83] using equation (6). The variables A and B represent the sequences of states on route A and route B: the total number of states shared between A and B is divided by the number of states in A and B.

$$sim(A, B) = \frac{\text{the total number of shared states in } A \text{ and } B}{\text{the number of states in } A \text{ and } B} \times 100\% \qquad (6)$$
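A small sketch of the similarity check in equation (6) is shown below. Following the text, the start and target states are assumed to be removed from the routes beforehand, and "the number of states in A and B" is read here as the size of their union, which matches the standard Jaccard index [82].

```python
def jaccard_similarity(route_a, route_b):
    """Equation (6): shared states of two routes divided by the states in
    their union, expressed as a percentage.  Start and target states are
    assumed to have been excluded from the inputs beforehand."""
    a, b = set(route_a), set(route_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b) * 100.0

# Two routes form a "safe" pair when jaccard_similarity(...) == 0.
```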
1. Scenario 1

Scenario 1 simulates path planning for a single agent and a single target in an obstacle-free 11×11 area. The Q-Learning algorithm simulation produced four paths (each with 17 states); however, all paths tend to be similar. In contrast, the MQL simulation can produce several path variations. Table IV shows the detailed results of the MQL simulation, which was run with different combinations of S_ach, P, and K. Here, we calculate the maximum number of paths with Jaccard similarity = 0%, which we call "safe" path variations. At S_ach = 1, two safe path variations were produced in 1 simulation, three safe path variations were produced in 44 simulations, and four safe path variations were produced in 55 simulations. Examples of the paths produced by the Q-Learning simulation and of the safe path variations generated by the MQL simulation (with P = 0.7, K = 25, and S_ach = 2) are shown in Fig. 8(a) and Fig. 8(b), respectively. Even though the four paths of the MQL simulation have different lengths, none of them has a potential collision. Meanwhile, the average path-finding computation time of the Q-Learning algorithm is 1.170 ± 0.04 seconds (95% confidence level), while that of the MQL algorithm is 1.356 ± 0.21 seconds (95% confidence level); the computing time of MQL is slightly longer than that of the Q-Learning algorithm. The average reward of the Q-Learning simulation results is 983, while for MQL it is 981, a difference of only 2 points. Table V shows the simulation results data in scenario 1.

The similarity values for the states traversed by the routes were calculated using the Jaccard similarity. The states shared between routes were counted and divided by the total number of states of both routes (excluding the starting and target states). The similarity value indicates the existence of shared states and the potential for collision between routes. Table VI shows the detailed similarity values. The average similarity value of the Q-Learning simulation results is 61.17%, which shows a potential collision between routes; the highest similarity, 100%, occurs between routes 2 and 3. In contrast, the average similarity value of the MQL simulations is 0%, which indicates no potential collision between routes.

2. Scenario 2
In scenario 2, a simulation is conducted using an area with seven rectangular obstacles. The Q-Learning algorithm simulation resulted in four routes, each consisting of 17 states; however, the paths have a high potential of colliding. In contrast, the MQL algorithm simulation can produce several safe path variations without potential collisions (see Table VII). The recapitulation graph of the number of safe path variations is shown in Fig. 9.

3. Scenario 3
Scenario 3 is conducted in an area with randomized obstacles. The Q-Learning simulation result is shown in Fig. 12(a), and the MQL results are shown in Table X. The recapitulation graph of the number of safe path variations is shown in Fig. 11. Fig. 12 shows example routes from the Q-Learning simulation (Fig. 12(a)) and from the MQL simulation; the example safe path variations of MQL were created in the simulation with P = 0.7, K = 30, and S_ach = 2. Meanwhile, the average path-finding computation time of the Q-Learning algorithm is 1.472 ± 0.14 seconds (95% confidence level), while that of the MQL algorithm is 1.919 ± 0.71 seconds (95% confidence level). Thus, the computing time of the MQL algorithm is slightly longer than that of the Q-Learning algorithm. The average reward for each path in the Q-Learning simulation results is 983, while in MQL it is 981, a difference of only 2 points. The simulation results for scenario 3 are shown in Table X and Table XI.
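The computation times above are reported as a mean with a 95% confidence interval. As a hedged illustration only, the sketch below shows one way such an interval can be computed from repeated timing runs; the normal-approximation interval and the sample data are assumptions, not the paper's measurement procedure.

```python
import statistics

def mean_with_ci95(samples):
    """Mean and half-width of an approximate 95% confidence interval
    (normal approximation, 1.96 * standard error of the mean)."""
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / len(samples) ** 0.5
    return mean, 1.96 * sem

# Example with hypothetical per-run path-finding times in seconds:
times = [1.91, 1.87, 2.02, 1.95, 1.84]
m, ci = mean_with_ci95(times)
print(f"{m:.3f} +/- {ci:.3f} s (95% confidence level)")
```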
In principle, the simulation results show that MQL can be applied to several robots with the same task operating in the same area. However, the algorithm can only provide a maximum of four path variations, owing to the assumption that the robot can only move forward, backward, left, and right. In a real implementation, the robot may have more flexibility to move in other directions. Further study is required to analyze whether this additional flexibility results in more path variations. In addition, the parameters of the achievement motivation model may need to be re-evaluated for this purpose.

IV. CONCLUSION
We have presented the MQL algorithm, which utilizes an achievement motivation model to find safe path variations in an unknown environment. The achievement motivation succeeded in influencing the reward values of the states used in previous paths. This reward update makes such a state behave like an obstacle, so that MQL avoids it and finds other states for a new route, avoiding collisions with the previous paths. The simulation results show that the MQL algorithm generated 2 to 4 safe path variations (Jaccard similarity = 0%). In contrast, the Q-Learning algorithm tends to produce the same path for each robot, which creates potential collisions. However, the computation time of MQL is slightly longer than that of Q-Learning. In principle, the simulation results show that MQL can be implemented on multiple robots with the same goal in the same area. We hope MQL can solve multi-robot path planning problems by reducing the possibility of collisions as well as decreasing the problem of queues. The algorithm can only provide a maximum of four safe path variations because the robot can only move in four directions, i.e. forward, backward, left, and right. Further study is needed to add flexibility to the robot movement in other directions (i.e. forward-left, forward-right, backward-left, and backward-right) and to analyze whether this additional flexibility will result in more path variations.

Fig. 4 .
Fig. 4. The graph of the r_ach value for P and K changes at S_ach = 1

Fig. 5 .
Fig. 5. The graph of the r_ach value for P and K changes at S_ach = 2

Fig. 6 .
Fig. 6. The graph of the r_ach value for P and K changes at S_ach = 3

Fig. 10 .
Fig. 10. An example of the scenario 2 simulation route with seven rectangular obstacles

Fig. 11 .
Fig. 11. The number of safe path variations in the area with randomized obstacles

Fig. 12 .
Fig. 12. An example of the scenario 3 simulation route with randomly shaped obstacles

TABLE I .
THE CHANGE IN THE VALUE OF r_ach AT S_ach = 1

TABLE II .
THE CHANGE IN THE VALUE OF r_ach AT S_ach = 2

TABLE III .
THE CHANGE IN THE VALUE OF r_ach AT S_ach = 3

TABLE V .
DATA FROM THE SIMULATION RESULTS IN SCENARIO 1

TABLE VIII .
DATA FROM THE SIMULATION RESULTS IN SCENARIO 2

TABLE X .
THE NUMBER OF SAFE PATHS IN THE MOTIVATED Q-LEARNING ALGORITHM SIMULATION IN SCENARIO 3

TABLE XI .
DATA FROM THE SIMULATION RESULTS IN SCENARIO 3