Addressing Challenges in Dynamic Modeling of Stewart Platform using Reinforcement Learning-Based Control Approach

Abstract—In this paper, we focus on enhancing the performance of the controller used in the Stewart platform by investigating the dynamics of the platform. Dynamic modeling is crucial for control and simulation, yet challenging for parallel robots like the Stewart platform due to their closed-loop kinematics. We explore classical methods to solve its inverse dynamical model, but conventional approaches face difficulties, often resulting in simplified and inaccurate models. To overcome this limitation, we propose a novel approach: replacing the classical feedforward inverse dynamic block with a reinforcement learning (RL) agent, which, to our knowledge, has not previously been applied in the context of Stewart platform control. Our proposed methodology utilizes a hybrid control topology that combines RL with existing classical control topologies and inverse kinematic modeling. We leverage three deep reinforcement learning (DRL) algorithms and two model-based RL algorithms to achieve improved control performance, highlighting the versatility of the proposed approach. By incorporating the learned feedforward control topology into the existing PID controller, we demonstrate enhancements in the overall control performance of the Stewart platform. Notably, our approach eliminates the need for explicit derivation and solving of the inverse dynamic model, overcoming the drawbacks associated with inaccurate and simplified models. Through several simulations and experiments, we validate the effectiveness of our reinforcement learning-based control approach for the dynamic modeling of the Stewart platform. The results highlight the potential of RL techniques in overcoming the challenges associated with dynamic modeling in parallel robot systems, promising improved control performance. This enhances accuracy and reduces the development time of control algorithms in real-world applications. Nonetheless, it requires a simulation step before practical implementation.


I. INTRODUCTION
The primary goal of this study is to address the dynamical modeling challenges encountered in the control of the Stewart platform by utilizing a reinforcement learning (RL) approach. The Stewart platform, widely used in applications such as flight simulators, driving simulators, and vibration testing for large structures, poses unique control difficulties due to its complex dynamics [1]-[3]. By employing an RL approach, the study aims to develop a solution that can effectively handle the complex dynamical modeling requirements of the Stewart platform. One of the goals in the field of artificial intelligence is to tackle complex problems by leveraging high-dimensional sensory data [4]. RL is a specialized branch of machine learning (ML) that revolves around an agent's interaction with its environment, guided by a policy to maximize future rewards [5]. The agent's objective is to maximize the cumulative sum of these rewards, with the Bellman equation serving as the foundation for defining optimal behavior. The agent's learning process is driven by a reward-penalty scheme, where the quality of the actions selected from the policy space determines the outcomes [6]. In optimal control theory, a perfect system model with a comprehensive description is typically assumed [7]. However, such models often suffer from modeling errors, uncertainties, and computationally expensive approximations. In contrast, RL operates directly on measured observations, encompassing the uncertainties and nonlinearities inherent in the system. Consequently, when dealing with complex systems and situations where classical analytical methods yield inadequate control performance, RL is a favorable choice [8]-[12].
Within the literature, numerous neural network methodologies have been proposed to address the forward kinematics of the Stewart platform, which is complex due to a set of highly nonlinear equations [13]-[16]; in terms of RL, however, there are only a few studies. In a recent study [17], deep RL algorithms were used to tune the gain parameters of a PID controller, facilitating continuous learning and tuning of the controller's parameters. Different from that study, we show how to learn the feedforward dynamical control block using deep RL instead of only tuning PID control gains. In pursuit of this objective, we first delve into a comprehensive study of the platform's dynamics. If the dynamical model of the system were known precisely, we could perfectly control the platform. The kinematic model represents how the platform moves, but the dynamic model describes why the platform moves [18]. Control and simulation rely greatly on dynamic modeling, as it plays a key role in both domains. Unlike serial robots, the dynamic modeling of parallel robots is complicated by the closed-loop kinematics inherent in their design [10], [19].
There have been various suggestions for conducting dynamic analysis on parallel manipulators. One commonly used approach is the traditional Newton-Euler formulation, which is also employed to analyze the dynamics of general parallel manipulators. This formulation offers a framework for assessing the forces and torques acting on the system, allowing a comprehensive understanding of its dynamic behavior [20], [21]. To accurately describe the system dynamics within this formulation, it is crucial to derive the equations of motion for each leg and the moving platform. A preferable approach for achieving the dynamic formulation involves carefully selecting a set of independent generalized coordinates and subsequently deriving the dynamic equations using these coordinates and their corresponding time derivatives. Typically, these coordinates correspond to the pose of the moving platform. To achieve this type of formulation, it is necessary to eliminate internal forces and other passive joint variables. However, this process results in a large number of equations, which can negatively impact computational efficiency. The Lagrangian formulation, in turn, proves highly effective in eliminating undesired reaction forces. However, the closed-loop structure of parallel manipulators imposes constraints that make it challenging and impractical to obtain explicit equations of motion using a set of independent generalized coordinates [22], [23].
As articulated, dynamical modeling has its fair share of challenges. Here, we demonstrate one method to enhance comprehension of the subject matter. We derive the dynamical equations of motion using the principle of virtual work and the notion of link Jacobian matrices, as explained in [24]. Python libraries for symbolic mathematics (SymPy [25]) and numerical computing (NumPy [26]) are used to formulate and solve the equations of motion presented in [24]. The primary challenge is to formulate the equations; we then solve them by integrating forward in time. We show that deriving the dynamical model is a prohibitive task (see Appendix A), and the final model is inaccurate and contains many simplifications; consequently, it is not suitable for real-time applications. While existing literature has made some attempts to address feedforward control methods using a reinforcement learning approach [27], [28], our current knowledge indicates a notable absence of such investigations in the domain of the Stewart platform. Therefore, we replace the classical feedforward inverse dynamic block with an RL agent that applies the required actions, which here are the leg forces, for different trajectory states. We present an RL control topology that benefits from existing classical control topologies, inverse kinematic modeling, and the inverse dynamics of the system. We use RL in a hybrid mode that helps to increase the control performance.
Due to the complexity of dynamic modeling, obtaining its derivation through conventional methods is challenging. Consequently, many opt to omit dynamic models in feedforward control, relying solely on feedback control in real-world applications. The proposed approach has the potential to overcome this challenge. RL also offers a dynamic approach that adapts to complex and nonlinear dynamics, mitigating the shortcomings of classical feedforward inverse dynamic blocks. By leveraging RL, the control system becomes more adept at learning optimal strategies, thereby enhancing precision while concurrently reducing the development time traditionally associated with precise control algorithms [29]. This explicit integration of RL directly tackles the identified challenges and presents a promising avenue for improving Stewart platform control in real-world applications.
We benefit from three deep reinforcement learning (DRL) algorithms and two model-based RL algorithms. We first employ three DRL algorithms, the asynchronous advantage actor-critic (A3C) algorithm [30], the deep deterministic policy gradient (DDPG) approach [31], and the proximal policy optimization (PPO) technique [32], to send force actions directly to the six leg motors alongside the PID controller force output. We then try two model-based RL algorithms, namely probabilistic inference for learning control (PILCO) [33] and model-based policy optimization (MBPO) [34], first to learn the dynamic model of the entire system and then to utilize it to control the Stewart platform as a feedforward controller.
In a second attempt to improve the work carried out in [17], we propose a hybrid RL algorithm to learn a dynamical model of the system, resulting in greater sample efficiency, well-suited to real-world applications like robotics [35], [36]. Even though model-free RL algorithms have succeeded in many areas, like video games and robotics, high sample complexity can limit their usage to simulated environments [31], [37]-[39]. Model-based RL algorithms use significantly fewer samples. Model-based methods extract more valuable information and are more data efficient than model-free algorithms. However, they suffer from model bias, meaning the model assumes it learned the environment's dynamics accurately; a poorly learned model may result in poor performance [34]. Regarding the search for the most effective strategy for the control of robot manipulators, DRL has demonstrated its efficiency. Nonetheless, there is considerable scope for enhancing its applicability in controlling parallel robots, motivating us to explore ways to fill this gap. The current paper's contributions are driven by the need for a dynamic model that enhances the control performance of parallel robots while reducing the complexity of deriving the system's dynamic model. To summarize, these contributions, facilitated by the RL algorithm, can be outlined as follows:
1) Enhancing the efficacy of a conventional control loop applied to parallel robots through the implementation of a suggested RL-based feedforward loop. The integration of a predictive RL agent within the feedforward control loop, in conjunction with the classical control loop, augments the collective control performance.
2) Simplifying the requirements for deriving the classical dynamic model used in the feedforward control loop. The suggested RL topology eliminates the necessity to formulate a complex dynamical model, which is also a time-intensive task.
3) Performing a performance comparison with five RL algorithms to explore the control capability of a parallel robot. We employ a combination of model-free and model-based RL algorithms and subsequently conduct a comparative analysis, explaining the respective advantages and disadvantages within the context of the Stewart platform.
The rest of this paper is organized as follows: after reviewing the kinematics and a common control strategy for the Stewart platform in Section II, we derive a dynamic model of the platform via a classical method in Section III. In Section IV we present our approach to replacing the feedforward dynamical model with an RL agent. The experimental setup and results are provided in Section V. The final observations of the paper and possible future works are presented in Section VI.

II. KINEMATICS AND CONTROL STRATEGY OF STEWART PLATFORM

Calculation of the inverse kinematics of the Stewart platform is straightforward [40]. Therefore, we perform inverse kinematic modeling to improve the RL agent and the final control performance. The explicit mathematical nature of the inverse kinematics of parallel robots accelerates the learning process for the RL agent: by providing precomputed solutions, this approach substantially reduces the time and computational resources required for the RL agent to deduce these kinematic relationships independently. Fig. 1 illustrates a drawing of the kinematics and coordinate system of the Stewart platform.
Fig. 1. Drawing of the kinematics and coordinate system of the Stewart platform [17]

There are two coordinate systems, the base frame Bxyz and the moving platform frame Mxyz. Since we have six legs in the Stewart platform design, there are six attachment points on both the base and the moving platform. In the inverse kinematics, the goal is to calculate the length of each leg {l1, l2, l3, l4, l5, l6} given the pose of the moving platform, that is, the position vector P = [Px Py Pz] and the orientation vector O = [ϕ θ ψ]. We can calculate the connecting point coordinates through Equation (1).
Therefore, we have all attachment point coordinates (Bi and Mi) for the given separation angles υbi in the base platform and υmi in the moving platform. However, we calculated each attachment point of the moving platform in its own coordinate system, whereas, given the position and orientation of the moving platform, we want to calculate each point's coordinates with respect to the base. Coordinate transformations simplify mathematical expressions, decouple limb motions, and provide a unified representation. In order to express the moving platform coordinate frame with respect to the base platform, we convert the pose of the end-effector using a translation position vector P, Equation (2a), and a rotation matrix BRT, Equation (2b) [41]. Equation (3) calculates the leg vector using the position and orientation matrices. The actuator length is then obtained as li = ||Li||.
where Mi and Bi are the attachment point coordinates in the moving and base platform frames, respectively.
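As a concrete illustration of this inverse kinematic calculation, the following sketch computes the six leg lengths from a given pose. The Z-Y-X Euler convention used here is an assumption; the paper's rotation matrix BRT in Equation (2b) may use a different sequence.

```python
import numpy as np

def leg_lengths(P, O, B, M):
    """Inverse kinematics sketch: leg lengths for a given platform pose.

    P : (3,) position of the moving platform in the base frame.
    O : (3,) orientation [phi, theta, psi] in radians.
    B : (6, 3) attachment points on the base, base frame.
    M : (6, 3) attachment points on the moving platform, platform frame.
    """
    phi, theta, psi = O
    cx, sx = np.cos(phi), np.sin(phi)
    cy, sy = np.cos(theta), np.sin(theta)
    cz, sz = np.cos(psi), np.sin(psi)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    R = Rz @ Ry @ Rx                  # platform orientation in the base frame
    L = (R @ M.T).T + P - B           # leg vectors L_i = P + R*M_i - B_i
    return np.linalg.norm(L, axis=1)  # l_i = ||L_i||
```

With the platform directly above a congruent base and no rotation, every leg length equals the heave height, which provides a quick sanity check of the geometry.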
After the inverse kinematic calculation, we want to control the platform, that is, to make the moving platform reach the desired reference pose. Plenty of control techniques have been developed for robotic manipulators, such as nonlinear and multi-input/multi-output schemes for multiple-degree-of-freedom robots [42]-[45]. However, in industry, controllers usually control individual joints linearly through drivers [46]. Fig. 2 shows one of the known control topologies of the Stewart platform, which uses the length measurement of each leg as feedback in the closed-loop system [47]. In Fig. 2, the parameter q is the linear displacement of an actuated prismatic joint, equal to each leg's length, and qd is the desired linear displacement of the legs, calculated by the inverse kinematic model. Instead of task-space control, which measures the position and orientation of the end-effector, the controller is implemented in the joint space, where we calculate the joint-space error eq. This control topology converts the desired trajectory to the desired actuator lengths through the inverse kinematic model, which we derived mathematically. Then, the controller calculates the required actuator torque τ. Having the inverse dynamic model of the system has the potential to enhance the efficiency of the controller, as illustrated in the control topology. Nonetheless, in Section III we demonstrate that obtaining and solving the dynamic model of the Stewart platform is intricate. To address this complication, as outlined in Section IV, we substitute this loop with an RL agent.
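The joint-space feedback loop described above can be sketched as a per-leg PID on the length error eq = qd - q. The gains and sample time in this sketch are illustrative assumptions, not the values used on the platform in Fig. 2.

```python
import numpy as np

class JointSpacePID:
    """Per-leg PID in joint space: tau = Kp*e + Ki*integral(e) + Kd*de/dt."""

    def __init__(self, kp, ki, kd, dt, n_legs=6):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = np.zeros(n_legs)   # accumulated error per leg
        self.prev_e = np.zeros(n_legs)     # previous error, for the D term

    def step(self, q_d, q):
        e = q_d - q                        # joint-space error e_q
        self.integral += e * self.dt
        de = (e - self.prev_e) / self.dt
        self.prev_e = e
        return self.kp * e + self.ki * self.integral + self.kd * de
```

In the topology of Fig. 2, qd comes from the inverse kinematic model and the returned vector is the actuator command τ for the six legs.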

III. DERIVING AND SOLVING DYNAMICAL EQUATIONS OF MOTION
To derive the dynamical equations of motion, the principle of virtual work and the notion of link Jacobian matrices, as outlined in [24], are employed and derived in Appendix A. As demonstrated in the appendix, the main challenge is to formulate the equations. To address this, we employ SymPy to methodically compose and derive these equations in a concise and organized manner. The code is accessible in the open-source repository provided by [48]. Despite leveraging the SymPy library and its advantages for handling the formulation, the derivation remains intricate. The task of deriving the inverse dynamics of the Stewart platform, with considerations like the elimination of joint frictions, is notably laborious. Once we engage in experiments involving the simulated Stewart platform in Gazebo [49], it becomes evident that its dynamic model is considerably more detailed and intricate than the one we formulated mathematically: it contains features like friction, damping, and other dynamical effects. We ignored most of these features when modeling the Stewart platform's inverse dynamics; otherwise, formulating and solving the dynamic equations would be a much more challenging task.
Having derived this formulation, the next step is to solve it in real time to enable its application.

A. Solving Dynamical Equations of Motion
We solve the dynamical equations using numerical methods. To solve the equations, we define a trajectory for the moving platform; that is, we need to specify X, Ẋ, and Ẍ. The algorithm used to solve the inverse dynamics for specific points is described in Algorithm 1:

Algorithm 1: Numerical calculation of the inverse dynamic for the given point
1) Calculate the Jacobians: Jp, Jx, and Jy are calculated for the given X and the inertial properties.
2) Calculate the forces: Fp, Fx, Fy, and Fz are calculated for the given X, Ẋ, Ẍ, and the inertial properties.
3) Calculate the leg forces: the required leg forces are calculated via the inverse dynamic Equation (45).
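The steps of Algorithm 1 can be sketched as follows. The `jacobians` and `forces` callables are hypothetical stand-ins for the SymPy-generated expressions of Appendix A; only the overall flow is taken from the algorithm above.

```python
import numpy as np

def inverse_dynamics_step(X, Xd, Xdd, jacobians, forces):
    """One pass of Algorithm 1 for a single trajectory point.

    X, Xd, Xdd   : pose, velocity, and acceleration of the moving platform.
    jacobians(X) : returns (Jp, Jx, Jy), placeholders for the link Jacobians.
    forces(...)  : returns the generalized force vector assembled from
                   F_p, F_x, F_y, F_z (a stand-in for the symbolic terms).
    """
    Jp, Jx, Jy = jacobians(X)              # step 1: Jacobians for the given X
    F = forces(X, Xd, Xdd, Jx, Jy)         # step 2: generalized forces
    tau = np.linalg.solve(Jp.T, F)         # step 3: leg forces via Equation (45)
    return tau
```

The linear solve in step 3 reflects the usual mapping between generalized forces and leg forces through the platform Jacobian; the exact matrix in Equation (45) is defined in Appendix A.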

IV. FEED FORWARD CONTROL VIA REINFORCEMENT LEARNING
The disadvantage of the method that uses the inverse dynamic model and a classical feedback control loop structure, as presented in Fig. 2, is the complexity of both deriving and solving the equations. Determining the inverse dynamics is a tough task. In addition, following various simplifications, it becomes necessary to once again linearize this model around a designated equilibrium point for practical applications [51], [52]. Yet, this process is task-specific, and we need to repeat all steps whenever changes are made to the design and structure of the platform. In this section, we present an RL control topology that benefits from existing classical control topologies and inverse kinematic modeling. We use RL in a hybrid mode that helps to increase the control performance, as presented in Fig. 7.
We form the RL and control blocks similar to the RL setup experimented with in [17]. However, we change the action space completely, from changing PID gains to applying a force to each leg. As shown in Fig. 7, we also benefit from the inverse kinematic modeling, the feedback control loop, and the PID controller. The essential components of the RL setting are as follows:

1. Action space: The action space at time step t can be expressed as:

2. State space: We define our problem states as in Equation (9), where X is the pose vector of the end-effector at time step t.
It is worth noting that the pose velocity Ẋ and the difference ∆X between the target pose of the end-effector and the actual pose could be considered as additional state variables. However, increasing the dimensionality of the state space can lead to slower convergence of the algorithms [53], [54]. Instead, we utilize ∆X directly in the reward setup and use the pose velocity Ẋ to check whether the episode is done.
For the three model-free DRL algorithms, we define a reward function similar to the one introduced in [17]. We consider the well-known reaching task as our goal.

3. Reward setup:
We choose a quadratic reward function similar to the one in [55], given in Equation (10), with the modification that in case of any goal-reaching failure or instability of the system, the agent is penalized significantly with a value of −1000.
In Equation (10), V ∈ R^{4×4} is a diagonal positive definite weight matrix for the errors, D = e_xyz is the distance to the goal, and δ is defined as a threshold distance to handle the falling of the platform. For the reward function defined in Equation (10), with the desired pose having index d, the error vector is calculated as ∆X = [e_xyz, e_ϕ, e_θ, e_ψ], containing all the error terms given in Equations (11a), (11b), (11c), and (11d).
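A minimal sketch of this reward setup, under our reading of Equation (10); the sign convention (negative quadratic error plus a large failure penalty) is an assumption, and the paper's equation is the authoritative form.

```python
import numpy as np

def quadratic_reward(dX, V, D, delta):
    """Quadratic reward with a failure penalty, sketched from Equation (10).

    dX    : error vector [e_xyz, e_phi, e_theta, e_psi].
    V     : 4x4 diagonal positive definite weight matrix.
    D     : Euclidean distance to the goal (D = e_xyz).
    delta : threshold distance beyond which the platform counts as fallen.
    """
    if D > delta:
        return -1000.0          # goal-reaching failure / instability penalty
    return -float(dX @ V @ dX)  # negative weighted quadratic error
```

At the target the reward is zero, it decreases quadratically with the pose error, and any excursion past δ terminates the episode with the −1000 penalty.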
The three DRL algorithms utilize a parameterized policy function to enable the use of continuous actions, i.e., the ongoing adjustment of the leg force values. This policy function takes the robot's state X as input and generates continuous leg forces as outputs using its corresponding neural network. Nevertheless, directly learning the policy network comes with inherently high variance. This challenge prompted the introduction of actor-critic methods [56]-[58]. In this paradigm, the "critic" assesses the value function Q(s, a|W_Q) using the Bellman equation, akin to Q-learning. Subsequently, the "actor," parameterized by the function µ(s|W_µ), adjusts the policy distribution following the guidance provided by the critic. While these three DRL algorithms share the foundational actor-critic framework, they diverge in their architectures, which we elucidate below.
The first algorithm we consider is the DDPG approach, an adaptation of the deterministic policy gradient (DPG) algorithm [59] that incorporates deep learning principles. DDPG is an off-policy, model-free DRL algorithm capable of acquiring proficient policies for various tasks, even when presented with low-dimensional observations like joint angles and Cartesian coordinates [31]. Implementing DDPG involves a relatively simple actor-critic architecture, employing parameterized actor and critic functions µ(s|W_µ) and Q(s, a|W_Q). To enhance learning stability, DDPG incorporates target networks for both the actor and critic functions. This entails creating duplicates of the actor and critic networks, referred to as µ′(s|W_µ′) and Q′(s, a|W_Q′), respectively. The next algorithm is A3C, an on-policy, model-free DRL approach. A notable benefit of A3C lies in its utilization of parallel actor-learners, which contribute to stabilizing the training process. A3C has proven successful in addressing a diverse range of continuous motor control challenges [30]. While DDPG is trained off-policy using samples from a replay buffer to mitigate sample correlations, A3C employs an alternative approach: instead of relying on experience replay, it simultaneously runs multiple agents asynchronously across multiple instances of the environment, which reduces data correlation. For our third DRL algorithm, we utilize PPO, a data-efficient and dependable variant derived from the trust region policy optimization (TRPO) approach [60]. Like the prior algorithms, PPO is a model-free method; it is on-policy and follows a pattern of gathering data from the policy and then undergoing multiple optimization epochs to enhance the policies.
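DDPG's target networks µ′ and Q′ typically track the learned networks through Polyak averaging. A minimal sketch of that update, with plain lists of floats standing in for weight tensors and an illustrative smoothing factor:

```python
def soft_update(target_params, source_params, tau=0.005):
    """Polyak update W' <- tau*W + (1 - tau)*W', applied elementwise.

    target_params : current target-network weights.
    source_params : corresponding learned-network weights.
    tau           : small smoothing factor (illustrative value).
    """
    return [tau * w + (1.0 - tau) * w_t
            for w_t, w in zip(target_params, source_params)]
```

Because τ is small, the targets change slowly, which is what stabilizes the bootstrapped critic updates.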
In the subsequent algorithms, we explore two model-based RL approaches. However, a fundamental question arises: why should we consider employing a model-based RL algorithm? In general, an agent in RL can make decisions in two main ways, model-based and model-free. In model-based RL, the agent utilizes its model to decide what action to take, whereas in model-free RL, the agent tries to learn the optimum policy without having a model. The main question is whether we need such a model to take action. Model-free RL algorithms have succeeded in many areas, like video games and robotics, in recent years [31], [37], [38]. However, high sample complexity mostly limits their usage to simulated environments. On the other hand, model-based RL algorithms use significantly fewer samples than model-free ones. Therefore, by learning a dynamical system model, we expect sample efficiency, which is very important in real-world applications like robotics [36]. In general, model-based methods extract more valuable information and are more data efficient than model-free algorithms. However, they suffer from model bias, meaning the agent thinks it accurately learned the environment's dynamics, whereas a poorly learned model results in poor performance. Many model-based RL algorithms have tried to address model bias in different ways. Many successful machine learning applications are based on data augmentation [61]-[63]. Sutton in [64] presents the model-based Dyna algorithm, in which a model is learned in a supervised manner through collected data, and new data are generated under the model; the policy improvement then utilizes the model data. But, as a problem of model-based RL, modeling errors can cause diverging temporal-difference updates. In Dyna and the standard RL framework, we want to maximize the expected return η[π] = E_π[Σ_t γ^t r(s_t, a_t)] from acting according to policy π in the environment under some dynamics p. However, the learned model is commonly reliable only as a single-step predictive model, and the modeling errors can sum up over a long horizon, resulting in poor performance for long rollouts.
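The compounding of one-step model error over a long rollout can be illustrated with a toy linear system whose learned dynamics are slightly biased; all coefficients here are illustrative, not learned models.

```python
def rollout_error(true_a=1.00, model_a=1.02, x0=1.0, horizon=50):
    """Propagate x_{t+1} = a * x_t under the true coefficient and under a
    slightly wrong learned one; return the gap after `horizon` steps.
    The gap grows with the rollout length even for a tiny per-step bias."""
    x_true, x_model = x0, x0
    for _ in range(horizon):
        x_true *= true_a
        x_model *= model_a
    return abs(x_model - x_true)
```

This is exactly why short model rollouts, as used by MBPO below, accumulate far less model error than full-length rollouts.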
We first experiment with a model-based RL approach using the MBPO algorithm, which utilizes short rollouts from the predictive model, rather than full-length rollouts starting from the initial state distribution, to gather data and update the policy [34]. MBPO addresses three key aspects of the algorithm: the parameterization of the predictive model p_θ, the policy optimization π based on model samples, and the method of querying the model for samples. In MBPO, the predictive model utilizes a bootstrap ensemble of dynamics models, denoted as p_θ^1, ..., p_θ^B, which are probabilistic neural networks that generate Gaussian parameterizations [34]. To ensure diversity in the dynamics models, MBPO uniformly selects one predictive model at random from the ensemble, allowing the sampling of different dynamics models. For policy optimization, MBPO employs the soft actor-critic (SAC) algorithm [37]. When the horizon span k is short, MBPO uses the predictive model to conduct multiple short rollouts, which helps generate a substantial collection of model samples for policy optimization. This comprehensive set of model samples enables MBPO to take multiple policy gradient steps per environment sample, exceeding the capabilities of model-free algorithms [34].
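The MBPO data-generation scheme described above can be sketched as follows; the `ensemble` and `policy` callables are hypothetical stand-ins for the probabilistic networks p_θ^1, ..., p_θ^B and the SAC policy, and only the branching/short-rollout structure is taken from [34].

```python
import random

def short_model_rollouts(env_states, ensemble, policy, k, n_rollouts):
    """MBPO-style synthetic data: branch k-step rollouts from real states,
    each step querying one model drawn uniformly from the ensemble.

    env_states : states collected from the real environment.
    ensemble   : list of callables model(s, a) -> next_state.
    policy     : callable pi(s) -> action.
    k          : short rollout horizon.
    """
    synthetic = []
    for _ in range(n_rollouts):
        s = random.choice(env_states)        # start from a real visited state
        for _ in range(k):                   # short horizon limits model error
            a = policy(s)
            model = random.choice(ensemble)  # uniform ensemble member
            s_next = model(s, a)
            synthetic.append((s, a, s_next))
            s = s_next
    return synthetic
```

The returned transitions augment the real data for the policy optimizer, which is how MBPO takes many gradient steps per real environment sample.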
As a second model-based RL approach, we experiment with PILCO, which uses its observed samples efficiently. The problem with model-based methods is model error. In PILCO, the dynamic model of the system is approximated by Gaussian processes. In this manner, PILCO addresses the model-bias problem of model-based RL while using fewer data samples. The assumption in PILCO, and in model-based RL generally, is that we do not have prior or expert knowledge about the model of the system, such as differential equations for the dynamics; instead, we want to learn the model from scratch.
PILCO utilizes non-parametric probabilistic Gaussian processes (GPs) as the basis for its dynamic model, considering model uncertainties as noise in the system. To account for these uncertainties during planning and policy evaluation, PILCO incorporates them into its framework. For policy search and update, PILCO employs an analytic policy gradient method, enabling effective optimization of the policy [65]. PILCO considers the model dynamics as in Equation (13). The system under consideration involves continuous-valued states x ∈ R^D and controls u ∈ R^F. The transition dynamics of this system, represented by the function f, are not known a priori. The objective of policy improvement is to discover a deterministic policy or controller π, which maps states x to actions u such that π(x, θ) = u. The goal is to minimize the expected return associated with the policy, where c(x_t) is the negative reward, or cost, of being in state x_t at time t.
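For reference, the standard PILCO formulation from [33], [65], which we take to be the content of Equation (13) and the minimized return (our reconstruction, not reproduced verbatim from this paper):

```latex
% GP dynamics model (Equation (13), standard form)
x_{t+1} = f(x_t, u_t) + \varepsilon, \qquad
\varepsilon \sim \mathcal{N}(0, \Sigma_\varepsilon), \qquad
f \sim \mathcal{GP}

% Expected return minimized over the policy parameters \theta
\min_{\theta}\; J^{\pi}(\theta) = \sum_{t=0}^{T} \mathbb{E}_{x_t}\!\left[ c(x_t) \right],
\qquad u_t = \pi(x_t, \theta)
```

Here the expectation over x_t is propagated analytically through the GP posterior, which is what makes the policy gradient of J^π(θ) computable in closed form.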
Regarding policy optimization, PILCO aims to maximize the expected cumulative reward within a finite time horizon, utilizing the learned Gaussian process (GP) model for each potential policy. This involves simulating the system forward in time and calculating the expected cumulative reward. The optimization problem is subsequently solved using gradient-based techniques such as the conjugate gradient or L-BFGS-B method. After optimization, PILCO continues collecting the dataset by executing the policy and updating the probabilistic model of the system dynamics. This allows the algorithm to learn effectively with relatively little data, which is particularly useful when data collection is expensive or time-consuming. However, we note that PILCO is computationally very expensive in its model and policy optimization steps. In terms of the exploration-exploitation trade-off, PILCO employs a saturating cost function that facilitates natural exploration when the predictions are distant from the target [65]: the policy explores more actively where predictions are far from the target, whereas, when predictions are close to the target, the policy remains close to the learned trajectory and focuses on exploitation. The fast learning speed of PILCO makes it suitable for controlling real-world applications such as robotics. However, it is worth noting that PILCO is currently limited to episodic setups.

V. EXPERIMENTS SETUP AND RESULTS
We experiment on the similar simulated Stewart platform presented in [17] with our newly proposed algorithm, where the same inertial properties of the experimented platform are shown in Table II.The inertial characteristics impact how the platform responds to external forces and influences the overall behavior that the learning process must adapt to.Our objective is to guide the moving platform to achieve the target pose (reaching task) defined as x = 0, y = 0, z = 1.1, ϕ = 0, θ = 0, ψ = 30 , originating from the initial state characterized by the pose x = 0, y = 0, z = 0.2, ϕ = 0, θ = 0, ψ = 0 .This transition entails a heave motion of 90 cm and a yaw rotation of 30 deg for the platform.It's worth noting that these specific poses were experimented with in a prior study [17], and we opted for the same target to facilitate comparative analysis and validate the effectiveness of the proposed methodology.Across all conducted experiments, we establish the episode count at 500, with a maximum of 200 steps allowed per episode.These values were determined based on a careful consideration of the learning process and computational efficiency.The choice of 500 episodes allows for a sufficiently iterative learning process, allowing the reinforcement learning agent to adapt and refine its strategies over a substantial number of training iterations.Meanwhile, setting a maximum of 200 steps per episode is a balance between capturing complex learning scenarios and managing computational resources effectively.At the beginning of each Hadi YADAVARI, Addressing Challenges in Dynamic Modeling of Stewart Platform using Reinforcement Learning-Based Control Approach ISSN: 2715-5072 125 episode, the robot is positioned in the initial pose, while the agent's distance threshold is defined as δ = 1.5 m.Within each learning episode, the platform is subject to a penalty if it deviates significantly from the desired final pose (surpassing a distance of 1.5 m).The penalty value is configured as 
−1000, as indicated in Equation (10). This penalty value was chosen to strongly discourage undesired actions and deviations during the learning process, and the distance threshold was determined in alignment with the comparable setup of [17]. The weight matrix for the error is diagonal. The neural network structure for all three DRL algorithms and their respective hyperparameters are detailed in Table III and Table IV, respectively. In the case of MBPO, the network architectures and hyperparameters are presented in Table V. For PILCO, a similar reward function configuration is used: its cost function imposes a penalty based on the Euclidean distance between the current state and the target state. Although the reaching task defines a reward, PILCO's policy optimization does not use it directly; only distance penalties are employed to address the task, as specified in the PILCO paper [33]. In the PILCO setup, reaching the target at high speed often leads to overshooting, resulting in elevated long-term costs. Therefore, the −1000 reward for a failed task is not directly considered in PILCO's policy optimization, but its effect appears through the large Euclidean distance from the target in case of failure. We determine the specific hyperparameters used in PILCO according to our experimental setup.
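The reward configuration described above can be sketched as follows. This is a minimal illustration, assuming a quadratic penalty on the weighted pose error with unit diagonal weights; the function and constant names are ours, not the paper's implementation.

```python
import numpy as np

# Illustrative reward for the reaching task: negative weighted squared
# pose error, and a large fixed penalty when the platform drifts more
# than delta = 1.5 m from the target pose (episode failure).
TARGET = np.array([0.0, 0.0, 1.1, 0.0, 0.0, np.deg2rad(30)])  # x, y, z, phi, theta, psi
W = np.diag([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])  # diagonal weight matrix (values assumed)
DELTA = 1.5           # distance threshold [m]
FAIL_PENALTY = -1000.0

def reward(pose: np.ndarray) -> tuple[float, bool]:
    """Return (reward, done) for the current 6-DOF pose."""
    err = pose - TARGET
    if np.linalg.norm(err[:3]) > DELTA:   # translational deviation too large
        return FAIL_PENALTY, True         # fail the episode
    return -float(err @ W @ err), False   # negative weighted squared error

r, done = reward(np.array([0.0, 0.0, 0.2, 0.0, 0.0, 0.0]))  # initial pose
```

With unit weights, the initial pose yields a reward dominated by the 0.9 m heave error, and any pose more than 1.5 m away terminates the episode with the −1000 penalty.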
The overall control performance of the three model-free algorithms is more stable than the results reported in [17], with faster convergence. Comparing the quantitative time-domain response of the newly proposed method with that of [17], we observe an approximate 96% improvement in rise time and a 75% improvement in settling time, with an overshoot similar to that of the PILCO case in [17]. This underscores the substantial improvement achieved by integrating reinforcement learning (RL) into the proposed structure, particularly in the feedforward loop. Notably, the RL methodology applied in [17] focused solely on tuning the PID gains of the controller, whereas our approach learns a dynamic model; utilizing this acquired knowledge improves overall performance alongside the classical PID controller. PILCO shows the best performance, in terms of stability and convergence, against the three model-free algorithms and the other model-based RL algorithm. It learns the system's dynamics well and exploits them to optimize the policy. PILCO integrates a probabilistic model and incorporates uncertainty into its predictions, which enhances its adaptability to varying conditions and contributes to stability in learning tasks. Furthermore, its model-based approach refines its understanding of the system dynamics, leading to more efficient convergence. The only disadvantage of PILCO is that it is computationally very expensive; we show the GPU utilization results in Fig. 9. Regarding the time-response performance of the five RL algorithms, we run one episode with each trained agent and compare their responses. Fig. 10 illustrates the time-domain performance of the moving platform's states for the five algorithms. PILCO's model-based approach allows it to effectively capture system uncertainties, enabling better adaptability to dynamic changes in the environment. This adaptability contributes to PILCO's reduced rise time, as it can swiftly adjust its control policy in response to evolving conditions. Moreover, the incorporation of uncertainty modeling helps PILCO minimize steady-state error: by accounting for and mitigating uncertainties, PILCO maintains the desired pose with enhanced precision. The stability exhibited by PILCO can be linked to its comprehensive understanding of the system dynamics through its probabilistic, model-based approach, which lets it navigate the learning process with increased robustness and deliver more stable and reliable control performance. Table VII lists the time-domain performance of the five algorithms. In conclusion, PILCO performs better overall than the other algorithms. The convergence of MBPO is notably the weakest, even though it is more stable in the final steps than the three model-free algorithms. In general, model-based algorithms tend to achieve more stable convergence than their model-free counterparts. However, all algorithms fall short of PILCO in both time-domain performance and convergence. Although PILCO is sample-efficient and converges faster and better during learning, it is the most computationally expensive algorithm.
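The hybrid topology evaluated above, a classical PID feedback loop augmented by the learned feedforward term, can be sketched as below. All names are illustrative: `rl_policy` stands in for any of the five trained agents, and the PID gains are placeholders, not the paper's tuned values.

```python
import numpy as np

class PID:
    """Textbook discrete PID with rectangular integration."""
    def __init__(self, kp: float, ki: float, kd: float, dt: float):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_e = 0.0

    def __call__(self, e: float) -> float:
        self.integral += e * self.dt
        d = (e - self.prev_e) / self.dt
        self.prev_e = e
        return self.kp * e + self.ki * self.integral + self.kd * d

def control_step(pid_bank, rl_policy, q_des, q_meas, state):
    """One joint-space control step: PID feedback plus RL feedforward."""
    u_fb = np.array([pid(e) for pid, e in zip(pid_bank, q_des - q_meas)])
    u_ff = rl_policy(state)   # learned feedforward term replaces the
                              # classical inverse-dynamics block
    return u_fb + u_ff

# Usage: six leg controllers, a dummy zero feedforward policy.
pids = [PID(50.0, 1.0, 5.0, dt=0.01) for _ in range(6)]
u = control_step(pids, lambda s: np.zeros(6),
                 q_des=np.ones(6), q_meas=np.zeros(6), state=None)
```

The key design point is that the feedforward path is additive: removing `u_ff` recovers the baseline PID controller, so the learned term can only reshape, never replace, the stabilizing feedback.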
Acknowledging the computational expense associated with PILCO, future research should prioritize optimizing computational efficiency without compromising performance. Expanding the experiment to reaching an area rather than a precise pose could also yield insight into the platform's operational range, providing a more comprehensive understanding of its capabilities and limitations. Moreover, testing the learned reaching task in similar scenarios would validate its applicability and set the stage for the platform's broader utilization in real-world applications. Furthermore, introducing trajectory following as a task can deepen our comprehension of how RL algorithms navigate and engage with the Stewart platform environment; exploring this broader area of study could substantially improve our ability to control the platform for various applications. Finally, to validate learned behaviors in practice, extensive testing in controlled yet realistic settings is crucial. Simulating challenging conditions, such as turbulence in flight simulators (using the Stewart platform as a motion platform), can provide insight into the platform's adaptability. Collaborating with industry experts and conducting field trials in relevant environments will be instrumental in validating the learned behaviors and assessing the platform's effectiveness in addressing the complexities of real-world applications. Additionally, these findings may have broader applicability to other robotic systems or real-world scenarios, such as serial manipulators, due to the similar duality that exists in those contexts.

APPENDIX A DERIVING DYNAMIC FORMULATION
For consistency with the numbering of the equations, the generalized coordinates of the system are defined as (l_1, l_2, l_3, l_4, l_5, l_6), corresponding to the six joints of the platform. Each link in the system is associated with a reference frame attached to it; the two most essential reference frames are the base frame (B) and the moving platform frame (M). Their orientation matrix is defined in Equation (15), where the angles β, α, and γ represent successive rotations about the x-axis, y-axis, and z-axis, respectively.
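As a concrete sketch of this construction, the snippet below builds the platform orientation from the three successive rotations. It assumes the common fixed-axis composition R = R_z(γ) R_y(α) R_x(β); the exact order and convention are fixed by the paper's Equation (15), so treat this as illustrative.

```python
import numpy as np

# Elementary rotation matrices about the x, y, and z axes.
def rot_x(b):
    c, s = np.cos(b), np.sin(b)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(g):
    c, s = np.cos(g), np.sin(g)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def platform_rotation(beta, alpha, gamma):
    """Orientation of frame M relative to frame B from the successive
    rotations beta (x), alpha (y), gamma (z); composition order assumed."""
    return rot_z(gamma) @ rot_y(alpha) @ rot_x(beta)

# The 30-degree yaw used in the reaching task is a pure z rotation.
R = platform_rotation(0.0, 0.0, np.deg2rad(30))
```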
The matrix form of Equation (19), denoted as Equation (24), is expressed as follows. In Equation (24), the vector Ẋ_p represents the linear and angular velocities of the moving platform, and J_bi is defined according to Equation (25).
We can write Equation (20) in the form of Equation (26). Repeating Equation (26) for each of the six legs yields the matrix form given in Equation (27), where Equation (28) is known as the manipulator Jacobian matrix. Similarly, we can rewrite Equations (21), (22), and (23); combining Equations (29), (30), and (31) then yields an expression in which the link Jacobian matrices are those of Equations (34) and (35). With these Jacobian matrices available, we can proceed to solve the inverse dynamics problem, which involves determining the forces required to achieve a desired motion. As outlined in [24], the equations of motion are formulated using the principle of virtual work. Equation (36) gives the external and inertia forces acting on the center of mass of the moving platform.
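A small numerical sketch of how a manipulator Jacobian like that of Equation (28) can be assembled leg by leg: each leg's extension rate is the projection of its attachment-point velocity onto the leg's unit direction, giving the row [ŝ_iᵀ, (b_i × ŝ_i)ᵀ]. The geometry below is made up for illustration and is not the paper's platform.

```python
import numpy as np

def leg_jacobian_row(s_hat, b_i):
    """Row mapping the platform twist [v; omega] to one leg rate:
    l_dot = s_hat . v + (b_i x s_hat) . omega."""
    return np.concatenate([s_hat, np.cross(b_i, s_hat)])

def manipulator_jacobian(s_hats, b_pts):
    """Stack the six leg rows into the 6x6 manipulator Jacobian."""
    return np.vstack([leg_jacobian_row(s, b) for s, b in zip(s_hats, b_pts)])

# Toy geometry: six random unit leg directions and attachment points.
rng = np.random.default_rng(0)
s_hats = [v / np.linalg.norm(v) for v in rng.normal(size=(6, 3))]
b_pts = list(rng.normal(size=(6, 3)))
J_p = manipulator_jacobian(s_hats, b_pts)

twist = np.array([0, 0, 0.1, 0, 0, 0.05])  # heave rate + yaw rate
leg_rates = J_p @ twist                    # six leg extension rates
```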
In Equation (36), f_e and n_e represent the external force and moment, respectively, applied at the center of mass of the moving platform [24]. Similarly, we can consider the same forces for the cylinder and piston of each leg, assuming that the gravitational force is the only external force present. The principle of virtual work is then stated as in Equation (39), where, in the leg frame, we have the applied and inertia forces ^iF_1i and ^iF_2i and their corresponding virtual displacements δ(^iX_1i) and δ(^iX_2i).
To establish a relationship between these virtual displacements, we express them in terms of a set of generalized virtual displacements, defined as δq = J_p δX_p (40). By substituting Equations (40), (41), and (42) into Equation (39), we can derive the dynamics of the Stewart platform. Then, if J_p is not singular, we obtain the final inverse dynamic model shown in Equation (45).

As we explain in Section IV, the selected model-based RL algorithms address model bias differently. Like the three model-free algorithms, the two model-based RL algorithms are highly suitable for the continuous action-state spaces characteristic of the Stewart platform. PILCO relies on analytic gradient computation, and MBPO utilizes ensembles of models. By employing a probabilistic model (PILCO) and ensembles (MBPO), these two model-based algorithms are capable of achieving model-free performance with significantly fewer samples. To sum up, we experiment with five distinct RL algorithms within the proposed control framework. By adding the feedforward control topology learned by the model-free and model-based RL algorithms to the existing PID controller, we demonstrate a noticeable enhancement in the overall control performance of the Stewart platform.
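The final step behind Equation (45) amounts to a linear solve with J_pᵀ: the virtual-work balance equates the actuator forces, mapped through the manipulator Jacobian, with the platform and link wrenches mapped through their own Jacobians. The sketch below is a hedged numerical illustration; the exact sign conventions and grouping of terms are those of the paper's Equation (45), and all values here are placeholders.

```python
import numpy as np

def actuator_forces(J_p, F_p, link_jacobians, link_wrenches):
    """Actuator forces tau such that J_p^T tau balances the platform
    wrench F_p plus the link wrenches mapped through their Jacobians."""
    rhs = F_p + sum(J.T @ F for J, F in zip(link_jacobians, link_wrenches))
    # Solve J_p^T tau = -rhs rather than forming an explicit inverse;
    # np.linalg.solve raises when J_p is singular, mirroring the
    # non-singularity condition required for Equation (45).
    return np.linalg.solve(J_p.T, -rhs)
```

Solving instead of inverting is both cheaper and numerically safer, and the raised `LinAlgError` at singular configurations corresponds exactly to the poses where the inverse dynamics are undefined.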

Fig. 2. General joint space control topology of the Stewart platform with feedforward control extension.

Fig. 3. Platform trajectories of Example 1 and the inverse dynamic solutions.

Fig. 3 (d) shows the inverse dynamic solution for trajectory 4, which takes 9 min to solve for 70 separate time points between 0 and 7 seconds. Around 4.8 s, P_z is close to zero. To verify our observation about trajectory values near zero, we performed further experiments. First, by shifting the P_z trajectory slightly upward, as shown in Equation (5) and Fig. 4 (a), (b), (c), we observe that the effect of this point is reduced but still causes a significant jump in the required forces. Fig. 4 (d) shows the inverse dynamic solution for this Example 2 trajectory, which took 8 min to solve for 70 separate time points between 0 and 7 seconds.
Fig. 6 (a), (b), and (c) show the trajectories of Example 4. Fig. 6 (d) shows the inverse dynamic solution for this trajectory, which takes 9 min to solve for 70 separate time points.

Fig. 6. Platform trajectories of Example 4 and the inverse dynamic solutions.

Fig. 7. Proposed RL topology in the control of the Stewart platform in a feedforward manner.

Fig. 8 shows the final reaching-task rewards of the five RL algorithms: three DRL and two model-based RL algorithms. We run each algorithm 5 times and average over the runs to obtain a valid scientific comparison. As shown in Fig. 8, all five reward curves start from large negative values and converge toward zero. The best steady performance belongs to PILCO, with the minimum number of convergence steps. The worst convergence is that of MBPO, even though it is more stable in the last steps compared to the three model-free algorithms. In general, model-based algorithms tend to exhibit more stable convergence than model-free ones.

As we see, each PILCO training run takes over 10 hours, almost 10 times longer than the model-free algorithms' training time and 3 times longer than the other model-based algorithm (MBPO). In addition to the duration, GPU utilization and GPU temperature during training with the PILCO algorithm are very high, as shown in Fig. 9. Overall, even though model-based algorithms are sample-efficient, they are computationally expensive, requiring more computational resources.

Fig. 9. Computation time and GPU usage of different RL algorithms in training

Fig. 11 presents the action applied by each RL agent at every step of a run with the trained agent. MBPO always drives its actions to the boundaries of the action space. This boundary-seeking tendency is a crucial determinant of MBPO's suboptimal performance across all scenarios, contributing to an elevated steady-state error in the final response. The other four algorithms hover around zero. A3C and PPO behave more erratically than the others, while DDPG and PILCO produce the smoothest learned actions, closest to zero.

Fig. 10. Time-domain performance of the moving platform's states.

Fig. 11. Actions applied by each RL agent during a run with the trained agents.

Equation (36), the resultant wrench at the center of mass of the moving platform, reads:

$$
F_p = \begin{bmatrix} f_p \\ n_p \end{bmatrix}
    = \begin{bmatrix} f_e + m_p g - m_p \dot{v}_p \\
                      n_e - {}^{A}I_p \dot{\omega}_p - \omega_p \times \left({}^{A}I_p \omega_p\right) \end{bmatrix}
$$

TABLE I. SIMPLE POINTS CALCULATION OF THE INVERSE DYNAMIC FORMULATION

TABLE II. INERTIAL PROPERTIES OF THE EXPERIMENTED STEWART PLATFORM

TABLE III. ARCHITECTURE OF THE NEURAL NETWORKS OF THE DRL ALGORITHMS

TABLE IV. HYPER-PARAMETERS OF THE DRL ALGORITHMS

TABLE V. HYPER-PARAMETERS AND NEURAL NETWORK ARCHITECTURE USED IN THE TRAINING OF MBPO

TABLE VII. TIME DOMAIN SPECIFICATIONS OF HEAVE AND YAW STATES OF THE FIVE RL ALGORITHMS