Application of Deep Reinforcement Learning to Tracking Control of a 3WD Omnidirectional Mobile Robot

Deep reinforcement learning is one of the fastest-growing techniques for solving complex real-world problems by casting them in a simple mathematical framework. It includes an agent, actions, and an environment.


Introduction
Wheeled mobile robots have many advantages over their legged counterparts, such as structural simplicity, energy efficiency, high locomotion speed, and low manufacturing cost. One type of wheeled mobile robot is the holonomic wheeled mobile robot, which can be designed to move in any direction without changing its orientation. These omnidirectional robots are built from three or more Swedish wheels, which can move not only forward and backward but also sideways. A 3-wheel mobile robot is shown in Figure 1. A desired capability of an advanced robotic system is the adaptation of effective behavior while interacting with a dynamic environment [19]. The control hierarchy of wheeled mobile robots is often categorized into high-level and low-level control. In high-level control, one of the three major control paradigms (hierarchical, reactive, or hybrid) is applied to undertake a motion task such as path following, point-to-point tracking, trajectory tracking, wall following, or obstacle avoidance [14]. The hierarchical control architecture requires a complete world model to plan an action based on sensor data; due to its high computational requirements, it is slower to respond. The reactive control architecture has no planning stage: it executes an action based directly on sensor data and is therefore quick to produce a response. The traditional way to control the movement of these robots is to apply classic controllers such as PID, using mathematical models of the robots and their inverse kinematics. Now, however, reinforcement learning, artificial intelligence, and deep learning are commonly used instead. Because robots experience many uncertainties in the real world, such as fluctuations in the environment and in the goals, traditional controllers run into difficulties.
Reinforcement learning can be combined with deep learning to solve such complex problems with ease. Analogies between temporal difference (TD) reinforcement learning algorithms and the dopaminergic neurons of the brain have been demonstrated by recent studies in cognitive science. Beyond this nature-derived inspiration, many effective implementations of reinforcement learning (RL) for autonomous driving and motion control of dynamic robotic systems have proven the real-time applicability of previously theoretical concepts to the control of physical systems [3,6,7]. Many of these methods use specific policy structures to represent policies and thereby limit the number of iterations needed to optimize the results. Though efficient, this approach loses generality, since it restricts the policy space to specific forms [10]. To overcome this limitation, neural networks are used as non-linear function approximators for policy parameterization. This eliminates the need for hand-crafted policy representations and human-supplied demonstrations to tune them. Furthermore, the larger number of parameters also, in principle, enables learning of complex behaviors that would not be achievable with linear hand-crafted policies.
In [4], partial reinforcement learning is used along with a neural-network-based algorithm for the tracking control of wheeled mobile robots, to overcome the complexity of the time-varying advance angle. Both actor and critic adaptive laws are derived by the gradient descent method: the critic network is defined to maximize the long-term reward, while the actor network is defined to minimize a long-term cost function. In [12], the performance of visual servo control of a robot is analyzed in the presence of measurement and modeling errors, and a solution is proposed by coupling Q-learning and SARSA with a neural network. In [15], an actor-critic algorithm is used on a PeopleBot robot to find and reach a table so that it can pick up objects from it using a mounted camera; the network is trained from random wandering to finding the table. In [23], reinforcement learning is used to learn the walking of an omnidirectional humanoid robot and to design a high-level push-recovery controller. In [28], the deep reinforcement learning algorithm DDPG is implemented in a continuous action space for a mobile robot that uses a single network structure to learn all three skills: going to the ball, turning, and shooting. The main drawback of this technique is that it fails if the opponent learns to block the shot.
The reinforcement learning algorithms SARSA and Q-learning are applied in [1] for robot navigation by discretizing the continuous states and actions; the discretization determines the performance of the resulting algorithm. The Q-values are represented in tabular form, which requires large memory and difficult mathematical calculations. A deep reinforcement learning method is implemented in [8] for collision avoidance for an indoor service robot: the controller is parameterized using a neural network, while DDPG is used to train the agent. It is shown in [9] that decentralized planning outperforms its centralized counterpart in terms of computational resources. The technique is confirmed on two problems: an extended version of the 3-dimensional mountain car, and a ball-pushing task performed with a differential-drive robot, which is also verified on a physical setup.
In the last few years, deep learning has made a great impact, arguably due to improvements in the computer technologies used to train deep neural networks. For extracting useful information from visual data, object detection and object classification techniques are used; these techniques are based on convolutional neural networks (CNNs). CNNs are a subclass of deep learning in which meaningful data is used to train models to learn patterns and make decisions. CNN-based models are good at detecting and extracting information from images, but they are limited by data availability and require high computational cost. Some CNN models are pre-trained, i.e., already trained on specific data, while others must be trained from scratch. Small pre-trained models yield good results, but in huge models much of the computation is not focused on the original task because extra parameters are involved. To reduce the computational cost, a pruning method is proposed by Zheng et al. [25]; pruning a CNN reduces the model parameters and accelerates its computation. The paper proposes a PAC-Bayesian framework based on drop-path that works by identifying the important paths in the CNN model; it can be applied to multi-layer and multi-branch models, improving the performance and speed of the network.
CNNs require a large amount of data to learn features, and when large datasets are unavailable, techniques like data augmentation are used. Data augmentation is a process that increases the diversity of data without collecting new data, using techniques such as geometric and photometric transformations, and it helps to minimize overfitting problems in CNNs. Data augmentation applied jointly at the training and testing stages can help optimize network performance. To address overfitting in CNNs, Zheng et al. [26] proposed a full-stage data augmentation framework that can reduce model training cost; the framework has been tested on CIFAR-10 and CIFAR-100 and gives improved generalization. [27] introduced a novel two-stage method for training deep convolutional neural networks that improves the generalization ability of CNNs by ensuring robustness to the selection of hyperparameters and optimizing the feature boundary, while the initialization hardly affects the classification ability of the convergent network model. Further, Zheng et al. [24] introduced a layer-wise learning-based stochastic gradient descent method for gradient-based optimization of the objective function, which is a computationally effective and simple technique: the practical performance of the learned model is improved and the training process is accelerated. The generality and robustness of the method make it insensitive to hyperparameters and hence widely applicable to other datasets and network architectures. In recent times, the most astonishing achievement in the field of DRL is the design of an algorithm that can learn to play Atari 2600 games at a superhuman level directly from image pixels [13].
For a three-wheeled omnidirectional mobile robot, tracking is a difficult task because the orientation of the wheels makes the robot tend to rotate around its own axis rather than follow the trajectory. The motivation for using a DRL algorithm is that traditional reinforcement learning relies on the Bellman equation, which is itself mathematically complex to solve for the optimal solution at a particular state and action. In DRL, this computation is replaced by a neural network that iterates and produces the best result for a given action and state. We use neural networks to define the actor and critic networks so as to maximize the long-term reward, while DDPG is used to train the agent using a reward function based on the difference between the actual and desired values of the output. DDPG is used because we are considering continuous observations and continuous actions.
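As an illustration of such a tracking-error-based reward, a minimal sketch is given below. The quadratic form and the weights `w_err` and `w_u` are assumptions for illustration only, not the exact reward defined later in the paper.

```python
import numpy as np

def tracking_reward(actual, desired, action, w_err=1.0, w_u=0.01):
    """Illustrative reward from the difference between actual and
    desired outputs (hypothetical weights, not the paper's exact form)."""
    err = np.asarray(desired, dtype=float) - np.asarray(actual, dtype=float)
    # Penalize the squared tracking error and, lightly, the control effort.
    return -(w_err * float(err @ err) + w_u * float(np.dot(action, action)))
```

A perfectly tracked reference with zero control effort yields the maximum reward of zero; any deviation makes the reward negative.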
The rest of the paper is organized as follows. Section 2 introduces reinforcement learning, deep reinforcement learning, and the deep deterministic policy gradient. Section 3 derives the dynamic model of the 3WD omnidirectional mobile robot. Section 4 describes the DDPG algorithm together with the reward function, the environment, and the actor-critic networks. Sections 5 and 6 present the simulation results and the conclusion, respectively.

Background
Reinforcement learning is a recent and very powerful approach for wheeled mobile robots, as it enables us to find an optimal solution to a problem through trial and error. The technique has roots in the neuropsychological and cognitive science perspective [2]. Inspired by the behavior of animals, which learn to perform specific tasks to obtain a reward or avoid punishment, it has the ability to solve many complex modern problems with ease [20]. It is becoming popular in the control community because of its model-free, black-box approach: reinforcement learning can find an optimal solution even for very complex or high-dimensional systems whose modeling is itself considered a problem in the control field. A generalized scheme for reinforcement learning and the corresponding feedback control system is shown in Figures 2 and 3. The mapping of reinforcement learning terms to control system terms is given below.

Policy - The policy in a control system is the controller.
Environment - Everything in the control system except the controller is the environment. As shown in Figure 3, the environment contains the plant, the desired reference, and the error. In general, the environment contains all other elements, such as disturbances, analog-to-digital and digital-to-analog converters, filters, and measurement noise.
Observation - Any value that can be measured and is visible to the agent. In Figures 2-3, the controller can see the error signal from the environment. We can also develop an agent that observes the outputs, the reference and measurement signals, and the rates of change of these signals.
Actions - The commands applied by the actuators in a control system to control the plant.
Reward - The reward is a function of signals that evaluates the performance of the system against the requirements. It can include sensor outputs, the error, or some performance metric. For example, we can implement a reward function that minimizes the control effort while minimizing the error of the control system.
Learning Algorithm - The learning algorithm corresponds to the adaptation mechanism of an adaptive control system.
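Under this mapping, the agent-environment interaction loop can be sketched as follows. All names here are illustrative: `env_step` plays the role of the plant plus sensors, and `policy` plays the role of the controller.

```python
def run_episode(env_step, policy, init_obs, steps=100):
    """Minimal RL interaction loop in control-system terms.
    policy   -> controller: maps observation to actuator command
    env_step -> plant + sensors: applies command, returns new
                observation and a reward scoring performance."""
    obs, total_reward = init_obs, 0.0
    for _ in range(steps):
        action = policy(obs)                 # controller computes a command
        obs, reward = env_step(obs, action)  # plant responds; reward evaluates it
        total_reward += reward               # a learning algorithm would adapt on this
    return total_reward
```

A learning algorithm sits outside this loop and adapts `policy` so that `total_reward` increases across episodes, just as an adaptation mechanism tunes an adaptive controller.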

Deep Reinforcement Learning
In this paper, the Deep Deterministic Policy Gradient (DDPG), as proposed in [11], is used. DDPG builds on deep reinforcement learning through an actor-critic network. Deep reinforcement learning is a blend of deep learning and reinforcement learning: it makes an agent capable of learning to behave in an environment based on feedback in the form of rewards or a cost function. The main attribute of deep reinforcement learning is that deep neural networks can autonomously extract compact low-dimensional representations (features) from high-dimensional inputs (e.g., text, observations, images, and audio). This field of research has been able to tackle a wide range of complex decision-making tasks that were previously out of reach for machines, opening up many new applications in domains such as healthcare, robotics, and smart grids.

Deep Deterministic Policy Gradient (DDPG)
For problems with high dimensionality, complex tasks, and environments with continuous action spaces, DDPG is used. The deterministic policy gradient algorithm simultaneously learns a Q-value (the maximum expected reward) and a policy. The Bellman equation is used to find the optimal Q-function, and it can be solved with two approaches: value-based (deterministic policy) and policy-based (stochastic policy) [17].
A Markov decision process is defined by the tuple (S, A, P, R, γ), where S is the set of states, A the set of actions, a ∈ A a particular action, P the state-transition probability, R(s) the reward at state s, and γ the discount factor. In the value-based approach the output is an action, while in the policy-based approach actions are not fixed: there is a probability for every possible action. When the action space is discrete, the Q-function can be computed using value iteration. In a continuous action space, evaluating the reward at every step is time-consuming and exhausting, but the Q-function becomes differentiable with respect to the action. So, instead of value iteration, policy evaluation is used. The deep deterministic policy gradient uses the actor-critic algorithm, which lies between the value-based and policy-based approaches, as shown in Figure 4. The actor uses the policy-based approach: it learns how to act by directly estimating the optimal policy, and gradient ascent is used to maximize the reward. The critic uses the value-based approach: it maps the actions taken in the different states to their values.
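As an illustration of the value-based side, the one-step Bellman backup that the critic approximates can be sketched as follows. This is a minimal tabular sketch; DDPG replaces the table with a neural network and the learning rate with gradient steps.

```python
def bellman_target(reward, gamma, q_next):
    """One-step Bellman target: r + gamma * Q(s', a')."""
    return reward + gamma * q_next

def td_update(q, target, lr=0.1):
    """Move the current Q-estimate a step toward the Bellman target
    (the temporal-difference update the critic approximates)."""
    return q + lr * (target - q)
```

Each call to `td_update` shrinks the temporal-difference error `target - q`, which is exactly what the critic's loss minimizes in the neural-network setting.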

Dynamic Modeling of the 3WD-Omnidirectional Mobile Robot
This section describes a dynamic model of the three-wheeled omnidirectional robot. The robot has three Swedish wheel assemblies, and its mathematical model is central to the controller design. Consider a 3-wheel omnidirectional mobile robot moving on a solid surface. The real-world coordinate system is denoted O_R : X_R Y_R, whereas the robot coordinate system O_r : X_r Y_r is fixed at the center of gravity (cog) of the mobile robot, as in Figure 5. Describing the position vector of the center of gravity of the robot in the real-world frame as s_R = [x_R, y_R]^T, its translational dynamics are

M s̈_R = F_R, (1)

where F_R = [F_x, F_y]^T is the force vector applied to the center of gravity of the robot and M is the mass matrix.
Let φ be the angle between the real-world coordinate axis X_R and the moving coordinate axis X_r, i.e., the rotational angle of the robot coordinate system with respect to the real-world coordinate system [21]. The transformation matrix that converts robot coordinates to real-world coordinates is

T(φ) = [cos φ, −sin φ; sin φ, cos φ],

and the position vector and force vector of the center of gravity expressed in the robot coordinate system are s_r = [x_r, y_r]^T and f_r = [f_x, f_y]^T. Solving Eq. (1) in the robot coordinate system, the dynamic properties of the three-wheeled omnidirectional mobile robot can be described as [5,18]

f_x = M(ẍ_r − ẏ_r φ̇),
f_y = M(ÿ_r + ẋ_r φ̇),
M_I = I_v φ̈,

where I_v is the robot's moment of inertia and M_I is the moment around the center of gravity of the robot.
In addition, the property of the driving system [22], [16] for each wheel assembly is taken as

I_R ω̇_i + c ω_i = k u_i − r D_i,  i = 1, 2, 3, (12)

where L is the distance from each wheel to the center of gravity of the robot; k is the driving gain factor; D_i is the driving force of wheel i; r is the radius of each wheel; c is the viscous resistance factor of the wheel; ω_i is the angular rate of wheel i; I_R is the moment of inertia of each wheel around its driving shaft; and u_i is the driving input torque. The geometrical relationships between the variables ẋ_r, ẏ_r, φ̇ and ω_i, i.e., the inverse kinematics, can be written as

ω_i = (1/r)(−sin(δ_i) ẋ_r + cos(δ_i) ẏ_r + L φ̇),  i = 1, 2, 3, (13)-(15)

where δ_i is the mounting angle of wheel i in the robot frame (Figure 5). Combining Equations (6) to (15) yields the overall dynamic model relating the driving input torques u_i to the robot motion. Model parameters used for the simulation are given in Table 1.
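The inverse kinematics above can be sketched in a few lines. The wheel mounting angles used here (0°, 120°, 240°) are an assumption for illustration; the paper's Figure 5 fixes the actual layout.

```python
import numpy as np

def wheel_rates(xdot_r, ydot_r, phidot, L, r,
                deltas=(0.0, 2 * np.pi / 3, 4 * np.pi / 3)):
    """Inverse kinematics sketch for a 3-wheel omni robot: wheel angular
    rate omega_i from the body velocities (xdot_r, ydot_r, phidot).
    `deltas` are illustrative mounting angles, not the paper's exact ones."""
    # Project body velocity onto each wheel's driving direction
    # and add the rotational component L * phidot, then scale by 1/r.
    return [(-np.sin(d) * xdot_r + np.cos(d) * ydot_r + L * phidot) / r
            for d in deltas]
```

A quick sanity check: under pure rotation all three wheels must spin at the same rate L·φ̇/r, and under pure translation the three rates sum to zero for symmetrically mounted wheels.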
While training, a DDPG agent do the following things: 1) Agent updates critic and actor properties at every time step during training.
2) Using a circular experience buffer, it stores past experiences. The agent updates the critic and actor using a mini-batch of experiences randomly sampled from the buffer.
3) Use noise models to perturbs the action chosen by the policy at every training step.
The following four functions are maintained by a DDPG agent to estimate a value and policy function. approximators: • Actor ( ) S µ : The actor takes observation S and outputs the corresponding action that odel parameter used for the simulation are given Table 1.

ble 1
odel Parameters of 3WD-Omnidirectionl Mobile bot the action space of DDPG is a continuous while for actor-critic approach have discrete action space. DDPG agents can be trained in environments with continuous or discrete observations and continuous action spaces. In [11], the working and algorithm of DDPG used in this paper. While training, a DDPG agent do the following things: 1) Agent updates critic and actor properties at every time step during training.
2) Using a circular experience buffer, it stores past experiences. The agent updates the critic and actor using a mini-batch of experiences randomly sampled from the buffer.
3) Use noise models to perturbs the action chosen by the policy at every training step.
The following four functions are maintained by a DDPG agent to estimate a value and policy function. approximators: • Actor ( ) S µ : The actor takes observation S and outputs the corresponding action that maximizes the long-term reward.
Model parameter used for the simulation are given in Table 1.
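The inverse kinematics above can be sketched in code. This is a generic formulation for a 3-wheel omnidirectional robot, not the paper's exact Equation (15): the wheel mounting angles of 0°, 120°, and 240° and the numeric values of r and L are assumptions.

```python
import numpy as np

def inverse_kinematics(xdot, ydot, phidot, phi, r=0.05, L=0.2):
    """Map body velocities (xdot, ydot, phidot) at heading phi to wheel
    angular velocities omega_i for a 3-wheel omnidirectional robot.
    r: wheel radius [m], L: wheel-to-center distance [m] (assumed values)."""
    deltas = np.radians([0.0, 120.0, 240.0])  # assumed wheel mounting angles
    omegas = []
    for d in deltas:
        # each wheel drives along its tangential direction (-sin, cos)
        v_wheel = -np.sin(phi + d) * xdot + np.cos(phi + d) * ydot + L * phidot
        omegas.append(v_wheel / r)
    return np.array(omegas)
```

A quick sanity check: a pure rotation command makes all three wheels spin at the same rate, and a pure translation makes the wheel speeds sum to zero.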

Deep Deterministic Policy Gradient (DDPG)
The DDPG algorithm is an off-policy, online, model-free reinforcement learning method. A DDPG agent is an actor-critic reinforcement learning agent that computes an optimal policy that maximizes the long-term reward. The main difference between the standard actor-critic approach and DDPG is that the action space of DDPG is continuous, while the standard actor-critic approach has a discrete action space. DDPG agents can be trained in environments with continuous or discrete observations and continuous action spaces. The working and the algorithm of the DDPG agent used in this paper are described in [11]. During training, a DDPG agent does the following:
1) It updates the critic and actor properties at every time step.
2) It stores past experiences in a circular experience buffer and updates the critic and actor using a mini-batch of experiences randomly sampled from the buffer.
3) It perturbs the action chosen by the policy with a noise model at every training step.
To estimate the value and policy, a DDPG agent maintains four function approximators:
• Actor μ(S): the actor takes observation S and outputs the action that maximizes the long-term reward.
• Target actor μ'(S): to improve the stability of the optimization, the agent periodically updates the target actor based on the latest actor parameter values.
• Critic Q(S, A): the critic takes observation S and action A as inputs and outputs the corresponding expectation of the long-term reward.
• Target critic Q'(S, A): the agent periodically updates the target critic based on the latest critic parameter values.
Both Q(S, A) and Q'(S, A) have the same parameterization and structure, and both μ(S) and μ'(S) have the same parameterization and structure. When training is complete, the trained optimal policy is stored in actor μ(S).
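The four approximators and the target-update mechanism can be sketched as follows. This is an illustrative Python sketch with linear models, not the paper's MATLAB networks; the class name and the soft-update form are assumptions.

```python
import numpy as np

class DDPGApproximators:
    """Four approximators of a DDPG agent: actor mu(S), target actor mu'(S),
    critic Q(S,A), target critic Q'(S,A). Linear models for illustration."""
    def __init__(self, obs_dim, act_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.actor_w = rng.normal(size=(act_dim, obs_dim))
        self.critic_w = rng.normal(size=(obs_dim + act_dim,))
        # target networks start as exact copies, same parameterization
        self.target_actor_w = self.actor_w.copy()
        self.target_critic_w = self.critic_w.copy()

    def actor(self, s):       # mu(S): observation -> action
        return self.actor_w @ s

    def critic(self, s, a):   # Q(S, A): expected long-term reward
        return self.critic_w @ np.concatenate([s, a])

    def update_targets(self, tau=1e-3):
        # targets slowly track the latest actor/critic parameter values
        self.target_actor_w += tau * (self.actor_w - self.target_actor_w)
        self.target_critic_w += tau * (self.critic_w - self.target_critic_w)
```

Periodic updating (copying the parameters every N steps) is an alternative to the soft update shown here; both serve to stabilize the optimization.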

5) If S'_i is a terminal state, set the value function target y_i to R_i. Otherwise, set it to

y_i = R_i + γ Q'(S'_i, μ'(S'_i | θ_μ') | θ_Q').

The value function target is the sum of the experience reward R_i and the discounted future reward. To compute the cumulative reward, the agent first computes the next action by passing the next observation S'_i from the sampled experience to the target actor, and then finds the cumulative reward by passing that next action to the target critic.
6) Update the critic parameters by minimizing the loss f(Loss) across all sampled experiences,

f(Loss) = (1/M) Σ_{i=1}^{M} (y_i − Q(S_i, A_i | θ_Q))².

7) Update the actor parameters using the following sampled policy gradient to maximize the expected discounted reward,

∇_{θ_μ} J ≈ (1/M) Σ_{i=1}^{M} G_ai G_μi.

Here, G_ai is the gradient of the critic output with respect to the action computed by the actor network, and G_μi is the gradient of the actor output with respect to the actor parameters. Both gradients are evaluated for observation S_i.
The target actor and target critic are updated periodically. The Reinforcement Learning Toolbox of MATLAB R2019a is used to create the DDPG agent; the agent parameters include a discount factor of 0.9 (the experience buffer length and the smooth factor are also specified).
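The value function target computation in step 5 can be sketched as follows. This is a hedged sketch: `target_actor` and `target_critic` are stand-in callables, not the paper's trained networks.

```python
import numpy as np

def value_targets(rewards, next_obs, dones, target_actor, target_critic, gamma=0.9):
    """y_i = R_i for a terminal S'_i, otherwise
    y_i = R_i + gamma * Q'(S'_i, mu'(S'_i))."""
    y = np.array(rewards, dtype=float)
    for i, (s_next, done) in enumerate(zip(next_obs, dones)):
        if not done:
            a_next = target_actor(s_next)                   # pass S'_i to target actor
            y[i] += gamma * target_critic(s_next, a_next)   # then to target critic
    return y
```

Terminal experiences contribute only their immediate reward; non-terminal ones add the discounted target-critic estimate of the future reward.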

Actor and Critic Network
The actor and critic networks are defined with the help of the Deep Learning Toolbox. The actor network takes the observations as input and outputs the action, which for the 3WD-omnidirectional mobile robot is the motor speed of the three Swedish wheels.
The observations used for this system are x, y, θ, ẋ, ẏ, θ̇, x_e, y_e, ẋ_e, ẏ_e, and the motor speeds from the previous agent step. The steps to create a good actor and critic network are as follows.
1) Start with the smallest possible network and a high learning rate (0.01). Train this initial network to see whether the agent converges quickly to a poor policy or acts randomly. If either of these issues occurs, rescale the network by adding more layers or more outputs per layer. The goal is to find a network structure that is just big enough, does not learn too fast, and shows signs of learning (an improving trend in the reward graph) after an initial training period.
2) Then configure the agent to learn slowly by setting a low learning rate. Learning slowly makes it possible to check whether the agent is on the right track, which helps verify that the network architecture is satisfactory for the problem. For difficult problems, tuning parameters is much easier once a good network architecture has been settled on. Figure 6 shows the graphical representation of the critic neural network. The settings for the actor and critic networks are: optimizer = Adam, learn rate = 1×10⁻³, gradient threshold = 1, regularization factor = 1×10⁻⁵.
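The actor's input-output shape described above can be sketched as a tiny network. This is an illustrative Python sketch, not the paper's MATLAB network: the hidden-layer size of 64 and the weight scale are assumptions, chosen only to show the 13-observation-in, 3-motor-speed-out structure.

```python
import numpy as np

def make_actor(obs_dim=13, act_dim=3, hidden=64, seed=0):
    """Tiny MLP actor: observation -> 3 motor speeds.
    obs_dim = 10 robot states/errors + 3 previous motor speeds;
    hidden = 64 is an assumed starting size (grow it if learning stalls)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(hidden, obs_dim))
    W2 = rng.normal(scale=0.1, size=(act_dim, hidden))

    def actor(obs):
        h = np.maximum(0.0, W1 @ obs)   # ReLU hidden layer
        return np.tanh(W2 @ h)          # bounded output, scale to motor range
    return actor
```

Starting this small and rescaling only when the reward curve fails to improve follows step 1 above.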

Reward Function
The main purpose of the paper is to track a reference trajectory, and the main task is to minimize the error; a reward function can therefore be designed based on the error signal. The error signal used for the simulations is as follows. The Simulink representation of the total reward function R_1 is shown in Figure 7. A second reward function R_2 is also defined, which is simpler: the reward increases as the error decreases.
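A reward that grows as the tracking error shrinks can be sketched as follows. The exponential form and the `scale` parameter are assumptions for illustration, not the paper's exact R_1 or R_2 (those are defined in the Simulink model of Figure 7).

```python
import numpy as np

def tracking_reward(x_e, y_e, scale=1.0):
    """Error-based reward: maximal (= 1) at zero tracking error,
    decaying smoothly with the squared error norm, so the reward
    increases as the error decreases."""
    err_sq = x_e**2 + y_e**2
    return np.exp(-scale * err_sq)
```

Any monotone decreasing function of the error norm would serve the same purpose; the exponential simply keeps the reward bounded in (0, 1].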

Environment
In terms of reinforcement, the learning environment is everything except the agent. The environment includes the plant, the desired reference, and the error. In general, the environment also contains some other elements like disturbance, analog-digital, digital-analog converters, filters, measurement noise, etc.

Figure 7
Reward Function Simulink Representation

In the case of the 3-wheeled omnidirectional mobile robot, an environment block is created in Simulink that includes the reward function, the exceed-bound limits, and the observations. Figure 8 shows the dynamic model of the system environment created for this paper; this block is then integrated with the RL agent, which learns the policy and implements it on the dynamic model of the system (Figure 9). The RL agent takes as inputs an observation, the reward, and a flag indicating whether the simulation is done, and outputs the motor speeds of the 3-wheeled omnidirectional mobile robot.

Figure 8
Environment for DDPG Agent
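The environment's observation-reward-done interface can be sketched as follows. This is a minimal stand-in, not the Simulink environment block: the true plant is the robot's dynamic model, and here a trivial kinematic integrator, the bound value, and the reward form are all assumptions.

```python
import numpy as np

class OmniRobotEnv:
    """Minimal environment sketch: everything except the agent.
    The real plant is the Simulink dynamic model; a trivial kinematic
    integrator stands in for it here (assumption)."""
    def __init__(self, bound=5.0, dt=0.01):
        self.bound, self.dt = bound, dt
        self.reset()

    def reset(self):
        self.pos = np.zeros(2)   # robot returns to the origin each episode
        return self.pos.copy()

    def step(self, velocity_cmd, reference=(0.0, 0.0)):
        self.pos += self.dt * np.asarray(velocity_cmd, dtype=float)
        err = np.asarray(reference, dtype=float) - self.pos
        reward = float(np.exp(-err @ err))                   # grows as error shrinks
        done = bool(np.any(np.abs(self.pos) > self.bound))   # exceed-bound flag
        return self.pos.copy(), reward, done
```

The `done` flag plays the role of the flag function described above: it tells the agent the episode has ended, either by leaving the allowed bounds or by an external stop condition.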

Results and Simulations
Simulation for the validation of the results has been done in MATLAB R2019a; the Reinforcement Learning Toolbox is used for environment creation, the actor and critic networks, the agent, and the training of that agent. To reduce complexity, the simulation range of the motors' inputs is restricted to [0, ∞), where two motors, M1 and M2, are set as positive while the third, M3, is set to move opposite to the first two. This is done to limit the rotation of the 3WD-omnidirectional mobile robot about its own axis. Two different trajectory scenarios are used to validate the results. The first scenario is point-to-point tracking along a straight line; this is the simplest scenario because, as the robot advances, its angle φ remains constant. In the second scenario, tracking of a circular trajectory is used; this is a complex trajectory for the 3WD omnidirectional robot because φ changes at each point of the circle.

Scenario 1
For initial training, the reference is given as point-to-point tracking. Simulation results are given in Figure 10, which shows the control inputs to the motors (M1, M2, M3). Figure 11 shows the number of iterations and the rewards at each iteration, which include the episode reward, the average reward, and the expected reward, while Figure 12 shows the results of point-to-point tracking by the 3-wheel omnidirectional mobile robot.
The simulation stops when the average reward reaches 1000. The iteration graph shows that for about 100 iterations nothing notable happens; then the actor and critic neural networks suddenly start to predict the inputs that maximize the reward function. The stopping criterion is based on monitoring the average reward, because each episode reward is very noisy and can jump to its maximum or minimum value at any time.
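The averaged stopping criterion can be sketched as follows. The window size of 20 episodes is an assumption; the idea is only that the noisy per-episode reward is smoothed before being compared to the threshold.

```python
def should_stop(episode_rewards, threshold=1000.0, window=20):
    """Stop training when the moving average of the last `window`
    episode rewards reaches the threshold. Single-episode rewards are
    too noisy to compare against the threshold directly."""
    if len(episode_rewards) < window:
        return False
    recent = episode_rewards[-window:]
    return sum(recent) / window >= threshold
```

A single lucky episode thus never stops training on its own; only a sustained run of high rewards does.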

Scenario 2
In this scenario, a sine wave is applied as the reference for the x-axis while a cosine wave is applied as the reference for the y-axis; combined, they make a circular reference trajectory. Simulation results are given in Figure 13, which shows the iterations for the tracking of the circular trajectory; the simulation stops when the average reward approaches 1900. Figure 14 shows the error signal of the x-y axes for the tracking of the circular trajectory: the error starts at its maximum because the robot's initial position is at the origin; it then starts to follow the circle and the error goes to zero. Figure 15 shows the result of circular trajectory tracking by the 3-wheel omnidirectional mobile robot. The reset function returns the robot to the origin when each iteration ends.
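The sine/cosine reference of this scenario can be sketched as follows; the radius and angular rate are assumptions, since the paper does not state them here.

```python
import numpy as np

def circular_reference(t, radius=1.0, omega=1.0):
    """x_ref = R sin(w t), y_ref = R cos(w t): together they trace a
    circle, so the reference direction changes at every point of the
    path, making this harder to track than a straight line."""
    return radius * np.sin(omega * t), radius * np.cos(omega * t)
```

At t = 0 the reference starts at (0, R), which is why the tracking error is largest initially for a robot reset to the origin.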

Conclusion
To achieve trajectory tracking of a 3-wheel omnidirectional mobile robot, the deep reinforcement learning (DRL) algorithm deep deterministic policy gradient (DDPG) is used, which allows the goal to be achieved with continuous actions and states. To attain the control objective, less calculation is needed compared to a full optimal control algorithm, and higher accuracy was obtained compared to a typical control method. MATLAB R2019a is used for the simulation, and the Reinforcement Learning Toolbox makes the whole work much easier. The best part of this technique is that the goal can be achieved with little or no knowledge of the dynamic model. This research is very useful where a robot has to perform a task repeatedly millions of times, such as automated mobile assembly, automatic sorting of books in a library, robots working in congested areas, and planetary exploration. Further research can be carried out by combining a traditional feedback controller with reinforcement learning to achieve faster and better results.