Novel Algorithm for Agent Navigation Based on Intrinsic Motivation Due to Boredom

We propose a novel algorithm for agent navigation based on reinforcement learning that uses boredom as an element of intrinsic motivation. Simulations show the improvements obtained by including this element over classic strategies. Boredom is modeled through a chaotic element that creates the conditions for generating routes when the environment offers no reward, prompting the robot to navigate. Our proposal seeks to avoid the time that classical algorithms lose when they are applied in scenarios without rewards. We demonstrate experimentally that adding the element of boredom makes it possible to generate routes in scenarios where no rewards exist, which allows these strategies to be used in real circumstances and facilitates the robot's navigation towards its objective. The main contribution of this work is to show that navigation can be improved in scenarios that are completely adverse for a reward-based navigation algorithm.


Introduction
Reinforcement learning (RL) is one of the most common techniques in the field of machine learning [15,29]. Its operation is based on how human beings learn, with learning by conditioning as one of its influences [7].
In general, reinforcement learning rewards the correct execution of actions and punishes wrong decisions. However, the environment skews the options available to the agent, so the agent becomes an element that seeks to maximize an internal objective function, constantly learning from the problem through trial and error. A model of an agent based on reinforcement learning is shown in Figure 1. Multiple algorithms use extrinsic reward elements to improve learning, as indicated in [1,25]; however, they show a specific dependence of the entity on what the environment can offer it. Current research by different authors integrates bio-inspired behaviors, such as the work in [21], where a framework for the interaction of robots with humans is developed.
For a robotic agent, what happens if the natural environment does not offer the rewards that the algorithm needs to function? In general, the entity could not operate properly, which wreaks havoc on the way it executes actions. In contrast, human beings are able to determine their objectives according to their particular abilities [27]; because of this, the context in which a human finds himself does not determine how far he will be able to carry out a certain task. This attribute can be linked to works on emotions such as [14,26]. This capacity, which is based on the impulse to explore the environment spontaneously, is shown in the works of [8] and widely discussed by [2]. It is known as intrinsic motivation (IM), and it becomes an aspect to consider in order to avoid failures in algorithms that are only tested in simulation. Learning by motivation is subject to learning by reinforcement. Researchers have made important efforts to define emotions, which are rooted in the area of psychology, within the problems of computational learning. This has produced study models such as those presented in [4,11,17], which use indicators associated with motivation such as curiosity, novelty, pain and surprise, among others. Another application of this type of bio-inspired algorithm can be seen in [28], whose proposal involves the development of an algorithm to emulate the cerebellum and ganglia interaction. Psychologically, it is accepted that learning is a process in which practical experience produces a change in behavior; therefore, there is an internal element that generates the expectations of this learning. We call this element intrinsic motivation [19], and it is considered a mechanism that encourages species to achieve objectives.
This mechanism can be associated with both internal and external factors (Figure 2), and this boundary is an area of interest for research [22]: the internal intentions, or what really moves robots to fulfill a goal, may not be the same for every robot. This makes it possible to establish differences within a set of robots that are homogeneous in their constitution but absolutely heterogeneous in their disposition towards the objective. This concept is widely treated in different theories, as portrayed in [10,13], which associate motivation with intangible elements such as expectation, incentive or boredom [30].
Other authors have developed experiments with animals where they have sought to measure reinforcement learning considering intrinsic motivation [16] to find behavioral models.
The nature of boredom and its positive effects on motivation represent a starting point for this work [6], which models this condition through a chaotic element that generates conditions for the creation of routes when the environment does not offer any reward.
Our objective is to compare navigation strategies for robots, through the application of a proprietary RL algorithm based on intrinsic motivation driven by boredom.

Methodology
The problem consists of positioning a robot in space and determining how it will reach an objective point. It is formulated as a dynamic programming problem using the Bellman equation [3], so the algorithm can be defined as shown in Equation (1), where V(s) corresponds to the value of being in state s, R(s, a) represents the reward function for the current state s when taking action a, V(s') represents the value of the new state s' if action a is taken, and γ is a discount factor that weighs the decisions the entity will make, allowing future decisions to be evaluated.
Since the process of determining the direction of navigation depends with total freedom on the entity, as specified in [18], Equation (1) can be rewritten as formalized in Equation (2), where the probabilities of all possible decisions are analyzed when the robot is at a certain point in space, as represented in Figure 3.
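For reference, the standard textbook forms of the Bellman equation, written with the symbols defined above, are sketched below. They are assumed to correspond to Equations (1) and (2). The transition probability P(s, a, s') is a symbol introduced here for illustration and does not appear in the text.

```latex
\begin{align}
V(s) &= \max_{a}\left[ R(s,a) + \gamma\, V(s') \right] \tag{1} \\
V(s) &= \max_{a}\left[ R(s,a) + \gamma \sum_{s'} P(s,a,s')\, V(s') \right] \tag{2}
\end{align}
```

In the first form the successor state s' is determined by the chosen action. In the second form the value of each action is averaged over every state the robot may reach from its current cell.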


Figure 3
Grid world visualized by the robot.

Considering the above, when the scenario offers no alternatives that reward the algorithm for executing its tasks, the entity begins to perform random actions. Taking this as a basis, two classic RL algorithms are studied; these algorithms were previously compared, without intrinsic motivation, in mobile robot path planning [24].

Q-Learning
Q-Learning is a learning algorithm that seeks to maximize the future reward by exploring all the possible solutions available for a displacement. Each iteration is stored in a table called the Q table, generating policies and displacement actions.
The model is expressed in Equation (3), where Q(s, a) represents a state-action pair wherever the robot is, ξ represents the learning coefficient, and γ is a discount factor that weights the value of moving to a new state s'.
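The update described above can be sketched in a few lines. This is a minimal illustration of the standard tabular Q-learning rule, which is assumed to correspond to Equation (3). The function name, the list-of-lists table and the default values of ξ and γ are illustrative choices that do not come from the text.

```python
def q_learning_update(q, s, a, r, s_next, xi=0.1, gamma=0.9):
    """Apply one tabular Q-learning step in place and return the new Q(s, a).

    Standard rule (assumed form of Equation (3)):
        Q(s, a) <- Q(s, a) + xi * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[s_next])  # greedy value of the new state s'
    q[s][a] += xi * (r + gamma * best_next - q[s][a])
    return q[s][a]


# Toy table with two states and two actions (hypothetical values).
q_table = [[0.0, 0.0], [0.0, 1.0]]
updated = q_learning_update(q_table, s=0, a=1, r=0.5, s_next=1)
```

Because the target uses the maximum over the next state's actions, the update does not depend on which action the agent actually takes next. With an all-zero table and r = 0, the update leaves the entry at zero.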

SARSA
SARSA is an on-policy learning algorithm built on a Markov decision process (MDP). It bases its operation on updating a Q table that depends on the state and action selected by the agent, Q(s, a): the reward r is obtained according to that action, the agent moves to the new state s', and a new action a' is selected there. The system model is shown in Equation (4).
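A minimal sketch of the standard tabular SARSA rule follows, assumed to correspond to Equation (4). The function name and the default values are illustrative, and the learning coefficient ξ and discount factor γ are assumed to play the same role as in the Q-Learning model.

```python
def sarsa_update(q, s, a, r, s_next, a_next, xi=0.1, gamma=0.9):
    """Apply one tabular SARSA step in place and return the new Q(s, a).

    Standard rule (assumed form of Equation (4)):
        Q(s, a) <- Q(s, a) + xi * (r + gamma * Q(s', a') - Q(s, a))
    """
    # The target uses the action a' actually selected in s', not the best one.
    q[s][a] += xi * (r + gamma * q[s_next][a_next] - q[s][a])
    return q[s][a]


# Toy table with two states and two actions (hypothetical values).
q_table = [[0.0, 0.0], [0.2, 1.0]]
updated = sarsa_update(q_table, s=0, a=1, r=0.5, s_next=1, a_next=0)
```

The only difference from the Q-Learning sketch is the target term: SARSA bootstraps from Q(s', a') for the chosen a', which is what makes it depend on the policy being followed.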

Algorithm Based on Boredom Motivation
According to the literature, structures have been developed that use novelty as an element to provide a stimulus [12], and the implementation of dynamic controllers based on curiosity and boredom is demonstrated in [23]. These related works are based on Q-learning algorithms with a method of intrinsic motivation. Another case where boredom and curiosity are used is [31]. This research demonstrates that boredom is an enabler of curiosity.
It is possible to establish a model based on the Q-learning/SARSA structure and apply new variables in decision making powered only by boredom, considering the work in [5], where the author presents boredom as a state that can motivate one to pursue a new goal when the current state is felt to be unsatisfactory.
Considering that the SARSA algorithm has a better response in growing scenarios [9], a condition called boredom is applied to it.
This condition occurs in the worst-case scenario for RL algorithms, in which the medium offers no reward and therefore the matrix R(s, a) = 0. This implies that none of the available options attracts the agent.
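The effect of R(s, a) = 0 on a temporal-difference learner can be checked with a short sketch. It assumes the standard SARSA update, a Q table initialized to zero and a 7x7 world with four actions. The random transition between states is a placeholder for a real grid model, and all names are hypothetical.

```python
import random


def train_sarsa_without_reward(n_states=49, n_actions=4, steps=1000,
                               xi=0.1, gamma=0.9, seed=0):
    """Run SARSA updates with R(s, a) = 0 everywhere and return the Q table."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    s, a = 0, rng.randrange(n_actions)
    for _ in range(steps):
        s_next = rng.randrange(n_states)   # placeholder transition (assumption)
        a_next = rng.randrange(n_actions)  # random action: nothing to exploit
        r = 0.0                            # the environment offers no reward
        q[s][a] += xi * (r + gamma * q[s_next][a_next] - q[s][a])
        s, a = s_next, a_next
    return q


q_table = train_sarsa_without_reward()
stuck = all(value == 0.0 for row in q_table for value in row)
```

Every target is r + γ·0 = 0, so no entry ever moves away from its initial value and `stuck` is true regardless of how many steps are run. Under these assumptions no action is ever preferred over another.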
Following what some authors state in the theory of self-determination, boredom can be defined as an instance in which creativity originates, and it can therefore be used as a catalyst towards intrinsic motivation.
Therefore, the state of boredom can be described as a random element that leads to two possible conditions: a) maintaining the current dissatisfied condition, or b) propelling oneself to a state of creativity. This duality is portrayed in how completely the scenario of possible rewards is filled, as represented in Equation (5), where values are assigned to either half or the whole of the set in which R(s, a) = 0. The value of 0.5 is defined as the cut-off threshold for the Bored variable, considering the criterion of maximum variance defined as M.
Assuming that the environment where the algorithm will be applied is of unknown size, this criterion provides guarantees by granting the same possibility of occurrence to the situations in which the universe will be completed.
To avoid the values used being distributed in a normal way, a chaotic function is used based on the Chua oscillator model in its discrete form, as seen in Equation (6), where the term f(x_{t_{k-1}}) is developed in Equation (7), as exposed by [20].
Algorithm 1 displays the proposal for the integration of boredom in the SARSA flow.

Algorithm 1
Step 3 Boredom = Random
Step 4 If (Boredom > 0.5)
Step 5 Function Chua return M[x, y, z]
Step 6 R(s, a) = M[z]
Step 7 Else if (Boredom <= 0.5)
Step 10 End if
Step 11 Function SARSA-Agent(perception) return an action
Step 12 End if
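The reward-generation part of Algorithm 1 can be sketched as follows. This is not the authors' implementation: the discrete form of the Chua oscillator is assumed here to be a forward-Euler step of the usual dimensionless Chua equations with commonly quoted double-scroll parameters, and the rule for choosing which half of the table is filled when Boredom <= 0.5 is a random selection chosen for illustration. All function names, the step size, the stride and the burn-in are hypothetical.

```python
import random


def chua_z_samples(n, dt=0.001, stride=100, burn_in=5000,
                   alpha=15.6, beta=28.0, m0=-1.143, m1=-0.714,
                   state=(0.7, 0.0, 0.0)):
    """Return n samples of z from an Euler-discretized Chua oscillator."""
    x, y, z = state
    samples = []
    for step in range(1, burn_in + n * stride + 1):
        # Piecewise-linear nonlinearity f(x) of the Chua diode.
        fx = m1 * x + 0.5 * (m0 - m1) * (abs(x + 1.0) - abs(x - 1.0))
        dx = alpha * (y - x - fx)
        dy = x - y + z
        dz = -beta * y
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        # Keep one value every `stride` steps once the transient has passed.
        if step > burn_in and (step - burn_in) % stride == 0:
            samples.append(z)
    return samples


def boredom_rewards(n_states, n_actions, bored, rng):
    """Fill an all-zero reward table: whole table if bored > 0.5, else half."""
    cells = [(s, a) for s in range(n_states) for a in range(n_actions)]
    if bored <= 0.5:
        # Assumption: the half that receives values is picked at random.
        cells = rng.sample(cells, len(cells) // 2)
    rewards = [[0.0] * n_actions for _ in range(n_states)]
    for (s, a), value in zip(cells, chua_z_samples(len(cells))):
        rewards[s][a] = value  # Step 6: R(s, a) = M[z]
    return rewards


rng = random.Random(0)
bored = rng.random()                          # Step 3: Boredom = Random
rewards = boredom_rewards(49, 4, bored, rng)  # 7x7 world, 4 movements
```

The resulting table would then be handed to the SARSA agent in place of the all-zero R(s, a). Sampling the oscillator with a stride rather than at every step is a design choice made here to reduce the correlation between consecutive reward values.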

Results
Two agents were trained under normal operating conditions in a known world of size 7x7. A slight superiority of the Q-Learning algorithm over SARSA is observed in the time of convergence towards a solution; however, this training process is performed under normal conditions with a specific reward.
When the universe does not deliver any reward, the classic algorithms try to converge on some viable result but remain at 0 (Figure 4 A and B), in contrast to the proposed algorithm, which generates training patterns under either of the conditions that arise.

Figure 5
Agent behaviour in training world when boredom is greater than 0.5

Figure 6
Agent behaviour in training world when boredom is less than 0.5

In C) the agent converges on a route at 70 iterations, which are carried out with the entire reward matrix filled with values obtained from the Chua function. In D), on the other hand, the system takes longer to generate a route; however, here only 50% of the matrix has rewards, which are enough to take the learning system out of inertia and generate a route to the destination.
The routes traced in both cases are completely different, as can be seen in Figure 5 and Figure 6. This affects the way in which the agent faces the journey in the world, since the training and knowledge acquired in this process are of vital importance for behavior in the environment. Both cases can be considered good, because the agent can arrive at the proposed destination and the reward is different from 0.
Taking this into consideration, the different tests carried out with the analyzed algorithms can be compared (Table 1). When focusing on points that the routes do not cover within the training pattern, navigation failures are observed, as happened with Q-Learning and Boredom <0.5 at the point (6, 2). The absolute failure of most of the algorithms is shown in Table 2, where the Boredom <0.5 algorithm was the only one to navigate to that point. The map used to carry out the tests had the same size; however, the non-displacement zones were modified, as can be seen in Figure 7, which shows the agent reaching the position that was most complicated for all the other elements. The most eloquent results on the effectiveness of the navigation method that uses boredom as an intrinsic element of motivation are observed when viewing the Q tables of each of the proposed methods.

Table 2
Behavior in navigation when facing target point (3,5) outside the training routes

The Q tables show the relations between the movement alternatives (forward, right, left, back) and the target reward obtained for each decision. Figure 8 and Figure 9 portray the situation where, even though the environment does not offer any rewards, the system generator allows navigation, giving the agent different alternatives for making a movement and showing rewards different from zero.
Both responses differ in convergence times and in the decisions that the robot executes. The latter demonstrates how boredom influences the decisions, allowing the construction of two different solutions in the path planning.

Figure 8
Behavior of possible alternatives according to the agent's decision to move when boredom is less than 0.5

Figure 9
Behavior of possible alternatives according to the agent's decision to move when boredom is greater than or equal to 0.5

Figure 10
Behavior of the Q table in Q-learning algorithm when exposed to an environment of 0 values

Figure 11
Behavior of the Q table in Q-learning algorithm when exposed to an environment of 0 values

In contrast, its peers such as Q-learning and SARSA do not provide the robot with options to perform any movement. As portrayed in Figure 10, the agent tries to move by selecting movement 1 (forward), but in all cases the reward obtained is near zero, which implies that the agent is not going anywhere. In Figure 11 the case is different, because the algorithm does offer some possibilities of movement and the agent moves, but it does not arrive at the destination.

Conclusion
From the different tests carried out, it can be deduced that including the boredom motivation algorithm as a generator of intrinsic rewards improves the agent training process: under zero-reward conditions, the system uses values that get around this problem, providing an algorithm in which boredom, powered by a chaotic number generator, is the main element that catalyzes motivation.
The proposal offers a possible solution to navigation in environments that are aggressive for the algorithm, especially when environmental conditions offer no reward for travel. It can be used in path planning or in the training process.
Considering that the navigation system generates alternative routes in the training process, we propose developing a mixture of both options, with boredom greater than 0.5 and less than 0.5, so that the algorithm knows more travel options. According to the data generated, this would allow the agent, when it does not find the solution in one of the reward tables, to jump to another that does contain it.
For future work, we propose applying these algorithms to a set of robots so that the navigation information is shared.