Constraint-aware Policy Optimization to Solve the Vehicle Routing Problem with Time Windows

The vehicle routing problem with time windows (VRPTW), one of the best-known combinatorial optimization (CO) problems, is considered hard to solve in practice; the main challenge is to find approximate solutions within a reasonable time. In recent years, reinforcement learning (RL)-based methods have gained increasing attention in many CO problems.


Introduction
The vehicle routing problem (VRP), as an enduring problem of operations research, has been studied for decades due to its wide application in various fields, such as logistics, transportation, and manufacturing [6]. The problem aims to find optimal routes for available vehicles to travel in order to satisfy the demands of customers under certain constraints [16]. Many exact, approximate, and heuristic methods have been proposed, some of which are state-of-the-art [7,8,25,30]. With the development of deep neural networks, deep reinforcement learning (DRL)-based methods, as a kind of heuristic method, have gained attention due to their enormous potential to efficiently generate high-quality solutions [3,13,23].
DRL-based methods solve the VRP by training a neural network (NN) model that maps the state space to the optimal solution. The model consists of many components, the most representative of which are representation learning and reinforcement learning (RL). Representation learning extracts efficient feature vectors from the raw data, while RL trains the whole NN without requiring optimal solutions as labels. Recent research mainly focuses on improving representation learning to advance performance. Vinyals et al. [31] present the pointer network, which first treats each optional element as a pointer and employs recurrent neural networks (RNN) to represent the dynamic changes of the features. Bello et al. [3] introduce the attention mechanism to weight the different feature vectors and thereby improve the probabilities of the pointing mechanism. Nazari et al. [23] and Kool et al. [15] each design a method with a novel attention mechanism that outperforms the state of the art.
However, previous work lacks further research on constraints, especially their impact on reinforcement learning. Because reinforcement learning learns through trial-and-error exploration [3,13,15,23], the quality of the trained policy is greatly affected by the exploration strategy under constraints that define the boundary of the solution space [26,28]. When using reinforcement learning to train neural network models, these methods only mask out solutions that do not satisfy the constraints, and do not consider how to make the agent aware of the existence of constraints in a given state. This neglect of the information contained in constraints makes it difficult for them to find optimal solutions. Intuitively, as constraints become more complex, awareness of constraints becomes more necessary.
The VRP with time windows (VRPTW), a variant of the VRP with complex time-window constraints, has been widely studied and is NP-hard to solve. In this problem, a fleet of identical vehicles serves multiple customers along optimal routes subject to the following constraints: (1) each vehicle can only start from, end at, and acquire items from the depot; (2) each vehicle must visit customers within a specified time interval (time window); (3) the total demand of the customers served by a single vehicle cannot exceed its capacity; and (4) all demands must be satisfied. The objective is to minimize the total distance of all tours. The existence of many complex constraints limits the performance of previous methods, so it is necessary to improve the learning process of RL.
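To make constraints (2)-(4) concrete, the following minimal Python sketch computes a 0/1 feasibility vector over customers for the next visit. All names are illustrative, as is the assumption that a vehicle arriving early may wait until the window opens; this is not the paper's implementation.

```python
import numpy as np

def feasibility_mask(demands, load_left, arrival_times, tw_close, visited):
    """Return a 0/1 vector over customers: 1 = feasible to visit next.

    A customer is infeasible if it was already visited, its demand exceeds
    the vehicle's remaining capacity, or the vehicle would arrive after the
    time window closes (early arrivals are assumed to wait).
    """
    mask = np.ones(len(demands), dtype=int)
    mask[list(visited)] = 0                 # constraint (4): serve each customer once
    mask[demands > load_left] = 0           # constraint (3): capacity
    mask[arrival_times > tw_close] = 0      # constraint (2): time windows
    return mask

# Toy instance: customer 2 already visited, customer 1's demand (5) exceeds
# the remaining load (4), so only customer 0 remains feasible.
m = feasibility_mask(np.array([3, 5, 2]), 4,
                     np.array([2.0, 9.0, 5.0]),
                     np.array([10.0, 8.0, 4.0]), [2])
```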
To investigate these problems, we present constraint-aware policy optimization (CPO), which compels our agent to learn the probability distribution of constraints. A method based on Kullback-Leibler (KL) divergence is constructed to calculate the distance between the probability distributions of the constraint and the policy, which yields a gradient to guide learning. Our work comprises four main steps. 1) To formally represent the VRPTW as a sequential decision process for RL, we convert the constrained route-planning problem to a deterministic constrained Markov decision process (DCMDP), to which we extend the original policy gradient method. 2) To guide the policy to learn the features of constraints, a constraint-aware training scheme is proposed, which enhances our method's performance; it includes a predictive part that predicts the current constraints from the state information, and an inference part that makes decisions according to the constraints. 3) To alleviate the sparse reward problem, we use successor representation (SR) as an indicator to guide the choice of actions. 4) We build an RL training framework for the VRPTW; experiments on the Solomon benchmark and generated datasets show that our method outperforms competing algorithms.
The remainder of this paper is organized as follows. Section 2 provides an overview of related work. Section 3 discusses the basic ideas of the deterministic constrained Markov decision process, the VRPTW, and successor representation. Three key components of CPO are presented in Section 4. To verify our method, experiments are reported in Section 5. Section 6 concludes.

Related Work
Our work is closely related to integer linear programming, heuristic approaches to the VRP, learning-based routing, and constrained RL.

Integer Linear Programming and Heuristic Approaches
These approaches are commonly used to solve the VRP and its variants, and are formulated as mixed integer linear programming (MILP) problems.
Branch and bound is one of the most famous exact algorithms for the VRPTW. It represents candidate solutions as a tree and prunes solutions that are beyond the slack upper bound [30]. This approach can find optimal solutions, but it becomes intractable as the number of decision objects increases. Conversely, heuristic approaches, such as genetic and ant colony algorithms, prefer scalability to optimality, and thus can find approximate solutions to large-scale problems in a relatively short time. However, heuristic algorithms are sensitive to parameters and weak in robustness. All the above methods are designed to solve problems case-by-case: they must search for solutions anew for each new instance, which is time-consuming. That is the major obstacle to these methods in practice.

Learning-Based Routing
Learning-based routing approaches are common due to their efficient solutions and stable performance. These methods are based on either supervised learning (SL) or RL. Introduced by Hopfield et al. [12], SL-based methods aim to leverage neural networks to find solutions in a supervised manner. Since they require expert guidance, SL methods are intractable in environments where traditional methods cannot work [13,17]. RL-based methods attempt to find approximate solutions through trial and error [3,13,15,23]. While capable of good performance, they do not work well when the constraints become intricate. Our method differs in that we attempt to enhance performance through a constraint-awareness learning scheme and an indicator that guides our policy to learn from constraints.

Constrained Reinforcement Learning
Constrained RL is a trending topic because, in real-world settings, agents are strictly restricted to avoid dangerous actions [1,4,32]. Widely used methods include expert-guided action intervention [32], Lagrangian relaxation [29], and probability-guided exploration [4]. We take motivation from them, but VRP constraints are deterministic with respect to the environment (as discussed in the next section). We leverage this property to develop a training scheme that makes our method more robust and stable.

Background
We discuss the basic ideas of the deterministic constrained Markov decision process (MDP), VRPTW, and successor representation.

DCMDP
A deterministic constrained Markov decision process can be formulated as a tuple ⟨S, A, R, p, T, C⟩, where S is a set of states, A is a set of actions, and p(s_{t+1} | s_t, a_t) is a transition function. Different from the standard MDP, a set of constraints C is represented in the deterministic constrained MDP, which varies with the environment (load constraints in the VRP; load and time-window constraints in the VRPTW). Let c_t: A → {0, 1} indicate the constraint for action a at step t; c_t^a = 0 means that the constraint is active (action a cannot be chosen), and c_t^a = 1 means that it is not. R: S × A × C → ℝ is the reward function. The goal of RL is to find a policy π: S × C → A that maximizes the expected return R = ∑_{t=1}^{T} γ^{t−1} r_t over the trajectory τ ≔ (s_1, a_1, c_1, …, s_T, a_T, c_T) within T steps. γ is a discount factor.
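As a toy illustration of the tuple above, the snippet below shows a one-step view of the constraint indicator c_t and the discounted return R = ∑_{t=1}^{T} γ^{t−1} r_t. The function names are ours, not the paper's.

```python
def allowed_actions(c_t):
    """Actions whose constraint bit is 1 (c_t[a] = 0 means a is forbidden)."""
    return [a for a, flag in enumerate(c_t) if flag == 1]

def discounted_return(rewards, gamma):
    """R = sum_{t=1}^{T} gamma^(t-1) * r_t for one trajectory."""
    return sum(gamma ** (t - 1) * r for t, r in enumerate(rewards, start=1))
```

For example, with c_t = [0, 1, 1, 0], only actions 1 and 2 may be chosen at this step.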
The DCMDP is a special case of a constrained MDP [2] whose constraints are deterministic, meaning that p(c | s) = 1 (which indicates that the relationship between constraints and the environment is certain; hence, learning the constraints implies learning the environment as well) and that the constraint is known to the agent before taking an action. This property leads to a solver different from that of Altman et al. [2].

VRPTW
For the VRP, suppose there exist a depot and |N| − 1 customers in different locations, for |N| total nodes, where N is the set of nodes. The depot is a special node at which vehicles must start, end, and collect items. For any nodes i, j, we have d_{i,j} > 0, ∀i, j ∈ N, i ≠ j, where d_{i,j} is the distance between nodes i and j. A vehicle sends items to customers and should return to the depot when its residual carrying capacity cannot satisfy any customer demand. Given enough vehicles having the same capacity and starting from the depot, our goal (objective) is to satisfy all the customers' demands with the shortest path. The environment and goal of the VRPTW are similar to those of the VRP except that each customer owns its individual available time window. Moreover, we assume that all vehicles have the same velocity and must arrive within a customer's time window.
VRPs are often represented as MILPs. However, MILP does not fit the RL form well (since RL is always represented as a Markov decision process), and putting it in the RL framework requires a transformation. It has been proved that a traveling salesman problem (TSP) can be transformed to an MDP [3]. To strengthen this conclusion, we provide a lemma.
Lemma 1. Standard VRP and VRPTW problems can be converted to a DCMDP.
Proof 1. Through the giant-tour representation [10], the VRP and VRPTW can be treated as a special form of the TSP (i.e., in some states, some actions are restricted and cannot be chosen). According to Bello et al. [3], the TSP can be transformed to an MDP. Thus, we can conclude that the VRP and VRPTW can be formed as a DCMDP.
Remark. Lemma 1 reveals the interesting phenomenon that we can formulate the VRP as a DCMDP and represent the load and time windows as constraints.
The trajectory probability of the total process for the DCMDP is

p_θ(τ) = p(s_1) ∏_{t=1}^{T} π(a_t | s_t, c_t; θ) p(s_{t+1} | s_t, a_t) p(c_t | s_t),   (1)

where θ represents the weights of the policy, s_{T+1} is the terminal state, and p(s_{T+1} | s_T, a_T) = 1. This trajectory is based on the probability graph model shown in Figure 1. Specifically, for the VRP, c_t means the load constraints, while for the VRPTW, c_t means the constraints of both the load and the time windows of customers. Our goal is to find an optimal policy π* that maximizes the total reward R.
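Since the VRP environment is deterministic, the transition and constraint factors of a realized trajectory each equal 1, so the log of the trajectory probability reduces to a sum of policy log-probabilities (plus log p(s_1) = 0 for a fixed start). A minimal sketch under that assumption:

```python
import math

def trajectory_log_prob(policy_probs):
    """log p(tau) when p(s_{t+1}|s_t,a_t) = p(c_t|s_t) = 1 for the realized
    trajectory: only the pi(a_t|s_t,c_t) factors contribute."""
    return sum(math.log(p) for p in policy_probs)
```

For instance, two steps each taken with probability 0.5 give log 0.25.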


Successor Representation
Introduced to describe cognitive phenomena in the human brain, successor representation (SR) focuses on the extraction of important states to aid training [20]. We leverage SR as a behavior indicator: states that may lead to a lower total reward are regarded as ill-famed, and SR is used to steer the policy away from actions leading to those states. For example, in the VRP, we might not want our vehicle to return to the depot too soon; the ill-famed state is the situation in which the agent is at the depot without having depleted its load.

Methodology
Our methodology has three parts: first, we give the policy gradient for the DCMDP to formally represent a simple solver for the VRP; then, a constraint-awareness module is presented to suppress actions that violate constraints; finally, an SR-based method is designed to address the sparse reward problem.

Policy Gradient in DCMDP
A policy gradient for the DCMDP must be built to train our policy function. The trajectory probability p_θ(τ) reveals the rollout process. Taking logarithms on both sides of Equation (1), we have

log p_θ(τ) = log p(s_1) + ∑_{t=1}^{T} log π(a_t | s_t, c_t; θ) + ∑_{t=1}^{T} log p(s_{t+1} | s_t, a_t) + ∑_{t=1}^{T} log p(c_t | s_t).   (2)

To generalize, we take the expectation over trajectories, and with Jensen's inequality [24], the expectation can be formed as

log E_{τ∼p_θ(·)}[R(τ)] ≥ E_{τ∼p_θ(·)}[log R(τ)].   (3)

The right side of the equation is a lower bound of the expectation over trajectories. With reinforcement learning [28], the function E_{τ∼p_θ(·)}[log p_θ(τ) R(τ)] can be written as

∇_θ J(θ) = E_{τ∼p_θ(·)}[R(τ) ∇_θ log p_θ(τ)].   (4)

However, this form of policy gradient may lead to high variance, which restricts the ability to generalize. To reduce the variance, we define the baseline function b(s, c): S × C → ℝ, which we show in Equation (5) is unbiased:

E_{τ∼p_θ(·)}[b(s_t, c_t) ∇_θ log π(a_t | s_t, c_t; θ)] = 0.   (5)

b(s, c) is similar to the baseline methods in standard variance-reduction RL; differently, we extend the baseline function with constraints as input.
However, b(s, c) is a one-step baseline function. To extend it to the trajectory form, we can simply sum the terms as b̂(κ) = ∑_{t=1}^{T} b(s_t, c_t), or build a neural network to represent it as b̂(κ; λ), where λ denotes the weights of the neural network and κ = [s_1, c_1, s_2, c_2, …, s_T, c_T] is the vector of all states and constraints for a trajectory. Since p(s_{t+1} | s_t, a_t) and p(c_t | s_t) are irrelevant to θ, based on the analysis above, we have the policy gradient for the DCMDP:

∇_θ J(θ) = E_{τ∼p_θ(·)}[(R(τ) − b̂(κ)) ∑_{t=1}^{T} ∇_θ log π(a_t | s_t, c_t; θ)].   (6)
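A minimal single-trajectory sketch of this gradient estimator, assuming the per-step gradient vectors of log π and the per-step baseline values are already available (a toy, not the paper's batched implementation):

```python
import numpy as np

def dcmdp_policy_gradient(grad_log_probs, total_reward, baseline_per_step):
    """Monte-Carlo estimate of the DCMDP policy gradient with a trajectory
    baseline b_hat(kappa) = sum_t b(s_t, c_t).

    grad_log_probs: per-step gradient vectors of log pi(a_t | s_t, c_t; theta).
    """
    b_hat = sum(baseline_per_step)           # b_hat(kappa)
    advantage = total_reward - b_hat         # R(tau) - b_hat(kappa)
    return advantage * np.sum(grad_log_probs, axis=0)

g = dcmdp_policy_gradient([np.array([1.0, 0.0]), np.array([0.0, 2.0])],
                          total_reward=3.0, baseline_per_step=[1.0, 1.0])
```

In practice this estimate would be averaged over a batch of sampled trajectories before a gradient-ascent step on θ.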

Figure 1
The rollout process. A vehicle starts from the depot and must deliver items to customers (represented as N_1 … N_4). For example, in the first step, nodes 1 and 3 are banned for some reason (the arrow from constraint to action is black). Thus, actions can only be chosen among a_1 and a_4. Under the greedy strategy, the vehicle chooses the action with the maximum probability. This continues from step 1 to T, forming the rollout trajectory.
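The greedy masked rollout of Figure 1 can be sketched as follows; the policy and constraint functions are hypothetical stand-ins for the trained network and the environment's constraint indicator.

```python
import numpy as np

def greedy_rollout(policy_logits_fn, constraint_fn, state, steps):
    """At each step, ban actions whose constraint bit is 0 and greedily take
    the highest-scoring surviving action."""
    tour = []
    for _ in range(steps):
        logits = np.asarray(policy_logits_fn(state), dtype=float)
        c = np.asarray(constraint_fn(state))   # 0 = banned, 1 = allowed
        logits[c == 0] = -np.inf               # mask banned nodes
        action = int(np.argmax(logits))        # greedy choice
        tour.append(action)
        state = state + [action]               # placeholder transition
    return tour

# With nodes 0 and 2 banned, the rollout picks node 1 (score 0.9 > 0.8).
tour = greedy_rollout(lambda s: [0.1, 0.9, 0.5, 0.8],
                      lambda s: [0, 1, 0, 1], [], 2)
```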


Constraint-awareness Policy Optimization
Equation (6) provides a way to update the policy function. However, using the constraints only as input might still not be enough to capture the information they contain (e.g., the relationship between the environment and the constraints). To further capture this information, we design a method to learn the constraints.
First, we notice that the relationship between constraints, environment, and actions can be represented as π(a | s) = ∑_i π(a | s, c^i) p(c^i | s), where i is the index of the constraint at time t. This indicates that, if trained properly, the policy π can implicitly learn the constraints. We assume there exists a strategy π(a | s) that can find the best action under state s. Regarding the constraints as hidden variables, we have

log p(a | s) = E_{q(c|s,a)}[log p(a, c | s) − log q(c | s, a)] + D_KL(q(c | s, a) ∥ p(c | s, a))   (7)
≥ E_{q(c|s,a)}[log p(a, c | s) − log q(c | s, a)].   (8)

Equations (7) and (8) are the form of the evidence lower bound. From information theory [33], maximizing log p(a | s) is equivalent to minimizing D_KL(q(c | s, a) ∥ p(c | s)).
Notice also that for the VRP, the relationship between constraints and the environment is deterministic, meaning that p(c | s) = 1, as mentioned above. Moreover, c has the same dimension as the action, and for each constraint c^i at time t, c^i = 1 means that the constraint takes no effect. With those conditions, the KL term can be simplified to the divergence between the constraint vector, normalized into a distribution c̄_t, and the policy: D_KL(c̄_t ∥ π(· | s_t, c_t; θ)).
We leverage the max-entropy strategy to induce exploration [21]. Combining constraint-awareness with the max-entropy strategy and rewriting p(c | s) in vector form, constraint-aware policy optimization (CPO) can be formulated as follows:

L(θ) = J(θ) − β ∑_{t=1}^{T} D_KL(c̄_t ∥ π(· | s_t, c_t; θ)),

where β is a positive parameter balancing the objective function and the KL term.
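A small sketch of the KL penalty, under our reading that the 0/1 constraint vector is normalized into a distribution c̄_t and compared against the policy; the function and variable names are illustrative.

```python
import numpy as np

def constraint_kl(c_t, policy_probs, eps=1e-12):
    """KL(c_bar || pi): divergence between the constraint vector normalized
    into a distribution over actions and the policy's action probabilities.
    c_t: 0/1 vector, where 1 means the constraint takes no effect."""
    c_bar = np.asarray(c_t, dtype=float)
    c_bar = c_bar / c_bar.sum()              # normalize to a distribution
    p = np.asarray(policy_probs, dtype=float)
    nz = c_bar > 0                           # 0 * log(0/q) = 0 by convention
    return float(np.sum(c_bar[nz] * np.log(c_bar[nz] / (p[nz] + eps))))
```

A policy that spreads its mass uniformly over the allowed actions incurs (near-)zero penalty; concentrating mass on one allowed action while ignoring another yields a positive penalty.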
Remark. Because of the deterministic constraints in VRPs, through learning the constraints, agents can implicitly learn the dynamic functions of the environment, and hence can know the results of choosing a certain action, especially worse actions. Thus, our method is also called implicit CPO (ICPO).

Behavior Indicator
There is still one issue. Rewards are always sparse for the VRP [23]; that is, agents can only access the final reward at the last step, resulting in a hard credit assignment problem [11]. Hence, making accurate credit assignments is crucial. A property of the VRP is that there exist actions such that the more often they are chosen, the lower the expected reward is. To leverage this property, we choose SR as the tool to indicate these bad behaviors.
Recall the general form of the reward function: R = ∑_{t=1}^{T} r(s_t, a_t). Due to the nature of the VRP, the reward is only available at the end of an episode: R = r(s_T, a_T). To mitigate the sparse reward, we take the ill-famed states into consideration and rewrite the total reward as R̃ = ∑_{t=1}^{T} −1[s_t ∈ S̃] + R, where S̃ is the set of ill-famed states and s_t ∈ S̃ means that the state at time t is ill-famed. As mentioned above, an ill-famed state is one that may lead to a low total reward. We add a negative term because we want these states to appear as rarely as possible. Now the ill-famed state term 1[s_t ∈ S̃] can be trained by SR as

ψ(s_t) = 1[s_t ∈ S̃] + γ E[ψ(s_{t+1})],

whose right-hand side has the same recursive format as temporal difference learning [28].
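This SR recursion has the same shape as a temporal-difference update; a minimal tabular sketch, where the dictionary-based setup, symbol ψ, and names are our assumptions:

```python
def sr_update(psi, s, s_next, ill_famed, gamma=0.9, lr=0.1):
    """One TD-style update of the successor-representation estimate
    psi(s) ~ 1[s in S_tilde] + gamma * psi(s_next).
    psi: dict mapping states to current estimates."""
    indicator = 1.0 if s in ill_famed else 0.0
    target = indicator + gamma * psi.get(s_next, 0.0)  # bootstrapped target
    old = psi.get(s, 0.0)
    psi[s] = old + lr * (target - old)                 # TD step
    return psi

# With lr = 1.0, an ill-famed state "A" gets value 1, and its predecessor
# "B" inherits the discounted value 0.9.
psi = {}
sr_update(psi, "A", "B", {"A"}, gamma=0.9, lr=1.0)
sr_update(psi, "B", "A", {"A"}, gamma=0.9, lr=1.0)
```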
With the help of SR, agents take possible bad behaviors into account to make better decisions. ICPO with SR is expressed as follows:

L̃(θ) = J̃(θ) − β ∑_{t=1}^{T} D_KL(c̄_t ∥ π(· | s_t, c_t; θ)),

where J̃(θ) is the expected value of the rewritten total reward R̃, and c̄_t is the constraint vector normalized into a distribution.

With the help of SR, agents will take possible bad (8) Equations (7) and (8) are the form of the evidence lower bound. From information theory [33], we have

Constraint-awareness Policy Optimization
Equation (6) provides a way to update the policy function. However, to only use the constraints as input might still not be enough to find information in the constraints (e.g., the relationship between environment and constraints). To further capture this information, we design a method to learn the constraints.
First, we notice that the relationship between constraints, environment, and actions can be represented as π( | ) = ∑ π� � , � � � �, where is the index of the constraint at time . This indicates that if trained properly, the policy π can implicitly learn the constraints. We assume there exists a strategy π( | ) that can find the best action under state . Regarding the constraints as the hidden variables, we have: Equations (7) and (8) are the form of the evidence lower bound. From information theory [33], we have Thus, to maximize log ( | ) is equivalent to minimizing D KL � ( , | ) ∥ ( | )�.
Notice also that for VRP, the relationship between constraints and the environment is deterministic, meaning that ( | ) = 1 , as mentioned above. Moreover, ( | ) has the same dimension with action, and for each constraint at time , = 1 means that the constraint takes no effect. With those conditions, we have Thus, the KL term can be simplified to exploration.
Constraint-awareness policy optimization (CPO) can be formulated as follow: where β is a positive parameter to balance the objective function and KL term.
Remark. Because of the deterministic constraints in VRPs, through learning the constraints, agents can implicitly learn the dynamic functions of the environment, and hence can know the results of choosing a certain action, especially worse actions. Thus, our method is also called implicit CPO (ICPO).

Behavior Indicator
There is still one issue. The rewards are always sparse for VRP [23], that is, agents can only access the final reward at the last step, resulting in a bad credit assignment problem [11]. Hence, to make accurate credit assignments is crucial. A property of VRP is that there exist such actions that the more you choose the lower the expected rewards is. To leverage this property, we choose SR as the tool to indicate these bad behaviors.
Recall the general form of the reward function: = ∑ ( , ) �1 . Due to the nature of VRP, the reward is only available at the end of an episode: = ( , ). To mitigate the sparse reward, we take the ill-famed states into consideration and rewrite the total reward as � = ∑ −1� ∈ � � �1 + , where � is a set of ill-famed states and ∈ � means that the state into time , is the ill-famed state. As mentioned above, an ill-famed state is one that may lead to low total reward. We add a negative term because we want these states to appear as little as possible. Now the ill-famed state term 1� ∈ � � can be trained by SR as The right-hand side and it has the same recursive format as temporal difference [28]. (9) Thus, to maximize log ( | ) is equivalent to minimizing D KL ( ( , | ) ∥ ( | )).
Notice also that for VRP, the relationship between constraints and the environment is deterministic, meaning that ( | ) = 1 , as mentioned above. Moreover, ( | ) has the same dimension with action, and for each constraint i at time t, t = 1 means that the constraint takes no effect. With those conditions, we have minimizing D KL � ( , | ) ∥ ( | )�.
Notice also that for VRP, the relationship between constraints and the environment is deterministic, meaning that ( | ) = 1 , as mentioned above. Moreover, ( | ) has the same dimension with action, and for each constraint at time , = 1 means that the constraint takes no effect. With those conditions, we have ( , | ) = π( | , ) ( | ) s.t.
We leverage the max entropy strategy to induce exploration [21]. We combine constraint-awareness and the max entropy strategy and rewrite indicates that when node constraint of node exists: = 0. In this condition, the probability to choose action to that node is zero. We also average all available actions and minimize the distance between � and through the KL term with a policy to encourage  (10) Thus, the KL term can be simplified to � ( , | ) ∥ ( | )�
Notice also that for VRP, the relationship between constraints and the environment is deterministic, meaning that ( | ) = 1 , as mentioned above. Moreover, ( | ) has the same dimension with action, and for each constraint at time , = 1 means that the constraint takes no effect. With those conditions, we have ( , | ) = π( | , ) ( | ) s.t.
We leverage the max entropy strategy to induce exploration [21]. We combine constraint-awareness and the max entropy strategy and rewrite indicates that when node constraint of node exists: = 0. In this condition, the probability to choose action to that node is zero. We also average all available actions and minimize the distance between � and through the KL term with a policy to encourage Recall the ge is only avai ( , ). To ill-famed sta total reward a set of ill-f state into t mentioned a lead to low because we possible. No can be traine The right-ha format as tem With the hel behaviors in ICPO with S where α is a influence of Remark. 1) in the oppo consideratio formulation (11) We leverage the max entropy strategy to induce exploration [21]. We combine constraint-awareness and the max entropy strategy and rewrite ( | ) in vector form as Thus, to maximize log ( | ) is equivalent to minimizing D KL � ( , | ) ∥ ( | )�.
Notice also that for VRP, the relationship between constraints and the environment is deterministic, meaning that ( | ) = 1 , as mentioned above. Moreover, ( | ) has the same dimension with action, and for each constraint at time , = 1 means that the constraint takes no effect. With those conditions, we have ( , | ) = π( | , ) ( | ) s.t.
We leverage the max entropy strategy to induce exploration [21]. We combine constraint-awareness and the max entropy strategy and rewrite indicates that when node constraint of node exists: = 0. In this condition, the probability to choose action to that node is zero. We also average all available actions and minimize the distance between � and through the KL term with a policy to encourage Notice also that for VRP, the relationship between constraints and the environment is deterministic, meaning that ( | ) = 1 , as mentioned above. Moreover, ( | ) has the same dimension with action, and for each constraint at time , = 1 means that the constraint takes no effect. With those conditions, we have Thus, the KL term can be simplified to min � ( , | )| ( | )� = min (π( | } , )| ( | ) .
We leverage the max entropy strategy to induce exploration [21]. We combine constraint-awareness and the max entropy strategy and rewrite ( | )  indicates that when node constraint of node exists: = 0. In this condition, the probability to choose action to that node is zero. We also average all available actions and minimize the distance between � and through the KL term with a policy to encourage Thus, to maximize log ( | ) is equivalent to minimizing D KL � ( , | ) ∥ ( | )�.
Notice also that for VRP, the relationship between constraints and the environment is deterministic, meaning that ( | ) = 1 , as mentioned above. Moreover, ( | ) has the same dimension with action, and for each constraint at time , = 1 means that the constraint takes no effect. With those conditions, we have ( , | ) = π( | , ) ( | ) s.t.
Thus, the KL term can be simplified to We leverage the max entropy strategy to induce exploration [21]. We combine constraint-awareness and the max entropy strategy and rewrite ( | ) in vector indicates that when node constraint of node exists: = 0. In this condition, the probability to choose action to that node is zero. We also average all available actions and minimize the distance between � and through the KL term with a policy to encourage indicates that when node constraint of node i exists: t = 0. In this condition, the probability to choose action to that node is zero. We also average all available actions and minimize the distance between p and p through the KL term with a policy to encourage exploration.
Constraint-awareness policy optimization (CPO) can be formulated as follow:
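The masked-uniform target p̂ and the KL regularizer used by CPO can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names and the toy policy vector are ours; only the construction of p̂ (zero mass on constrained nodes, uniform over available ones) follows the text.

```python
import numpy as np

def p_hat(constraint_mask):
    """Target distribution p-hat(a|s): zero probability for nodes whose
    constraint is active (mask entry 0), uniform over available nodes."""
    m = np.asarray(constraint_mask, dtype=float)
    return m / m.sum()

def kl(p, q, eps=1e-12):
    """D_KL(p || q); entries with p == 0 contribute nothing."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / (q[nz] + eps))))

# c_t^i = 1 -> constraint takes no effect; c_t^i = 0 -> node i is infeasible
c_t = [1, 0, 1, 1]
target = p_hat(c_t)                   # uniform over the three feasible nodes
pi = np.array([0.5, 0.1, 0.2, 0.2])   # a toy policy output
penalty = kl(pi, target)              # added to the loss with weight beta
```

Any probability mass the policy puts on an infeasible node makes the penalty large, which is exactly the pressure the KL term is meant to apply.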

Behavior Indicator
There is still one issue: the rewards are always sparse for the VRP [23]; that is, agents can only access the final reward at the last step, which results in a bad credit assignment problem [11]. Hence, making accurate credit assignments is crucial. A property of the VRP is that there exist actions for which, the more often they are chosen, the lower the expected reward becomes. To leverage this property, we choose SR as the tool to indicate these bad behaviors.
Recall the general form of the reward function: R = Σ_{t=1}^{T} r(s_t, a_t). Due to the nature of the VRP, the reward is only available at the end of an episode: R = r(s_T, a_T). To mitigate the sparse reward, we take the ill-famed states into consideration and rewrite the total reward as R̃ = −Σ_{t=1}^{T} 1(s_t ∈ Ŝ) + R, where Ŝ is a set of ill-famed states and s_t ∈ Ŝ means that the state at time t is an ill-famed state. As mentioned above, an ill-famed state is one that may lead to a low total reward. We add a negative term because we want these states to appear as rarely as possible. Now the ill-famed state term 1(s_t ∈ Ŝ) can be trained by SR as

ψ(s_t) = 1(s_t ∈ Ŝ) + γ ψ(s_{t+1}).   (13)

The right-hand side has the same recursive format as the temporal difference [28].
With the help of SR, agents take possible bad behaviors into account to make better decisions. ICPO with SR is expressed as

J(θ) = J_CPO(θ) − α ψ(s_t),   (14)

where α is a positive hyperparameter to control the influence of the successors.
Remark. 1) We emphasize that SR can be extended in the opposite way; that is, we can maximize the consideration of actions that are encouraged. The formulation is the same, but with a positive sign. 2) SR also acts as a regulator to balance human intuition and learning results.
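The recursion ψ(s_t) = 1(s_t ∈ Ŝ) + γ ψ(s_{t+1}) says the successor value is the discounted count of future ill-famed visits. A small sketch (the episode and the set Ŝ are made up for illustration):

```python
GAMMA = 0.95

def successor_values(states, bad_states, gamma=GAMMA):
    """Backward pass: psi(s_t) = 1(s_t in S_hat) + gamma * psi(s_{t+1}),
    i.e., the discounted number of ill-famed states from t onward."""
    psi = [0.0] * (len(states) + 1)          # psi after the episode ends is 0
    for t in range(len(states) - 1, -1, -1):
        psi[t] = float(states[t] in bad_states) + gamma * psi[t + 1]
    return psi[:-1]

episode = ["a", "bad1", "c", "bad2", "e"]
bad = {"bad1", "bad2"}
psi = successor_values(episode, bad)
# psi[0] equals sum_k gamma^k * 1(s_k in S_hat), here gamma^1 + gamma^3
```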

Training Method
Loss Functions. We design the policy, the baseline function, and the successor value as neural networks with parameters θ, λ, and η, respectively. The losses of the baseline function and the successor value are

L(λ) = E_{(s_t, c_t) ∼ z} [ (R − b(s_t, c_t; λ))² ],   (15)

L(η) = E_{(s_t, s_{t+1}) ∼ z} [ (1(s_t ∈ Ŝ) + γ ψ(s_{t+1}; η) − ψ(s_t; η))² ],   (16)

where z is the episode buffer. All the loss functions can be estimated through Monte Carlo sampling. The policy gradient with the constraint-aware module and the successor representation can be formed as Equation (17):

∇_θ J(θ) = E [ (R − b(s_t, c_t; λ) − α ψ(s_t; η)) ∇_θ log π(a_t | s_t, c_t; θ) ] − β ∇_θ D_KL( π(a_t | s_t, c_t; θ) ∥ p̂(a_t | s_t) ).   (17)
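The baseline and successor losses can be estimated from the episode buffer by Monte Carlo sampling. A minimal sketch: the constant `lambda` predictors below stand in for the networks b(s, c; λ) and ψ(s; η), and the toy buffer is illustrative.

```python
import numpy as np

def baseline_loss(buffer, total_reward, baseline):
    """MSE between the episode return and the baseline b(s_t, c_t)."""
    preds = np.array([baseline(s, c) for (s, c) in buffer])
    return float(np.mean((total_reward - preds) ** 2))

def successor_loss(states, bad_states, psi, gamma=0.95):
    """Mean squared TD error of the successor value:
    (1(s_t in S_hat) + gamma * psi(s_{t+1}) - psi(s_t))^2."""
    errs = []
    for t in range(len(states) - 1):
        target = float(states[t] in bad_states) + gamma * psi(states[t + 1])
        errs.append((target - psi(states[t])) ** 2)
    return float(np.mean(errs))

# toy stubs: constant predictors in place of trained networks
buf = [((0.0,), (1,)), ((1.0,), (1,))]
bl = baseline_loss(buf, total_reward=-3.0, baseline=lambda s, c: -2.5)
sr = successor_loss(["a", "bad", "c"], {"bad"}, psi=lambda s: 0.0)
```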
The pseudocode can be found in Algorithm 1.

Algorithm 1
Implicit Constraint-awareness Policy Optimization
1: Initialize the parameters θ, λ, and η for the actor, successor, and baseline function
2: Generate the training dataset
3: while not converged do
4:  Sample s_0 from the training dataset
5:  Initialize the history information h_0
6:  Initialize the episode buffer z_0
Rollout Stage
7:  for t = 0 to the maximum number of nodes do
8:   Select action a_t ∼ π(a_t | s_t) with the Boltzmann exploration strategy
9:   Update the environment with the dynamic function
10:  z_{t+1} = z_t ∪ (s_t, a_t, c_t)
11: end for
Train Stage
12: Update θ, λ, and η through Equations (17), (15), and (16), respectively
13: end while
Network Structure. The policy network structure has three parts: decision making, encoder, and attention mechanism [3,23]. The first is used to choose the policy, and the second and third to encode the graph and capture valuable information. A gated recurrent unit is introduced to seize the long-term effects [5]. We leverage the attention mechanism output as the input of the successor function. For the baseline function, since the state and constraints are taken as input, we build the model without sharing variables with the policy networks.
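The rollout-and-train control flow of Algorithm 1 can be sketched as follows. Everything here is a stub (the state, the random action choice, the constant constraint, and the environment update are placeholders); only the loop structure follows the pseudocode.

```python
import random

def train(num_iters=3, num_nodes=4):
    """Skeleton of Algorithm 1: the rollout stage fills the episode
    buffer z; the train stage would update theta, lambda, and eta
    via Equations (17), (15), and (16) (stubbed out here)."""
    random.seed(0)
    buffers = []
    for _ in range(num_iters):                # 'while not converged'
        s = 0                                 # sample s_0 (stub state)
        z = []                                # episode buffer z_0
        for t in range(num_nodes):            # rollout stage
            a = random.randrange(num_nodes)   # a_t ~ pi with exploration
            c = 1                             # constraint c_t (stub)
            z.append((s, a, c))               # z_{t+1} = z_t U (s_t, a_t, c_t)
            s = s + 1                         # environment dynamics (stub)
        buffers.append(z)                     # train stage would run here
    return buffers

bufs = train()
```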
For a full understanding of our network, we explain each structure. Our network structure, shown in Figure 2, is mainly built from Nazari et al. [23] and Vinyals et al. [31], with some improvements. Unless otherwise mentioned, the activation is ReLU [22].

Figure 2
Structure of the policy, baseline, and successor networks. The structure has three parts: (a) the encoder network aims to convert states to embeddings; E_a, E_s^L, and E_c^L are the encoders for the actions, states, and constraints, respectively; (b) the policy and successor; h_t is the long-term hidden state for the RNN, and the context is useful information extracted from the attention model; and (c) the baseline function.
Objective Function. The objective function for the VRP is to minimize the tour length, but the objective function in RL is always represented as maximizing the total reward. Therefore, we set the total reward in the VRP as the negative tour length.
Input. In the VRPTW, we set the current state and constraints as input. The state for the VRP includes the locations of the nodes, the remaining load, and the demands of the customers. For the VRPTW, since each customer has its own time window, we add two features to provide extra information: 1) the time window of each customer; and 2) the current time. Moreover, the previous action is also added as input to reveal the current location of the vehicle. The input size is batch × 2 × |V| for the locations of the nodes and the time windows. The size of the other features is batch × 1 × |V|.
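Under this layout, the feature tensors would have shapes like the following NumPy sketch; the batch size and node count are illustrative, not values from the paper.

```python
import numpy as np

batch, n_nodes = 8, 21          # e.g., 20 customers + 1 depot (illustrative)

# two-channel features: batch x 2 x |V|
locations   = np.zeros((batch, 2, n_nodes))   # (x, y) per node
time_window = np.zeros((batch, 2, n_nodes))   # (open, close) per node

# single-channel features: batch x 1 x |V|
demand      = np.zeros((batch, 1, n_nodes))
remain_load = np.zeros((batch, 1, n_nodes))
current_t   = np.zeros((batch, 1, n_nodes))
prev_action = np.zeros((batch, 1, n_nodes))   # reveals the vehicle position
```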

Attention Mechanism. An attention mechanism (AM) captures the internal relationships within a graph, where the embedding taken from the GRU is used as the input embedding. The output of the AM is a context, a 32-dimensional vector combining short-term and long-term information. The setting of the AM is similar to Luong et al. [19].
Decision Making Network. The output dimension of the decision-making network is |V|, and the input is the context from the AM. Softmax is used as the activation function in the final layer to generate the probability of each action.

Successor. A successor is an indicator of possible bad behaviors. We take the context as input, and the output dimension is 1 (the successor value is a scalar).
Baseline Function. The baseline function consists of three 1 × 1 convolutional neural networks to extract state and constraint information. The output dimension is 1, a scalar.
Action Encoder. The action encoder converts the last-step action to a vector. The output is of size 16.

Experiments
To verify the effectiveness of our approach, we conducted extensive experiments on the VRPTW with two different datasets: a generated dataset and the Solomon benchmark [27]. The Solomon benchmark, a well-known dataset for VRPTW studies, contains multiple instances at three scales: 25, 50, and 100. Like most learning-based methods, our approach requires substantial training data, and its precision advantage shows mainly in aggregate statistics; it is therefore necessary to build a generated dataset to supplement the Solomon benchmark, which contains only dozens of instances. Based on the following rules, we randomly generated 100,000 training samples and 1,000 testing samples for each scale of the VRPTW 1 and compared our approach with the other baselines on the two datasets.

Generated Dataset
The VRPTW is similar to the VRP, but each customer has its own time window. We generated instances with 10, 20, 50, and 100 nodes with random locations and demands [23]. Each node was randomly located in a two-dimensional discrete coordinate system with range [0, 100], and its demand was drawn from a uniform distribution U(1, 10). The vehicle capacities were 20, 30, 40, and 50 for sizes 10, 20, 50, and 100, respectively.
Assume that at time step t, a vehicle with current load l_t is preparing to send items to customer i, who requires ε_i items. When ε_i ≤ l_t, the vehicle can deliver the items, and the remaining load becomes l_{t+1} = l_t − ε_i. Otherwise, the trade cannot be established. Moreover, when no customer can be satisfied, the vehicle is forced to return to the depot.
1 The source code can be visited at https://gitee.com/MARL_Researcher/vrptw-generator.git
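The generation rules and the load update above can be sketched as follows. This is our own illustration of the stated rules, not the released generator at the Gitee link.

```python
import random

# vehicle capacity per problem size, as stated in the text
CAPACITY = {10: 20, 20: 30, 50: 40, 100: 50}

def generate_instance(n, seed=0):
    """n customers with random integer locations in [0, 100]^2 and
    demands drawn uniformly from {1, ..., 10}."""
    rng = random.Random(seed)
    locs = [(rng.randint(0, 100), rng.randint(0, 100)) for _ in range(n)]
    demands = [rng.randint(1, 10) for _ in range(n)]
    return locs, demands

def deliver(load, demand):
    """Serve a customer only if the remaining load covers the demand:
    l_{t+1} = l_t - eps_i; otherwise the trade is not established."""
    if demand <= load:
        return load - demand, True
    return load, False

locs, demands = generate_instance(10)
load, ok = deliver(CAPACITY[10], demands[0])
```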

The time window is the key difference between the VRP and the VRPTW. Since the VRPTW must consider time, the time step t no longer reveals the actual time. Here, we assume that each step indicates a transition to a new step, and we denote the current time as t̂. We formally illustrate the updates of t and t̂ below. We assume the velocity of a vehicle is constant; hence, time is proportional to distance. To avoid the case in which a vehicle can never reach some customers, the travel time to the farthest node from the depot is kept smaller than that customer's time window. When the current time t̂ is not within the range of customer i's time window, the trade cannot be established. Similar to the VRP, when no customer can be satisfied, the vehicle is required to return to the depot.
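The time-window feasibility rule (constant velocity, time proportional to distance, service only inside the window) can be sketched as follows; the window bounds and the distance-to-time scaling are illustrative assumptions.

```python
import math

def travel_time(a, b, speed=1.0):
    """Constant velocity: time is proportional to Euclidean distance."""
    return math.dist(a, b) / speed

def can_serve(now, pos, customer_pos, window):
    """The trade is established only if the arrival time falls inside
    the customer's time window [open, close]."""
    arrival = now + travel_time(pos, customer_pos)
    lo, hi = window
    return lo <= arrival <= hi, arrival

ok, arrival = can_serve(now=0.0, pos=(0, 0), customer_pos=(3, 4), window=(0, 10))
late, _ = can_serve(now=8.0, pos=(0, 0), customer_pos=(3, 4), window=(0, 10))
```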

Experiment Setup
We trained our model on a single GeForce RTX 2080, using Adam as the optimizer [14]. For each scale of the model, we performed 10 cycles of training on a training set containing 100,000 instances. The model is considered to have converged at the end of the training. The training time for the VRPTW is 4h, 7h, 11h, and 21h respectively, for 10, 20, 50, and 100 nodes, with a batch size of 256. Boltzmann exploration was used to improve the quality of our method. We used beam search (BS), a widely used optimization method in natural language processing, as a search strategy [13,15], and γ was 0.95.
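Beam search over the policy's per-step probabilities can be sketched as follows. This is a generic beam search; the per-step log-probability table stands in for the trained policy and is not part of the paper's code.

```python
import math

def beam_search(step_logprobs, beam_width=2):
    """Keep the beam_width highest-scoring partial sequences at each step;
    step_logprobs[t][a] is the log-probability of action a at step t."""
    beams = [([], 0.0)]                      # (sequence, cumulative log-prob)
    for logps in step_logprobs:
        candidates = [
            (seq + [a], score + lp)
            for seq, score in beams
            for a, lp in enumerate(logps)
        ]
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]                          # best (sequence, log-probability)

logp = math.log
steps = [[logp(0.6), logp(0.4)], [logp(0.3), logp(0.7)]]
best_seq, best_score = beam_search(steps)
```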
Reward Setting. Common objective functions of the VRPTW include the delivery percentage, total tour length, or both [16]. In our experiments, we chose the total tour length as our objective function, and the total reward was the negative tour length.
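With this reward setting, the total reward of an episode is simply the negative length of the tour; a sketch (assuming a route given as a closed sequence of (x, y) coordinates):

```python
import math

def route_reward(route):
    """Reward as described in the text: the negative total tour length
    of a route given as a sequence of (x, y) coordinates."""
    length = sum(math.dist(a, b) for a, b in zip(route, route[1:]))
    return -length
```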

Ablations. We conducted substantial ablation studies: 1) ICPO: our complete method. 2) ICPO w/o SR: removes SR from our framework. 3) ICPO w/o KL: removes the constraint-awareness module from our framework. 4) Original policy gradient (PG): removes KL, SR, and the constraints as input (the original PG method).

Results
Performance. Table 1 presents the results of the compared algorithms on the generated dataset, where the vehicle capacities were 20, 30, 40, and 50 for sizes 10, 20, 50, and 100, respectively, and each size contains 1,000 instances. Each pair of columns reports the mean total distance and total CPU time for an environment with a given number of nodes. As shown in Table 1, our approaches dramatically outperformed the other baselines in most VRPTW environments and had the fewest outliers. In particular, the solutions of the genetic algorithm and OR-Tools deteriorated quickly on the large-scale problems. We attribute this to those methods lacking a sufficient number of iterations within a reasonable time.
Moreover, we find that as the size of the VRPTW increases, our method's advantage over the RL baselines grows. This is because as the size of the VRPTW increases, the constraints disturb the solver more severely, and a method that does not consider the constraints produces much worse results. Table 2 presents the results on the Solomon benchmark; the best known solutions are those reported for the Solomon dataset [27]. Since this benchmark has insufficient data to train the neural network models, we pre-trained the model on the generated dataset. In addition, we employed beam search to improve our solutions at rollout. The results show that our approach outperformed the RL method.

Runtime.
We compared the runtimes of our method to the baselines, as shown in Figure 3. Due to the great disparity between methods (ours used only 14 seconds, while OR-Tools took about 2 minutes on the VRPTW 50 to solve 1,000 instances), we took the log2 of the GA and OR-Tools times to make the figure legible. However, the runtimes of these two methods were so long that, even in log form, the gap was still obvious. Thus, we used two y-axes: the y-axis on the right is for GA and OR-Tools, and that on the left is for the others. Our method is in the middle of all the methods as regards speed (an acceptable running time). Combining Figure 3 and Table 2, we can see that although NN is the fastest, its performance is the worst, making it hard to use in practice, while ours maintains a good balance between runtime and solution quality.

Table 1: Mean distance and CPU times of the compared methods (s means seconds).

Figure 3: Runtime of five methods.

Ablations. As shown in Table 1, the complete version of ICPO achieved the highest scores among the three variants. From the ablation presented in Figure 4, we find that ICPO w/o KL is worse than ICPO w/o SR, revealing that constraint awareness plays an important role in obtaining a good solution, which agrees with our theory. Compared with the original PG, the performance of our method is dramatically better, revealing that ICPO has an advantage in the VRPTW.

Figure 4
Ablation of ICPO

Conclusion
We developed a constraint-aware RL method that captures constraint information to improve performance. Specifically, we reformulated the VRPTW and the PG method as a DCMDP. To capture the constraints, we designed a constraint-awareness module that reduces the probability of actions violating the constraints and thereby enhances performance. For bad behaviors that could decrease the total reward, we leveraged SR as an indicator to diminish the occurrence of those actions. We designed a VRPTW training scheme, and experiments on the generated datasets and the Solomon benchmark revealed that our method outperforms the competing methods.
In the future, we will focus on how to implement the method in practice and consider situations in which agents are competitive.