Automatic Text Summarization Using Deep Reinforcement Learning and Beyond

In the era of big data


Introduction
Artificial intelligence technology is developing rapidly, and its applications across industries are becoming increasingly common. From medical diagnosis to social networks, and from intelligent education to news media, concrete applications of artificial intelligence can be seen everywhere. In daily life, people face information overload, and processing massive amounts of information in limited time has become a real problem. Using computers to understand natural language makes it possible to filter out useless and redundant information and feed back only the key information to users. This specific application is called automatic text summarization: a computer condenses the whole text so that users can grasp the core semantics of the original simply by reading the abstract. A machine learning model that automatically produces summaries can therefore quickly extract key information from massive texts, saving users valuable time. The emergence of automatic text summarization not only reduces information overload but also avoids the high cost of manual summarization.
Automatic text summarization methods fall into two main types. Extractive summarization selects several key sentences from the original text and combines them into an abstract. Abstractive summarization understands the semantics of the original text and summarizes its subject matter in newly generated, human-readable language. At present, extractive summarization is relatively easy to implement, so it is more widely used; abstractive summarization requires the computer to genuinely understand the original text and therefore has higher technical requirements.
Traditional extractive summarization techniques are of two types: graph-based ranking methods and methods based on hand-crafted features. In the graph-based ranking method, each sentence in the document is treated as a node of a graph, the similarity between sentences is computed, and the similarity value is used as the weight of the corresponding edge to construct the graph model. The PageRank algorithm [36] is then used to score each sentence, and the highest-scoring sentences are output as the summary. Representative algorithms are TextRank [30] and LexRank [11] (see the sketch after this paragraph). The feature-based method usually builds a model from features such as the length of each sentence and whether the sentence contains title words; a representative algorithm is TextTeaser. With the rise of deep learning and its powerful feature representation ability, an increasing number of extractive summarization techniques have been proposed and have obtained excellent results. Extractive summarization does not need to consider the syntax and semantics of the extracted summary; however, since the summary is only a combination of original sentences, problems such as incoherence and information redundancy often arise. Comparatively speaking, abstractive summaries are easier for humans to understand. The emergence of deep learning-based sequence generation models [2] has made abstractive summarization a research hotspot. The seq2seq model [5] is applied as a benchmark model in the abstractive summarization task, and many deep learning models continue to emerge, achieving the best results on the corresponding datasets.
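To make the graph-based ranking idea concrete, the sketch below builds a sentence graph from bag-of-words cosine similarities and scores sentences with a PageRank-style power iteration. It is a minimal illustration in the spirit of TextRank/LexRank, not the exact algorithm of either paper; the tokenization, similarity measure, and damping value are our own simplifying assumptions.

```python
import numpy as np
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(a) & set(b)
    num = sum(a[w] * b[w] for w in common)
    den = np.sqrt(sum(v * v for v in a.values())) * np.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def textrank_summary(sentences, top_n=2, damping=0.85, iters=50):
    """Score sentences with a PageRank-style power iteration over a similarity graph."""
    bows = [Counter(s.lower().split()) for s in sentences]
    n = len(sentences)
    W = np.array([[cosine(bows[i], bows[j]) if i != j else 0.0 for j in range(n)] for i in range(n)])
    # Row-normalize so each row becomes a probability distribution over neighbours.
    row_sums = W.sum(axis=1, keepdims=True)
    P = np.divide(W, row_sums, out=np.full_like(W, 1.0 / n), where=row_sums > 0)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - damping) / n + damping * P.T @ scores
    best = np.argsort(-scores)[:top_n]
    return [sentences[i] for i in sorted(best)]  # keep original sentence order

docs = ["The cat sat on the mat.", "A cat was sitting on a mat.", "Stocks fell sharply today."]
print(textrank_summary(docs, top_n=1))
```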
To address the insufficient use of contextual semantic information, the limited semantic understanding of the attention mechanism, and the low accuracy of previous text summarization algorithms, this paper proposes an automatic summarization optimization algorithm. It first uses a seq2seq model with an attention mechanism [45] as the basic model to generate an initial summarization, and then uses deep reinforcement learning to optimize the initial summarization directly against the ROUGE evaluation criteria [23]. The abstract generated by the base model is an abstractive summarization. The proposed optimization algorithm takes, as the action space for reinforcement learning, the top-k words of the base model's output distribution at each decoding step together with the top-k original words of the attention distribution. The initial summarization is then optimized by the reinforcement learning technique.
The main contributions of this paper are as follows: (1) This paper presents a two-stage optimization method for automatic text summarization that combines abstractive summarization and extractive summarization for the first time.

Seq2seq Model
The seq2seq model is widely used in natural language processing [6,34,17]. It consists mainly of an encoder and a decoder. The encoder uses a recurrent neural network (RNN), such as the long short-term memory network (LSTM) [18], to encode the input sequence into a vector of fixed dimension, and the decoder then uses an RNN to decode that vector into an output sequence. Applying the attention mechanism [2] to the seq2seq model makes it possible to assign different weights to different parts of the input sequence during generation. In natural language tasks, the seq2seq model typically uses fixed input and output vocabularies, which results in a poor representation of words outside the vocabulary. Pointing to unusual words or subsequences in the input sequence through the decoder network and copying them directly into the output sequence [47,24] can largely solve this problem. Gulcehre [13] and Merity [29] applied this pointing mechanism to the decoding process, so that the model can not only generate words from the vocabulary but also output uncommon words.
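As a rough illustration of the pointing/copying idea referenced above (in the spirit of pointer-style decoders, not the exact formulation of [13], [29], [47], or [24]), the snippet below mixes a fixed-vocabulary generation distribution with a copy distribution induced by the attention weights. The mixing weight p_gen, the extended-vocabulary layout, and all variable names are illustrative assumptions.

```python
import numpy as np

def mixed_output_distribution(p_vocab, attention, source_ids, vocab_size, p_gen):
    """Combine a generation distribution over the fixed vocabulary with a copy
    distribution over source positions, so out-of-vocabulary source words can
    still be produced by pointing at them."""
    # Extended vocabulary: fixed vocabulary followed by one slot per source position
    # (a simple way to represent copied OOV tokens in this sketch).
    final = np.zeros(vocab_size + len(source_ids))
    final[:vocab_size] = p_gen * p_vocab
    for pos, tok in enumerate(source_ids):
        if 0 <= tok < vocab_size:
            final[tok] += (1.0 - p_gen) * attention[pos]                 # copy an in-vocabulary word
        else:
            final[vocab_size + pos] += (1.0 - p_gen) * attention[pos]    # copy an OOV source word
    return final

# Toy example: vocabulary of 5 word ids, a 3-word source whose last word is OOV (id -1).
p_vocab = np.array([0.1, 0.4, 0.2, 0.2, 0.1])
attention = np.array([0.2, 0.3, 0.5])
dist = mixed_output_distribution(p_vocab, attention, source_ids=[2, 4, -1], vocab_size=5, p_gen=0.7)
print(dist, dist.sum())  # the combined distribution still sums to 1.0
```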

Reinforcement Learning and Sequence Generation
The emergence of AlphaGo has created considerable interest in artificial intelligence. Reinforcement learning, the core technology behind AlphaGo, is a framework for learning control strategies through computational algorithms. Given an agent and an interaction environment [42], the agent can be trained through reinforcement learning to learn a strategy that obtains the maximum reward. Compared with traditional supervised learning, reinforcement learning can be applied when the agent has to perform discrete actions or when the optimization objective is not well defined. Optimizing sequence generation directly with metrics such as BLEU, ROUGE, and METEOR is not differentiable, so reinforcement learning can be applied to sequence generation tasks.
To optimize task evaluation criteria directly, Ranzato [38] used the REINFORCE algorithm [48] to train a recurrent neural network-based model for sequence generation tasks; compared with traditional supervised learning, the results of this reinforcement learning training method were significantly improved. Bahdanau [1] proposed an actor-critic method to train neural network sequence generation, using a decision (actor) network to predict output actions and an evaluation (critic) network to assess the value of the actions generated by the decision network, which also makes the training process more stable. Rennie [39] proposed a self-critical sequence training method that needs no additional evaluation network, and the evaluation results on the image caption generation task improved significantly. Guo [15] proposed iteratively decoding the output sequence based on a deep Q-network [32] (DQN) to train the sequence-to-sequence learning task. Guo's method [15] motivates our automatic text summarization optimization method, which generates an initial summarization and a candidate action space through a seq2seq model with an attention mechanism and then directly optimizes the initial summarization for the evaluation standard (ROUGE) using a DQN.
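The snippet below sketches the core of a REINFORCE-style sequence-level update of the kind used in the works above: a non-differentiable reward (e.g., a ROUGE or BLEU score) scales the log-probability of the sampled sequence, optionally against a baseline score as in self-critical training [39]. The reward values and tensor contents are placeholders, not the cited authors' code.

```python
import torch

def reinforce_loss(log_probs, reward, baseline_reward=0.0):
    """Sequence-level policy-gradient loss.

    log_probs: 1-D tensor of log P(y_t | y_<t, x) for the *sampled* sequence,
               produced by the model with gradients attached.
    reward:    scalar metric score of the sampled sequence (no gradient).
    baseline_reward: score of a baseline sequence (e.g., the greedy decode),
               as in self-critical training; 0.0 recovers plain REINFORCE.
    """
    advantage = reward - baseline_reward
    return -advantage * log_probs.sum()

# Toy usage: pretend these log-probs came from a decoder over a 4-token sample.
log_probs = torch.log(torch.tensor([0.3, 0.5, 0.4, 0.6], requires_grad=True))
loss = reinforce_loss(log_probs, reward=0.42, baseline_reward=0.35)
loss.backward()  # in a real model, gradients flow back into the decoder parameters
```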

Automatic Text Summarization
Automatic summarization research focuses on two areas: text [9,35,27] and speech [50,51]. Although abstractive summarization research has made some progress, most high-performing summarization models are still based on extractive methods. Traditional extractive summarization methods are mostly based on greedy search [3] and graph model methods [11]. Kageback [21] implemented document summarization by deploying a recursive autoencoder [43]. Yin [49] used a convolutional neural network to minimize an objective function based on diversity and importance and selected sentences to generate summaries. Nallapati [34] adapted a question answering dataset of DeepMind [17] into a summarization dataset, the CNN/DailyMail dataset, and proposed the first benchmark model for abstractive summarization on it. On this dataset, Cheng and Lapata [5] proposed an attention-based encoder-decoder framework for extractive summarization. Nallapati [33] also proposed a model that constructs a hierarchical recurrent neural network to select and extract original sentences.
Although the extractive summarization method is simpler and produces fewer errors, it also has problems such as inconsistent contextual semantics and unclear references. The abstractive approach is freer, more in line with human writing and thinking patterns, and can generate new and diverse sentences. With the advent of neural network-based text generation models [36], abstractive summarization technology has become a research hotspot. Rush [40] proposed an attention model with a convolutional encoder. On the CNN/DailyMail dataset, Chen [4] proposed a novel attention mechanism and applied it to a summarization generation model. On the CNN/DailyMail dataset, Nallapati [34] constructed a hierarchical network structure model using a hierarchical attention mechanism and pointer functions. On the same dataset, See [41] proposed a pointer network and additionally used a loss term for the attention-coverage mechanism in the loss function of its model. Patel [37] studied abstractive and extractive content summarization strategies. Kejun [22] proposed an improved word vector generation technique and an abstractive automatic summarization model. Minaee [31] discussed more than 150 deep learning-based models for text classification.
This paper proposes an automatic text summarization optimization method. First, the initial summarization and the candidate action space required for reinforcement learning are generated by deploying a seq2seq model. The action candidate space is divided into two parts: one part is generated by the decoder of the seq2seq model, a process that can be regarded as abstractive summarization; the other part is generated by the attention mechanism of the seq2seq model, which can be seen as extractive summarization. Second, DQN [32] is used to learn a strategy that directly optimizes the initial summarization to obtain the maximum reward (the ROUGE score).


Modeling
This section introduces the automatic text summarization optimization model, which is based on DQN. The DQN is deployed on top of a recurrent neural network with gated recurrent units [7] (GRU-RNN) in an encoder-decoder architecture. First, the pretraining phase trains the parameters of the seq2seq model with the attention mechanism (the lower half of Figure 1) through maximum likelihood estimation (MLE) and generates the initial summarization after reaching convergence (i.e., the initial summarization in the figure). In addition, the candidate action space of the DQN (the action space in Figure 1) is generated by the attention mechanism and the decoder. Second, in the DQN iterative decoding stage, the parameters of the DQN are initialized with the pretrained parameter values. This phase is an iterative decoding process: the DQN learns a strategy, selects actions from the action space, and iteratively optimizes the initial summarization to obtain the maximum reward. The reward is the ROUGE score obtained from the similarity between the summarization generated by the DQN and the real summarization during the iterative process.


Automatic Summarization Modeling
Let the vocabulary be Ω, and let its size be |Ω| = V. Let x = [x_1, ..., x_m] denote an input sentence containing a sequence of m words, where each word x_i ∈ Ω. The task of automatic summarization is to generate a target sequence of N words y = [y_1, ..., y_N], with N < m, such that y = arg max_y P(y|x). The automatic summarization model can be represented as a function P(y|x) = P(y|x; θ) with parameter θ, where θ is trained by maximizing the conditional probability of the {sentence, abstract} pairs in the training set. More specifically, given the previously generated words, the trained model generates the next word of the abstract.
This paper uses an encoder-decoder framework to model the conditional probability in (1):

y = arg max_y P(y | x; θ).    (1)

Since (1) is difficult to solve directly, in practice we instantiate this target in a serialized way, turning the original objective function into

P(y | x; θ) = ∏_{i=1}^{N} P(y_i | y_1, ..., y_{i-1}, x; θ).    (2)
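Training by maximizing the factorized conditional probability in (2) amounts to a per-step negative log-likelihood (cross-entropy) loss with teacher forcing. The toy snippet below shows that computation on precomputed per-step output probabilities; shapes, vocabulary size, and variable names are illustrative, not from the paper.

```python
import torch

def mle_loss(step_probs, target_ids):
    """Negative log-likelihood of the reference abstract under the factorization
    P(y|x) = prod_i P(y_i | y_1..y_{i-1}, x): sum over steps of -log P(y_i | ...)."""
    picked = step_probs[torch.arange(len(target_ids)), target_ids]  # P(y_i | y_<i, x) at each step
    return -(torch.log(picked)).sum()

# Toy example: a 3-step abstract over a 5-word vocabulary.
step_probs = torch.tensor([[0.10, 0.60, 0.10, 0.10, 0.10],
                           [0.20, 0.20, 0.40, 0.10, 0.10],
                           [0.05, 0.05, 0.10, 0.70, 0.10]])
target_ids = torch.tensor([1, 2, 3])        # reference words y_1, y_2, y_3
print(mle_loss(step_probs, target_ids))     # -(log 0.6 + log 0.4 + log 0.7)
```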

Seq2seq Model
The encoder encodes the observed sample sequence X into a hidden variable Z, and the decoder decodes the hidden variable Z into an output target sequence Y. The traditional seq2seq model encodes the observation sequence, no matter how long it is, into a fixed-dimensional hidden variable Z; obviously, such an operation limits the ability of the seq2seq model. Therefore, Bahdanau [2] proposed the RNNsearch model: on the decoder side, a feed-forward network over the hidden layers adaptively calculates the weight of each word in the observation sequence X when generating the output target sequence Y.
This paper introduces the encoder-decoder framework to the automatic summarization problem, takes a purely data-driven approach, and trains the automatic summarization model end to end. As the basic model of this paper, it generates the initial summarization and the candidate action space of the DQN model.

Encoder:
The gated network unit can better capture long-term and short-term dependence. In this paper, a recurrent neural network with a gated recurrent unit (GRU [8]) is used as the basic module for constructing the document encoder. A GRU consists of an update gate u and a reset gate r:

u_t = σ(W_u x_t + U_u h_{t-1} + b_u),
r_t = σ(W_r x_t + U_r h_{t-1} + b_r),
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1}) + b_h),
h_t = (1 - u_t) ⊙ h_{t-1} + u_t ⊙ h̃_t,

where the matrices W, U and the vectors b are the parameters of the GRU; x_t represents the input word vector at time t; h_t denotes the hidden layer vector at time t; σ(·) and tanh(·) are the activation functions; and ⊙ is the element-wise (bitwise) multiplication operation.

[Figure 1. The optimization model for automatic text summarization.]
The forward GRU processes the forward sequence and produces a hidden layer vector h_t^f for each corresponding word vector, while the backward GRU processes the reverse sequence and produces the hidden layer vector h_t^b for each corresponding word vector.
The input sequence of the original text is used as the input of the encoder, and the hidden layer representation of each position is generated by the encoder. The semantic representation c of the input sequence can be directly assigned from the last hidden layer of the encoder, i.e., c = h_m, or it can be a linear transformation of the last hidden layer.
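A minimal PyTorch sketch of the bidirectional GRU encoder described above is given below: it returns the per-position hidden layer representations h_1, ..., h_m and takes the final hidden state as the semantic representation c. The layer sizes and the choice of concatenating the two directions are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    """Bidirectional GRU document encoder: word ids -> per-position states and summary vector c."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, m)
        emb = self.embed(token_ids)                  # (batch, m, emb_dim)
        states, last = self.gru(emb)                 # states: (batch, m, 2*hidden); last: (2, batch, hidden)
        # Concatenate the final forward and backward states as the semantic vector c (≈ h_m).
        c = torch.cat([last[0], last[1]], dim=-1)    # (batch, 2*hidden)
        return states, c

encoder = BiGRUEncoder(vocab_size=10000)
tokens = torch.randint(0, 10000, (2, 12))            # a batch of two 12-word "documents"
states, c = encoder(tokens)
print(states.shape, c.shape)                          # torch.Size([2, 12, 512]) torch.Size([2, 512])
```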

Decoder and Attention mechanism:
The semantic feature representation of the input sequence generated by the encoder is used as the initial input state of the decoder, which decodes it into the target text sequence, i.e., the summarization. The decoder is essentially a language model. This paper also uses a one-way (unidirectional) GRU, as shown in the lower right part of Figure 1. Because c is decoded by the decoder, it must contain all the information in the original sequence; additionally, in the process of generating the text sequence, each word is generated using the same semantic vector. Obviously, this method is too simple.
To solve the abovementioned problems, a feasible solution is to introduce an attention mechanism. The attention mechanism gives different attention weights to different input words at each decoding time when generating each word. The semantic representation at each time in the decoding process thus adaptively selects the most appropriate context information for the target y_i to be output at the current time. Specifically, we use a_ij to measure the correlation between the encoder's hidden layer representation h_j at time j and the decoding at time i. Finally, the context information c_i input to the decoder at time i is the weighted sum of the encoder hidden layers h_j at all times with the weights a_ij:
c_i = Σ_j a_ij h_j,    a_ij = exp(e_ij) / Σ_k exp(e_ik),

where e_ij is the alignment score between the decoder state at time i and the encoder hidden layer representation h_j, computed by a feed-forward network. Using the decoder with the attention mechanism, the model of equation (1) can be expressed as

P(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i),

where s_i is the decoder hidden layer state at time i and g(·) is a function that outputs the probability of the target vocabulary y_i. A softmax function is usually used to map a hidden layer vector into a probability distribution over the V categories (i.e., words) of the vocabulary:

P(y_i | y_1, ..., y_{i-1}, x) = softmax(w_z s_i),

where w_z represents the weight matrix.
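The sketch below computes the attention weights a_ij and the context vector c_i exactly as in the weighted-sum formulation above, using a simple additive (feed-forward) scoring function in the spirit of Bahdanau attention [2]; the particular score network (W_s, W_h, v) and the toy dimensions are our own assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states, W_s, W_h, v):
    """Additive attention: score each encoder state h_j against the decoder state,
    normalize the scores into weights a_ij, and return the context c_i = sum_j a_ij * h_j."""
    scores = np.array([v @ np.tanh(W_s @ decoder_state + W_h @ h_j) for h_j in encoder_states])
    a = softmax(scores)                             # attention weights a_ij over source positions
    c = (a[:, None] * encoder_states).sum(axis=0)   # context vector c_i
    return a, c

rng = np.random.default_rng(0)
H, D = 8, 8                                          # encoder/decoder state sizes (toy values)
enc = rng.normal(size=(5, H))                        # five encoder hidden states h_1..h_5
dec = rng.normal(size=D)                             # current decoder hidden state
W_s, W_h, v = rng.normal(size=(H, D)), rng.normal(size=(H, H)), rng.normal(size=H)
a, c = attention_context(dec, enc, W_s, W_h, v)
print(a.round(3), c.shape)                           # weights sum to 1; context has encoder-state size
```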

DQN Decoding
The four components of the reinforcement learning model are described in detail: states, actions, strategies, and rewards. The semantic features of the input sequence are generated automatically by deploying the recurrent neural network with gated units in the encoder-decoder architecture. The DQN directly uses these features to learn the Q-value function to estimate future rewards and maximizes long-term future rewards by optimizing the model parameters.

State: Given the text input sequence <x_1, ..., x_j, ..., x_m> and the target sequence <y_1, ..., y_j, ..., y_N> (the source text and target summarization in Figure 1), the Bi-GRU encoder encodes the entire input sequence into a fixed-dimensional semantic vector, which is decoded as the initial input vector of the decoder to generate the actual initial summarization sequence <ŷ_1, ..., ŷ_j, ..., ŷ_T>. In the decoding process, the decoder sequentially generates the hidden layer state representations. The text input sequence and the actual summarization <ŷ_1^i, ..., ŷ_T^i> are used as the DQN internal states. When i = 0, the initial summarization generated by the decoder consists of the word with the maximum output probability of the decoder at each time.

Action: In the decoding stage of the decoder, the hidden layer vector is mapped to a probability distribution over the V words of the vocabulary by a softmax function. At the same time, the decoder also produces the attention weight distribution over the original text sequence at each time. The k words with the maximum probabilities in the vocabulary distribution and the k original words with the maximum attention weights are selected as the candidate action space (of size 2k) for each time step of the DQN, as shown in Figure 1. The action candidate set generated by the seq2seq encoder-decoder corresponds to an "abstractive summarization" generation process, while the action candidate set generated at each time from the attention distribution over the original text can be regarded as an extractive summarization process.
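A small sketch of how the 2k-word candidate action space described above could be assembled at one decoding time step: k words come from the top of the decoder's vocabulary distribution (the "abstractive" part) and k source words come from the top of the attention distribution (the "extractive" part). The variable names and the absence of any de-duplication are our simplifications.

```python
import numpy as np

def candidate_actions(vocab_probs, attention_weights, source_words, id2word, k=3):
    """Build the 2k-word action space for one DQN time step."""
    # k most probable vocabulary words from the decoder output distribution ("abstractive" part).
    top_vocab_ids = np.argsort(-vocab_probs)[:k]
    abstractive = [id2word[i] for i in top_vocab_ids]
    # k source words with the largest attention weights ("extractive" part).
    top_src_pos = np.argsort(-attention_weights)[:k]
    extractive = [source_words[p] for p in top_src_pos]
    return abstractive + extractive          # candidate action space of size 2k

vocab_probs = np.array([0.02, 0.40, 0.10, 0.30, 0.18])
attention_weights = np.array([0.05, 0.50, 0.20, 0.25])
id2word = ["<unk>", "police", "arrest", "suspect", "city"]
source_words = ["officers", "detained", "a", "man"]
print(candidate_actions(vocab_probs, attention_weights, source_words, id2word, k=2))
# e.g. ['police', 'suspect', 'detained', 'man']
```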
Strategy: Given the current state, i.e., the summarization sequence of the current iteration step, and the action space at each time, the DQN estimates the value function (Q-value) of the current state and selects the action with the maximum Q-value at each time as the output. More specifically, in iteration step i at time t, the DQN selects ŷ_t^i to replace the element of the state at time t; therefore, this process produces a change of the state, i.e., a new state.

Reward: In iteration step i, the ROUGE score of the current summarization sequence <ŷ_1, ..., ŷ_j, ..., ŷ_T> against the reference summarization <y_1, ..., y_j, ..., y_N> is set to r_c. The DQN selects actions based on the current policy; this process generates a new summarization sequence, which is the optimized summarization. The ROUGE score of this abstract against the real abstract is r_n, so the reward for the DQN's current action is reward = r_n - r_c.
Parameter training: The DQN parameter training is a process of minimizing the loss function. In the actual training process, two networks are used to improve stability [21]: a Q-value estimation (original) network with parameter θ, and a target network with parameter θ̂ that generates the target q-value during the Q-value learning update process. The loss is

L(θ) = E[(r + λ max_{a'} Q(s_{i+1}, a'; θ̂) - Q(s_i, a_i; θ))^2],

where r + λ max_{a'} Q(s_{i+1}, a'; θ̂) is the target q-value, generated by the target network with the parameter θ̂ according to the next state s_{i+1}. This process can be seen as training the DQN to predict future rewards.
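The loss above is the standard DQN temporal-difference objective. The sketch below shows that computation with an online network and a periodically synchronized target network; it is a generic DQN update, not the paper's text-specific state and action encoding, and the network architecture, discount λ, and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))       # Q(s, ·; θ)
target_net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))  # Q(s, ·; θ̂)
target_net.load_state_dict(q_net.state_dict())                               # θ̂ = θ at the start
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
lam = 0.95                                                                    # discount factor λ

def dqn_update(state, action, reward, next_state):
    """One minimization step of (r + λ max_a' Q(s_{i+1}, a'; θ̂) - Q(s_i, a_i; θ))^2."""
    q_sa = q_net(state)[action]                                  # Q(s_i, a_i; θ)
    with torch.no_grad():
        target = reward + lam * target_net(next_state).max()     # target q-value from θ̂
    loss = (q_sa - target) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy transition (16-dim state features, 4 candidate actions):
s, s_next = torch.randn(16), torch.randn(16)
print(dqn_update(s, action=2, reward=0.1, next_state=s_next))
# target_net.load_state_dict(q_net.state_dict())   # θ̂ = θ, performed every C update steps
```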
In the training process, every C training updates, the parameter θ of the original network is assigned to the parameters of the target network, i.e., θ̂ = θ. Additionally, experience pool technology [25] is used in training to store the experience of each iteration step. The DQN selects the action with the maximum Q-value to maximize the expected future reward. In the initial stage of training, the DQN randomly selects actions with probability ε [20] to ensure sufficient exploration of the state space. Its specific parameter training process is given in Algorithm 1 (a code sketch follows the listed steps).

Algorithm 1 (DQN parameter training). The original sequence x is input to the pretrained seq2seq model to generate an initial summarization sequence y^0 and an action candidate space A; then, in each iteration:
7. if a randomly generated value is less than ε
8. Randomly select an action a_i (i.e., a word) from the action space A at time t;
9. else
10. Calculate the Q-value function Q(s, a; θ) and let action a_i = arg max_a Q(s, a; θ);
11. end if
12. Replace the element (word) generated at the corresponding time of the summarization sequence with the action a_i, producing a new summarization sequence y^{i+1};
13. Calculate the similarity between the new summarization sequence y^{i+1} and the real summarization sequence y and return the reward r_i;
14. Store the state experience in the experience pool.
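To illustrate how these steps fit together, here is a small sketch of one optimization episode: ε-greedy selection over the candidate words, a reward equal to the ROUGE difference r_n − r_c, and storage of transitions in an experience pool. The callables q_values and reward_fn, and the position-by-position sweep, are simplifying assumptions standing in for the DQN and the ROUGE scorer (the paper instead runs 2l iterations and trains the network from the pool using the loss sketched above).

```python
import random
from collections import deque

def optimize_summary(q_values, reward_fn, initial_summary, action_space,
                     epsilon=0.1, pool_size=200_000):
    """Illustrative sketch of the Algorithm 1 loop.

    q_values(summary, t) -> list of scores, one per candidate word in action_space[t]
    reward_fn(summary)   -> ROUGE score of the summary against the reference
    action_space         -> per-position list of candidate words (top-k vocabulary
                            and attention words); all names are assumptions."""
    experience_pool = deque(maxlen=pool_size)   # experience replay buffer [25]
    summary = list(initial_summary)             # y^0 from the pretrained seq2seq model
    current_r = reward_fn(summary)              # r_c, the current ROUGE score

    for t in range(len(summary)):
        # epsilon-greedy choice over the candidate action space A at time t
        if random.random() < epsilon:
            a = random.randrange(len(action_space[t]))
        else:
            scores = q_values(summary, t)
            a = max(range(len(scores)), key=scores.__getitem__)

        # replace the word at position t; the reward is the ROUGE difference r_n - r_c
        new_summary = summary[:t] + [action_space[t][a]] + summary[t + 1:]
        new_r = reward_fn(new_summary)
        experience_pool.append((tuple(summary), a, new_r - current_r, tuple(new_summary)))
        summary, current_r = new_summary, new_r

    return summary, experience_pool
```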

Experiments
This section describes the evaluation metrics, the datasets and the comparison algorithms for automatic summarization generation, and reports tests of our method on two datasets (LCSTS and CNN/DailyMail).

Evaluation
In this paper, the ROUGE [23] evaluation system is used to automatically evaluate summarizations. ROUGE evaluates the quality of the summarization generated by the summarization system against the reference summarization. ROUGE includes a series of evaluation methods; the summarization task usually uses ROUGE-N and ROUGE-L.

ROUGE-N measures the n-gram overlap between the generated summarization and the reference summarization:

ROUGE-N = Σ_{S ∈ {Ref}} Σ_{gram_n ∈ S} Count_match(gram_n) / Σ_{S ∈ {Ref}} Σ_{gram_n ∈ S} Count(gram_n),

where n is the length of the n-gram, which takes 1 and 2 in this evaluation, Count(gram_n) is the number of n-grams in the reference summarization, and Count_match(gram_n) is the number of n-grams co-occurring in the reference and generated summarizations. The formulas for calculating the accuracy P, recall rate R and F-measure of ROUGE-L are as follows:

P_lcs = LCS(X, Y) / n,
R_lcs = LCS(X, Y) / m,
F_lcs = (1 + β²) · P_lcs · R_lcs / (R_lcs + β² · P_lcs),

where X is a reference summarization of length m, Y is the summarization generated by the system, whose length is n, LCS(·) measures the length of the longest common subsequence, and β is a hyperparameter.
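The ROUGE-L quantities can be computed directly from the LCS length. The sketch below mirrors the P, R and F definitions above; the default β and the token-list input are illustrative choices, not the evaluation toolkit used in the experiments.

```python
def lcs_length(reference, candidate):
    """Dynamic-programming length of the longest common subsequence of two token lists."""
    m, n = len(reference), len(candidate)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if reference[i - 1] == candidate[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def rouge_l(reference, candidate, beta=1.2):
    """ROUGE-L precision, recall and F-measure for tokenized summaries (beta is illustrative)."""
    lcs = lcs_length(reference, candidate)
    m, n = len(reference), len(candidate)
    p = lcs / n if n else 0.0        # P_lcs = LCS(X, Y) / n
    r = lcs / m if m else 0.0        # R_lcs = LCS(X, Y) / m
    if p == 0.0 and r == 0.0:
        return p, r, 0.0
    f = (1 + beta ** 2) * p * r / (r + beta ** 2 * p)
    return p, r, f
```

In practice, a standard ROUGE toolkit would be used so that the reported scores remain comparable with prior work.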


Dataset
LCSTS dataset [19]: LCSTS is a Chinese short text summarization dataset. Each piece of data was collected from Sina Weibo in the form of a {short news, summarization} pair. The dataset includes PART I, PART II and PART III; the specific statistical information is shown in Table 1. In this paper, we use the same settings as in [14].

CNN/DailyMail dataset: The CNN/DailyMail dataset is an artificially constructed summarization dataset built by Hermann [18] from news articles. This dataset can be obtained directly from GitHub. This paper uses the Stanford CoreNLP tool [28] to preprocess the data. We set the maximum lengths of the abstract text and the corresponding summarization to 400 and 100, respectively.

Experiment Setup
The encoder is built using a single-layer bidirectional GRU (Bi-GRU) as the base module. In addition, a backward GRU and a forward GRU of the decoder are used together to form our DQN. The dimension of the GRU hidden layer is set to 256. Model network parameters are initialized using a uniform distribution over the interval [-0.1, 0.1]. The learning rate and batch size are set to 0.05 and 32, respectively. We first pretrain the base model on a given dataset until it reaches a convergence state and then train the DQN. In the initial stage of DQN training, the state space exploration parameter ε is set to 1.0 and is gradually reduced to 0.1 over 1,000 steps. Given that the length of the output sequence is l, the number of DQN iterations is set to 2l [15]. The iteration termination threshold is set to 0.9. The DQN experience pool size is set to 200,000, the reward discount factor is set to 0.9, and the action space size is set to 20; that is, the 10 most likely words from the vocabulary output distribution and the 10 most likely words from the attention distribution are chosen.
In the LCSTS dataset, the 50,000 most frequent words in the training set were selected to form the vocabulary; other words were uniformly represented by the <UNK> tag, and the word vector dimension was set to 200. On the CNN/DailyMail dataset, the 25,000 most frequent words were selected to form the vocabulary, and other words were uniformly represented by the <UNK> tag; the word vector dimension was set to 128.
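As a small illustration of the preprocessing described above, the sketch below builds a frequency-ranked vocabulary of a fixed size, maps out-of-vocabulary tokens to the <UNK> tag, and truncates articles and summaries to the stated maximum lengths. The function names and corpus format are assumptions for illustration, not the paper's pipeline.

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(tokenized_texts, max_size=50_000):
    """Keep the max_size most frequent tokens (50,000 for LCSTS, 25,000 for CNN/DailyMail)."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    return {tok for tok, _ in counts.most_common(max_size)}

def preprocess(article_tokens, summary_tokens, vocab,
               max_article_len=400, max_summary_len=100):
    """Truncate to the maximum lengths used for CNN/DailyMail and map OOV tokens to <UNK>."""
    article = [tok if tok in vocab else UNK for tok in article_tokens[:max_article_len]]
    summary = [tok if tok in vocab else UNK for tok in summary_tokens[:max_summary_len]]
    return article, summary
```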

Experimental Results and Analysis
1) Evaluation results on the LCSTS dataset: Except for the "Bi-GRU + Distraction" method, our method and the comparison methods take the word-segmented units as word vectors input to the model. The experimental results are shown in Table 2, where "Bi-GRU" represents the basic model of this paper; an attention mechanism is used in the decoding process, and the final abstract is generated by beam search decoding with a beam width of 10. "DQN" represents the optimization model; that is, the results of the basic model are optimized by the deep Q network. On this dataset, the comparison methods we use are as follows. RNN: the benchmark model first proposed by Hu [19] on the LCSTS dataset, which only uses a recurrent neural network to implement the encoder and decoder. RNN context: an improved model also proposed by Hu [19]; the difference is that in the decoding process, all hidden layer states of the encoder are input to the decoder as context.
COPYNET: a replication (copy) mechanism proposed by Gu [14] and incorporated into sequence-to-sequence learning. COPYNET combines the copy mechanism with the sequence generation process of the traditional decoder so that the decoder can directly select the corresponding subsequences of the input. Bi-GRU + Distraction: a novel attention mechanism method proposed by Chen [4], which differs from the above methods in that its test results take characters, that is, individual Chinese characters, as the basic unit. The basic model "Bi-GRU" used in this paper and the "RNN context" method used by Hu [19] are both built on GRUs. Unlike the latter, our basic model constructs a bidirectional encoder and also adds an attention mechanism in the decoding process; therefore, our base model performs better than the "RNN context" method. The "COPYNET" method and the "Bi-GRU + Distraction" method can be seen as introducing a replication mechanism and a new attention mechanism, respectively, on top of our basic model. Our optimization model "DQN" introduces a reinforcement learning optimization process to optimize the initial summarization of the basic model. From the experimental results, we can conclude that our optimization model achieves the best results.
2) Evaluation results on the CNN/DailyMail dataset: As seen in Table 3, in addition to some of the methods used on the LCSTS dataset, the comparison methods include TextRank [30], LexRank [11], Luhn [26], Edmundson [10], LSA [44], Sum-basic [15] and KL-sum [16]. The experimental results of these methods can be obtained through the open-source tool SUMY. From the experimental results, we can conclude that our optimization model achieves the best results.
On both datasets, the improvement on ROUGE-1 was significantly larger than that on ROUGE-2 and ROUGE-L. Because our DQN model selects a single action (word) from the action candidate space at each step of the iterative optimization, it is more likely to be rewarded for unigram matches, so the ROUGE-1 score is more readily improved.

Conclusions
This paper proposes a text summarization optimization method based on deep reinforcement learning. An attention mechanism-based seq2seq model is used to generate the initial summarization and the action candidate space required for reinforcement learning, and then a deep Q network is used to optimize the initial summarization over the action candidate space. The experimental results show that the optimization method clearly improves the summarization quality. Since the results generated by the base model limit the final performance of our optimization method, in future work we will consider applying reinforcement learning directly to optimize the parameters of the base model to obtain better results.