Deep Learning Methods in Short-Term Traffic Prediction: A Survey

spread concern and research. Currently, the most widely used models for short-term traffic prediction are deep learning models. This survey reviews the relevant literature on using deep learning models to solve the short-term traffic prediction problem, drawn from the top transportation journals of recent years, and summarizes the commonly used traffic datasets, the mainstream deep learning models and their applications in this field. Finally, the challenges and future development trends of deep learning models applied in this field are discussed.


Introduction
Due to population growth and the development of vehicular transportation and urbanization, the contradiction between urban traffic demand and supply is becoming increasingly acute, and the traffic congestion caused by this has become a serious problem plaguing many cities around the world [48]. Traffic congestion not only causes a large amount of time waste and hinders the development of urban economy, but also causes a sharp increase in vehicle exhaust emissions, which has a serious impact on the surrounding environment.
As an important component of the Intelligent Transportation System (ITS) [58], traffic prediction is currently one of the effective ways to alleviate traffic congestion. Traffic prediction uses historical traffic data to predict future traffic data. Accurate traffic prediction can provide an important decision-making basis for the relevant departments to implement traffic control. At the same time, it can also provide real-time road condition information for urban residents, help them avoid congested road sections, improve their travel comfort and alleviate urban traffic pressure. In addition, compared with road restriction and road reconstruction, traffic prediction has the advantages of low cost and easy implementation [75]. Meanwhile, in recent years, with the maturity and development of the Internet of Vehicles (IoV) [29] and ITS, the real-time requirements for traffic prediction have been gradually increasing. In this context, short-term traffic prediction, which has a shorter prediction time span and stronger real-time performance, has received extensive attention and research.
Due to the development of traffic sensors and Global Positioning System (GPS) technology, the volume of traffic data collected in a short period of time has greatly increased. Faced with the current large volumes of multi-source traffic data, traditional short-term traffic prediction methods struggle to learn the complex characteristics of the data. In this regard, deep learning models, which can be combined flexibly and are highly effective at fitting the nonlinear characteristics of massive data [51], have been widely used and have become the mainstream models in this field. However, as the types of deep learning models applied to this field continue to increase and their structures become more and more complex, constructing an appropriate deep learning model for short-term traffic prediction in a given scenario is currently a difficulty in this field.
The main audience of this paper is researchers who are interested in how to better apply deep learning models to the short-term traffic prediction problem. To give readers a comprehensive and thorough understanding of this topic, based on 32 relevant studies, mostly from the top transportation journals from 2015 to 2021, we provide a detailed summary of the commonly used traffic datasets, the mainstream deep learning models and their applications in this field. In addition, we discuss the challenges and future development trends of deep learning models applied to short-term traffic prediction.
The main contributions of this paper are as follows:
- Summarize and introduce the most commonly used traffic datasets in short-term traffic prediction.
- Summarize and introduce the deep learning models commonly used in short-term traffic prediction and their advantages and disadvantages in this field.
- Give a detailed and comprehensive introduction to the application of currently used deep learning models in short-term traffic prediction.
- Summarize the development challenges and future development trends of deep learning models in short-term traffic prediction.

The rest of this paper is structured as follows: Section 2 introduces the problem definition, development history and applications of short-term traffic prediction. Section 3 introduces the commonly used traffic datasets, the mainstream deep learning models and their applications in short-term traffic prediction. Section 4 introduces the development challenges and trends of deep learning models in the field of short-term traffic prediction, and Section 5 summarizes this paper.

Problem Definition, Development History and Applications
In this section, we first introduce the definition of the short-term traffic prediction problem and then outline its development history, showing along the way the advantages of deep learning models compared with other methods. Finally, we introduce the main current applications of short-term traffic prediction.

Problem Definition
Traffic prediction refers to the use of historical traffic data to predict future traffic data. Traffic data can be traffic flow data, traffic speed data, travel time data, traffic state data, and other traffic-related data. Traffic prediction with a forecast time span of less than 15 minutes is called short-term traffic prediction [20]. Let the traffic time series data be $X$. According to the characteristics of short-term traffic prediction, it can be defined as Equation (1):
$$\hat{Y}_t = f(X_h) \quad (1)$$
In the above formula, the function $f$ represents the prediction method, and $X_h$ represents a period of historical traffic data that is used to predict the traffic data in time slice $t$. For example, if $K$ time slices are used to predict the traffic data in time slice $t$, then $X_h = [X_{t-K}, \ldots, X_{t-1}]$. $\hat{Y}_t$ represents the predicted value in time slice $t$.
For short-term traffic prediction based on mathematical statistics, Equation (1) is a sufficient definition. However, for short-term traffic prediction based on machine learning or deep learning, after the predicted value is obtained, the prediction results are fed back and the model is trained repeatedly to minimize the error. Therefore, in addition to Equation (1), the mathematical formulation of such short-term traffic prediction should include Equation (2):
$$\underset{\mathrm{opt}}{\min}\ \mathrm{cost}(\hat{Y}_t, Y_t) \quad (2)$$
In the above formula, $\mathrm{cost}$ is the loss function, $\hat{Y}_t$ is the predicted value, $Y_t$ is the real value, and $\mathrm{opt}$ is the optimizer used in the process of minimizing the loss function. A machine learning or deep learning model uses the selected optimizer to update the model parameters so as to minimize the loss function and thereby improve the prediction accuracy.
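The two-part formulation above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the history window is taken to be the K slices immediately preceding time slice t, the prediction method f is a simple weighted sum, the cost is the mean squared error, and the optimizer is plain full-batch gradient descent. The function names and the normalized flow values are hypothetical and do not come from the surveyed literature.

```python
def make_windows(series, k):
    """Pair each target X_t with the K slices that precede it (assumed form of X_h)."""
    return [(series[t - k:t], series[t]) for t in range(k, len(series))]


def predict(weights, window):
    """A deliberately simple f: a weighted sum of the K historical slices."""
    return sum(w * x for w, x in zip(weights, window))


def mse(weights, samples):
    """cost(Y_hat, Y): mean squared error over all (window, target) pairs."""
    return sum((predict(weights, xh) - y) ** 2 for xh, y in samples) / len(samples)


def gradient_descent_fit(samples, k, lr=0.01, epochs=200):
    """opt: full-batch gradient descent on the MSE, starting from a moving average."""
    weights = [1.0 / k] * k
    n = len(samples)
    for _ in range(epochs):
        grads = [0.0] * k
        for xh, y in samples:
            err = predict(weights, xh) - y
            for i, x in enumerate(xh):
                grads[i] += 2.0 * err * x / n
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


# Hypothetical normalized flow readings, one value per time slice.
flow = [0.30, 0.35, 0.50, 0.65, 0.70, 0.60, 0.45, 0.35, 0.30, 0.40, 0.55, 0.70]
samples = make_windows(flow, k=3)
initial = [1.0 / 3] * 3
fitted = gradient_descent_fit(samples, k=3)
print(mse(initial, samples), mse(fitted, samples))
```

A statistical method would stop after `predict`; the learning-based methods discussed in this survey add the feedback loop in `gradient_descent_fit`, with a far more expressive f.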

Development History
Short-term traffic prediction has undergone more than 40 years of development [80]. There are three main types of models used in short-term traffic prediction: mathematical statistics models, machine learning models and deep learning models. Mathematical statistics models predetermine the distribution of the data according to a theoretical hypothesis. In practice, however, these prior conditions are difficult to meet, which limits the development of such models; for nonlinear data in particular, it is difficult for them to obtain effective results. They mainly include the Grey model (GM(1,1)) [63], the Kalman filter model [22] and the Autoregressive Integrated Moving Average model (ARIMA) [84,44,53].
Because machine learning models can learn the nonlinear patterns of traffic data well, they achieve higher prediction accuracy than mathematical statistics models. Machine learning models used in short-term traffic prediction mainly include the K-nearest neighbors regression model [88], the support vector regression model (SVR) [35] and the neural network (NN) [79,94]. Although machine learning models are suitable for learning the patterns of nonlinear data, they still have difficulty dealing with large and complex data relationships. At the same time, the feature engineering [65] applied to the data is very complicated and requires a lot of effort.
Deep learning models are composed of multilayer neural networks. Their complex structures and numerous parameters enable them to obtain the best prediction results when extracting the characteristics of large and complex data. Moreover, due to the high complexity of deep learning models, good prediction results can be obtained without feature engineering, which saves a lot of manual work. With the increase in traffic data collection sources and data volume, deep learning models have become the mainstream modeling solution for short-term traffic prediction. The deep learning models mainly used in this field are the convolutional neural network (CNN) [18][19] and the recurrent neural network (RNN) [20][21]. Although deep learning models can learn complex nonlinear characteristics of data, they still have some drawbacks: they require more training data and training time, and they generally have lower model interpretability.

Applications
At present, short-term traffic prediction is mainly applied in ITS and IoV systems to provide real-time and accurate information on future urban road conditions, which assists the normal operation of the relevant modules. In ITS, the system performs intelligent control of traffic signals based on short-term traffic prediction, such as switching the traffic phase and adjusting the length of the traffic signal, so as to carry out macro-control of urban road sections and alleviate traffic congestion [40,1]. In addition, based on short-term traffic prediction, the Advanced Traveler Information Systems (ATIS) of ITS provide travelers with various information such as travel mode selection, route selection and departure time selection, so as to help travelers make better travel plans, improve their travel comfort and relieve road traffic pressure [74]. In the IoV system, the system provides real-time and intelligent driving route planning for drivers based on short-term traffic prediction, enabling drivers to avoid congested road sections, improve driving comfort and effectively alleviate traffic congestion [83].
In addition to assisting the intelligent decision-making of systems, short-term traffic prediction also provides an important decision-making basis for traffic management departments to formulate effective traffic management measures against large-scale urban traffic congestion [7].

Deep Learning Models Used in Short-term Traffic Prediction
In this section, we first introduce a table (Table 1) that covers the literature we collected in this survey. Based on this table, we introduce the traffic datasets commonly used in short-term traffic prediction. Then we introduce the current mainstream deep learning models for short-term traffic prediction, together with their advantages and disadvantages in this field. Finally, we introduce the applications of these models in this field.
Table 1 summarizes the 32 references from 2015 to 2021 that we collected in this survey, all of which use deep learning models to solve the short-term traffic prediction problem. Most of them are from the top journals of transportation: IEEE Trans. on Intelligent Transportation Systems and Transportation Research Part C: Emerging Technologies. In this table, we record the year of publication, the research team, the types of traffic data predicted, the deep learning models used, the traffic datasets used and whether the temporal or spatial characteristics of traffic data are considered. Based on Table 1, we summarize the three most used datasets in the field of short-term traffic prediction: PeMS, Beijing floating car data and Beijing ring road. Table 2 summarizes the collection equipment, the traffic parameters, and the number of times each of these datasets is used in the literature.

Traffic Datasets
As shown in Table 2, PeMS is the most frequently used traffic dataset (11 out of 32). Our investigation found that this dataset is a recognized baseline dataset in short-term traffic prediction because of its openness, rich data volume and high data quality. PeMS is collected by more than 39,000 detectors deployed on freeways in almost all major metropolitan areas of California. PeMS includes collection time, road section flow, speed and road occupancy. Among them, flow and speed are the traffic parameters that researchers mainly study. PeMS has rich historical traffic data, spanning more than ten years. At the same time, researchers can choose different data aggregation granularities when downloading the data, such as 5 minutes, 1 hour or 1 day, which reduces their data processing burden.
In addition to PeMS, it can be seen from Table 2 that the traffic datasets collected in Beijing are the most widely used in the literature. Among them, 5 studies use Beijing floating car data and 4 studies use Beijing ring road. However, each of these two names is only a collective term for one type of dataset, not a unified dataset. Beijing floating car data refers to floating car datasets whose trajectories are located in Beijing. They are collected by on-board GPS devices, and their basic parameters are collection time, the position of the floating car (longitude and latitude) and vehicle speed. Beijing ring road refers to traffic datasets collected by detectors deployed on the Beijing ring roads, and their basic parameters are collection time, road section flow and speed.
Because each is only a collective term for one type of dataset, Beijing floating car data and Beijing ring road do not have a unified research area or time span. In addition, most of these datasets are private, which makes them difficult for other researchers to obtain.
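To make the notion of aggregation granularity concrete, the following minimal sketch rolls 5-minute detector readings up into coarser bins. It is not the procedure PeMS itself applies; the function names and sample values are hypothetical, and it assumes that flow (a vehicle count) is summed while speed is averaged over each bin.

```python
def aggregate_flow(counts, slices_per_bin=12):
    """Sum consecutive fine-grained vehicle counts (12 x 5 min = 1 hour).

    An incomplete trailing bin is dropped.
    """
    full = len(counts) - len(counts) % slices_per_bin
    return [sum(counts[i:i + slices_per_bin]) for i in range(0, full, slices_per_bin)]


def aggregate_speed(speeds, slices_per_bin=12):
    """Average consecutive fine-grained speed readings over each bin."""
    full = len(speeds) - len(speeds) % slices_per_bin
    return [sum(speeds[i:i + slices_per_bin]) / slices_per_bin
            for i in range(0, full, slices_per_bin)]


# Hypothetical 5-minute readings covering two hours.
flow_5min = [10] * 12 + [20] * 12
speed_5min = [60.0] * 12 + [30.0] * 12
print(aggregate_flow(flow_5min))    # hourly vehicle counts
print(aggregate_speed(speed_5min))  # hourly mean speeds
```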
In summary, PeMS collected by the road section detectors is currently the most used traffic dataset in short-term traffic prediction, and it is also the baseline dataset in this field. However, for the floating car data collected by the on-board GPS device, there is currently no unified, open, and highly available baseline dataset in this field.

Models
From Table 1, we can see that RNN and CNN are currently the most widely used models in this field, appearing in almost every study, followed by the deep belief network (DBN), the stacked autoencoder (SAE), and the attention mechanism module (Attention). Based on the models used in the literature, we have drawn a model taxonomy figure (Figure 1).

CNN
CNN was first proposed and applied in speech recognition [81], and was then widely used in image recognition and natural language processing [25,46]. A CNN is composed of convolution layers and pooling layers (see Figure 2). A convolution layer uses convolution operations to extract the features of the input data. A pooling layer is connected after a convolution layer and performs feature extraction and dimensionality reduction.
In short-term traffic prediction, CNN is mainly used to extract the spatial characteristics, but it can also learn the temporal characteristics, depending on the organization of the traffic data. Currently, there are four types of CNNs used in this field: 1D CNN, 2D CNN, 3D CNN and GCN [41]. The difference between them lies in the convolution used: 1D CNN uses 1D convolution, 2D CNN uses 2D convolution, 3D CNN uses 3D convolution, and GCN uses graph convolution. In the short-term traffic prediction problem, they are applicable to different traffic data formats: 1D CNN and 2D CNN to matrix traffic data, 3D CNN to raster traffic data, and GCN to graph structure traffic data.

Table 2
| Dataset | Collection equipment | Traffic parameters | Times used |
| PeMS | Detector | flow, speed, occupancy | 11 |
| Beijing floating car data | GPS | speed, location | 5 |
| Beijing ring road | Detector | flow, speed | 4 |

Figure 1
The taxonomy figure of deep learning models currently applied in the field of short-term traffic prediction

Figure 2
An example of CNN

RNN
RNN has been widely used in popular research fields such as natural language processing [4,38,45] and computer vision [26,15] because of its ability to memorize input sequences. The structure of the original RNN is shown in Figure 3.


Figure 3
The structure of RNN, X: input data; H: hidden layer; O: output data

The traditional RNN suffers from vanishing and exploding gradients when dealing with long sequence data [42,27]. The later long short-term memory (LSTM) [28], which builds on the RNN, effectively addressed this problem. As shown in Figure 4, LSTM has three gating units (input gate, forget gate and output gate) and maintains a cell state that can remember long-term data dependence. This structural feature enables LSTM to learn the long-term characteristics of data well, thereby solving the problems of the original RNN.
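The gating structure described above can be illustrated with a single LSTM time step. The sketch below uses the commonly cited LSTM update equations with a scalar input and a scalar hidden state for readability; the parameter values and the short speed sequence are hypothetical, and real models use weight matrices learned during training.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step for a scalar input and a scalar hidden state.

    params maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre_activation(name):
        w, u, b = params[name]
        return w * x_t + u * h_prev + b

    i = sigmoid(pre_activation("i"))    # input gate: how much new content to write
    f = sigmoid(pre_activation("f"))    # forget gate: how much old cell state to keep
    o = sigmoid(pre_activation("o"))    # output gate: how much cell state to expose
    g = math.tanh(pre_activation("g"))  # candidate cell content
    c_t = f * c_prev + i * g            # cell state carries the long-term memory
    h_t = o * math.tanh(c_t)            # hidden state passed to the next step
    return h_t, c_t


# Hypothetical parameters; in practice they are learned.
params = {"i": (0.5, 0.1, 0.0), "f": (0.4, 0.2, 1.0),
          "o": (0.3, 0.1, 0.0), "g": (0.8, 0.3, 0.0)}
h, c = 0.0, 0.0
for x in [0.30, 0.35, 0.50, 0.65]:  # normalized speeds over four time slices
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

Because the cell state is updated additively (`f * c_prev + i * g`), information can persist over many steps when the forget gate stays close to 1, which is the property that mitigates the vanishing gradient problem.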

Figure 4
The structure of LSTM unit with two time steps (t-1, t), i: input gate, f: forget gate, o: output gate, C: cell state

The RNNs currently used in short-term traffic prediction mainly include LSTM, gated recurrent unit (GRU), bidirectional long short-term memory (Bi-LSTM), convolutional gated recurrent unit (ConvGRU), convolutional long short-term memory (ConvLSTM) and graph convolutional gated recurrent unit (GCGRU). GRU [10] is a simplified version of LSTM with only two gating units. Bi-LSTM [71] is composed of two LSTM units running in opposite directions, so it can learn the temporal correlation in both the forward and backward directions of time series data. ConvGRU [73] and ConvLSTM [72] are obtained by replacing the matrix operations in GRU and LSTM, respectively, with 2D convolution operations, and GCGRU [50] is obtained by replacing the matrix operations in GRU with graph convolution operations. These RNNs combined with convolution operations can extract the temporal and spatial characteristics of traffic data simultaneously.

DBN and SAE
Both DBN [64] and SAE [6] are built by stacking shallow neural network models: DBN stacks multiple restricted Boltzmann machines (RBM) [70] (Figure 5 (b)), and SAE stacks multiple autoencoders (AE) [47] (Figure 6 (b)). RBM is a two-layer neural network model composed of a visible layer and a hidden layer (Figure 5 (a)). The training goal of RBM is to maximize the product of the probabilities of the input data, defined through the energy function [5]. AE is a three-layer neural network model composed of an input layer, a hidden layer and an output layer (Figure 6 (a)). The input data is also the learning target, and the hidden layer, which learns the features of the data, is the real output. Its purpose is to obtain more detailed characteristics of the input data. According to our investigation, compared with other models, DBN and SAE are commonly used for short-term traffic prediction in the IoV environment. We believe that this is related to their characteristics: fewer model parameters, a simple structure, low training cost and good performance.

Attention Mechanism Module
The attention mechanism originally refers to the phenomenon, studied in cognitive science, that humans selectively pay attention to the key content among large amounts of data. The attention mechanism module imitates this behavior and pays different degrees of attention to the data according to the importance of the information, which improves how well the model fits a large amount of data and, in turn, its prediction accuracy.
The main type of attention mechanism used in short-term traffic prediction is the soft attention mechanism [67], which attends to all input data and assigns each input a weight representing the degree of attention. Figure 7 shows an example based on the soft attention mechanism. In this figure, the attention weight $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_n]$ is obtained through the softmax layer, and each $\alpha_i$ lies in (0,1). The larger the value of $\alpha_i$, the more attention the model pays to the corresponding $x_i$. Finally, $\alpha$ and the input data $X$ are multiplied to obtain the output value $y$. In short-term traffic prediction, the attention mechanism is often used as an auxiliary module that plays this role and thereby improves the prediction accuracy of the model.
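The soft attention computation described above can be sketched as follows. This is a minimal illustration rather than the module of any particular cited work: the relevance scores are hypothetical stand-ins for the output of a learned scoring layer, and the product of the weights and the input is taken to be a weighted sum.

```python
import math


def softmax(scores):
    """Turn raw scores into weights in (0, 1) that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(inputs, scores):
    """Weight every input by its attention weight and sum the results."""
    alpha = softmax(scores)
    y = sum(a * x for a, x in zip(alpha, inputs))
    return alpha, y


# Hypothetical speeds from four time slices and their relevance scores.
x = [52.0, 48.0, 30.0, 45.0]
scores = [0.1, 0.2, 2.0, 0.4]
alpha, y = soft_attention(x, scores)
print(alpha, y)
```

Here the third input receives the largest weight, so the output `y` is pulled toward it. With equal scores the weights become uniform and the output reduces to the plain average of the inputs.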

Advantages and Disadvantages of These Models for Short-term Traffic Prediction
In this subsection, we summarize the advantages and disadvantages of the above-mentioned deep learning models in solving the short-term traffic prediction problem. The conclusions are shown in Table 3. In short-term traffic prediction, CNN can effectively capture the spatial characteristics of traffic data. However, because it computes in parallel and lacks memory units, it is difficult for CNN to capture the long-term temporal characteristics of traffic data. An RNN without convolution operations can only capture the temporal characteristics of traffic data, whereas an RNN with convolution operations can capture the temporal and spatial characteristics simultaneously. However, for an RNN combined with convolution operations, the spatiotemporal characteristics may not be captured in sufficient detail.
Both DBN and SAE can effectively capture the temporal characteristics of traffic data, and due to their simple structure and low parameter complexity, their training costs are generally lower than those of other models. However, because they can only receive one-dimensional input data, they cannot effectively capture the spatial characteristics of traffic data. The advantage of the attention mechanism module is that it directs the model toward the important characteristics of traffic data, which improves the prediction accuracy. Its disadvantage is that it cannot capture the temporal or spatial characteristics of traffic data by itself, so it is unable to complete the task of short-term traffic prediction independently.

CNN
In the short-term traffic prediction problem, the main application of CNN is to extract the spatial relationship of traffic data, and the main formats of traffic data it faces are matrix data, graph structure data and grid map data. For traffic data in different formats, the CNN used and its role are correspondingly different.

CNN for Matrix Traffic Data
For matrix traffic data, the main CNNs used are 1D CNN and 2D CNN, which extract the temporal or spatial characteristics. Y. Liu et al. [55] used 1D CNN to extract the temporal correlation of traffic time series data (1D matrix traffic data) from a single detector. Wu et al. [85] and Zheng et al. [93] used the same modeling method: the traffic data of different locations are organized into a two-dimensional matrix in which each row represents the traffic data of one location and each column represents a time slice. Both used 1D CNN to extract the spatial correlation of this matrix traffic data. Ma, Xiaolei et al. [62] used the same matrix modeling method as [85] and [93]; this organization of the matrix enables a 2D CNN to learn the temporal and spatial correlation simultaneously. Dai, Xingyuan et al. [12] used a more complex matrix modeling of multi-section traffic data. They then used 2D CNN to mine the spatial correlation between the station to be predicted and its surrounding stations, and used 1x1 2D convolution to extract the temporal characteristics of the traffic data of each station. There are also studies that use 2D CNN to extract only the spatial or only the temporal characteristics of the matrix traffic data, depending on how the matrix is organized [32,59,60].
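The role of matrix organization can be illustrated with a toy location-by-time matrix. The sketch below is an assumption-laden illustration, not the architecture of any cited work: the speed values and the fixed averaging kernel are hypothetical (a real CNN learns its kernels), and the convolution is the cross-correlation form commonly used in deep learning libraries.

```python
def conv1d(sequence, kernel):
    """Valid-mode 1D convolution (cross-correlation form) over a sequence."""
    k = len(kernel)
    return [sum(kernel[j] * sequence[i + j] for j in range(k))
            for i in range(len(sequence) - k + 1)]


# Rows are locations (detectors), columns are time slices; values are speeds.
speeds = [
    [60, 58, 40, 35],
    [62, 55, 38, 30],
    [65, 60, 50, 42],
]
kernel = [0.5, 0.5]  # hypothetical fixed kernel

# Sliding along a row mixes neighbouring time slices (temporal features).
temporal = [conv1d(row, kernel) for row in speeds]
# Sliding along a column mixes neighbouring locations (spatial features).
spatial = [conv1d(list(col), kernel) for col in zip(*speeds)]
print(temporal[0])  # features for the first location
print(spatial[0])   # features for the first time slice
```

The same 1D operation therefore yields temporal or spatial features depending only on the axis it slides along, while a 2D kernel slides over both axes at once.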

CNN for Graph Structure Traffic Data
GCN is often used to extract the spatial relationship between the traffic data of the nodes in a graph, and it is the method used by almost all studies based on graph structure traffic data. H. Qiu et al. [68] used GCN to learn the spatial topological features of graph structure traffic data. Li et al. [52] used GCN to extract the spatial correlation between the traffic flow data of 100 stations. Li, Guopeng et al. [49] built a dynamic graph convolution module based on GCN and the attention mechanism to capture the spatial correlation of traffic flow between nodes in the road network graph. L. Zhao et al. [91] proposed a hybrid model composed of GCN and GRU, in which GCN learns the complex topology of graph structure traffic data in order to capture spatial correlation. Cui, Zhiyong et al. [61] proposed a new structure based on graph convolution, the traffic graph convolution (TGC). TGC can learn not only the spatial correlation among multiple stations in graph structure traffic data, but also the relationship among the attributes of each station.
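To show how a graph convolution mixes the data of neighbouring nodes, the sketch below implements one layer in the widely used symmetric-normalization form, ReLU(D^{-1/2} (A + I) D^{-1/2} X W). This particular propagation rule is an assumption made for illustration; the cited works may use other variants. The three-station road graph, the feature values and the weight are hypothetical.

```python
import math


def normalized_adjacency(adj):
    """Add self-loops to A, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, features, weights):
    """One graph convolution: aggregate neighbours, project, apply ReLU."""
    propagated = matmul(normalized_adjacency(adj), features)
    return [[max(0.0, v) for v in row] for row in matmul(propagated, weights)]


# Three stations along one road: 0 -- 1 -- 2.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
features = [[1.0], [2.0], [3.0]]  # one traffic reading per station
weights = [[1.0]]                 # hypothetical 1x1 weight matrix
out = gcn_layer(adj, features, weights)
print(out)
```

Each output row blends a station's own reading with those of its direct neighbours, which is how the spatial topology of the road network enters the model.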

CNN for Grid Map Traffic Data
Grid map traffic data [87] is a kind of data based on a map and grid division. In this data format, the research area is divided into many grid cells. Each cell represents an area, and the data in the cell represents the traffic data of that area. For this kind of data, CNN is mainly applied in two ways: the first is to use 2D CNN to extract the spatial correlation and leave the temporal correlation to an RNN connected after it [54,63]; the second is to use 3D CNN to extract the temporal and spatial correlation at the same time [24,92]. Since the second method extracts the spatio-temporal correlation simultaneously, it can generally achieve higher prediction accuracy than the first. However, due to the numerous parameters of 3D CNN, the training cost of the second method is relatively higher.
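As an illustration of how grid map traffic data can be built, the sketch below rasterizes GPS records into a grid of mean speeds for one time slice. The coordinates, the cell size, the function name and the choice to fill empty cells with 0.0 are all hypothetical assumptions; the cited works may construct their grids differently.

```python
def rasterize(records, lat_min, lon_min, cell_size, n_rows, n_cols):
    """Average the speeds of the GPS records that fall into each grid cell.

    records: iterable of (latitude, longitude, speed) tuples.
    Cells without any record are filled with 0.0 (an assumption).
    """
    sums = [[0.0] * n_cols for _ in range(n_rows)]
    counts = [[0] * n_cols for _ in range(n_rows)]
    for lat, lon, speed in records:
        row = int((lat - lat_min) // cell_size)
        col = int((lon - lon_min) // cell_size)
        if 0 <= row < n_rows and 0 <= col < n_cols:  # ignore points outside the area
            sums[row][col] += speed
            counts[row][col] += 1
    return [[sums[r][c] / counts[r][c] if counts[r][c] else 0.0
             for c in range(n_cols)] for r in range(n_rows)]


# Hypothetical floating-car records within one time slice.
records = [
    (39.91, 116.31, 40.0),
    (39.92, 116.33, 20.0),
    (39.97, 116.31, 50.0),
    (39.93, 116.37, 10.0),
    (40.50, 116.31, 99.0),  # outside the study area, skipped
]
grid = rasterize(records, lat_min=39.90, lon_min=116.30,
                 cell_size=0.05, n_rows=2, n_cols=2)
print(grid)
```

One such grid per time slice resembles a single-channel image, which suits 2D convolution; stacking the grids of consecutive time slices adds the time axis that 3D convolution operates on.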

RNN
At present, the mainstream application of RNN is to act as a component that extracts the temporal characteristics of data in a combined model. However, the recent literature [14,17,23] has introduced a new kind of model into this field: RNNs combined with convolution operations. This is a breakthrough for the application of RNN in this field, because such RNN models can themselves extract spatio-temporal correlation.

RNN for Extracting Temporal Characteristics
This section covers traditional RNN models without convolution, such as LSTM, GRU and Bi-LSTM. These RNN models extract the temporal characteristics of traffic data in two main ways: 1) as a single prediction model [78,86,69,90]; 2) as the component that extracts temporal characteristics in a combined model [55,85,93,32,59,60,68,52,11,61,33,8]. The first method is only suitable for short-term traffic prediction on a single road section or on data with few spatial characteristics. The second method is usually used when the spatial relationships in the traffic data are more complicated, because an RNN used alone as the prediction model cannot model and extract the spatial characteristics of traffic data well.
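The way a traditional RNN summarizes the history of a single road section can be sketched with a GRU cell, as below. This is a minimal forward pass under one common formulation of the GRU update (conventions for the final interpolation differ between libraries); the randomly initialized parameters and the normalized flow values are hypothetical, and a real model would learn the parameters and add an output layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, p):
    """One GRU update: gates decide how much of the old state to keep."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])             # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])             # reset gate
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])  # candidate state
    return (1.0 - z) * h + z * h_cand

rng = np.random.default_rng(0)
n_in, n_hidden = 1, 8
params = {}
for gate in "zrh":
    # Hypothetical random parameters; in practice these are learned.
    params["W" + gate] = rng.normal(scale=0.1, size=(n_hidden, n_in))
    params["U" + gate] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    params["b" + gate] = np.zeros(n_hidden)

flow = np.array([0.42, 0.45, 0.51, 0.60, 0.58])  # hypothetical normalized flow readings
h = np.zeros(n_hidden)
for value in flow:
    h = gru_step(np.array([value]), h, params)   # h accumulates the temporal context
```

The final state `h` depends only on the sequence of one location, which illustrates why this kind of model carries no spatial information unless it is combined with another component.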

RNN for Extracting Spatiotemporal Characteristics
With the introduction of RNNs combined with convolution operations, RNN can now extract the spatiotemporal characteristics of traffic data. The main data formats these new RNNs face are graph-structured traffic data and grid map traffic data; the corresponding models are GCGRU for the former, and ConvGRU and ConvLSTM for the latter. K. Guo et al. [23] used GCGRU to extract the spatiotemporal characteristics of graph-structured traffic data. G. Li et al. [49] proposed a new RNN cell based on GCGRU to fully capture the spatiotemporal characteristics of road network traffic speed data. L. N. N. Do et al. [14] used ConvGRU to extract the temporal and spatial characteristics of grid map traffic data simultaneously. To our knowledge, A. E. Essien et al. [17] were the first to apply ConvLSTM to short-term traffic prediction. Their experimental results show that ConvLSTM achieves better prediction results than a model that connects a CNN and an RNN. T. Jia et al. [34] used ConvLSTM to capture the near temporal and spatial characteristics of road network traffic flow.
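The idea behind such cells can be sketched by replacing the matrix multiplications of a GRU with graph convolutions, so that every gate mixes information across neighboring nodes. The code below is a GCGRU-style illustration under simplifying assumptions (a row-normalized adjacency matrix with self-loops, one-hop aggregation, no bias terms); the cells proposed in [23,49] may be defined differently. The graph, the input values and the parameters are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_conv(adj_norm, x, w):
    """Aggregate features over neighbors, then project them."""
    return adj_norm @ x @ w

def gcgru_step(adj_norm, x, h, p):
    """GRU update in which every linear map is a graph convolution."""
    z = sigmoid(graph_conv(adj_norm, x, p["Wz"]) + graph_conv(adj_norm, h, p["Uz"]))
    r = sigmoid(graph_conv(adj_norm, x, p["Wr"]) + graph_conv(adj_norm, h, p["Ur"]))
    h_cand = np.tanh(graph_conv(adj_norm, x, p["Wh"]) + graph_conv(adj_norm, r * h, p["Uh"]))
    return (1.0 - z) * h + z * h_cand

# Hypothetical graph: three road segments connected in a chain 0-1-2.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
a_hat = adj + np.eye(3)
adj_norm = a_hat / a_hat.sum(axis=1, keepdims=True)  # row-normalized (an assumption)

rng = np.random.default_rng(0)
n_feat, n_hidden = 1, 4
params = {}
for gate in "zrh":
    # Hypothetical random parameters; in practice these are learned.
    params["W" + gate] = rng.normal(scale=0.1, size=(n_feat, n_hidden))
    params["U" + gate] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))

# Hypothetical normalized speeds: 5 time steps x 3 segments x 1 feature.
speeds = rng.uniform(0.2, 0.9, size=(5, 3, n_feat))
h = np.zeros((3, n_hidden))
for x_t in speeds:
    h = gcgru_step(adj_norm, x_t, h, params)  # state per node, shaped (3, n_hidden)
```

Because `adj_norm` appears inside each gate, the hidden state of one segment is influenced by its neighbors at every time step, so spatial and temporal dependencies are modeled in the same recurrence.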

DBN and SAE for Short-term Traffic Prediction Under the IoV Environment
Among the literature we collected, some studies use DBN or SAE to solve the short-term traffic prediction problem in traditional environments [36,56,57]. However, the main application of these models in this field is short-term traffic prediction in the IoV environment. Goudarzi et al. [21] applied DBN to short-term traffic prediction in the IoV environment, building a predictive model composed of a three-layer DBN for vehicle-to-vehicle (V2V) communication [18][19]. Kong et al. [80] put forward a traffic data preprocessing and prediction scheme for the IoV system that takes the chaotic nature [81][82] of traffic data into account. They reconstructed the chaotic traffic data using phase space reconstruction [39], and then used a DBN for training and prediction on the reconstructed data. To address the large scale and high dimensionality of traffic data in the IoV environment, Chen et al. [8] used feature engineering and a spectral clustering algorithm [3] to compress a large, multi-dimensional road network traffic dataset to about 20% of its original size, and then used a combined model composed of SAE and LSTM for training and prediction.
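Phase space reconstruction is commonly realized as a time-delay embedding, in which each point of the reconstructed space collects several lagged values of the series. The sketch below illustrates this idea only; the embedding dimension and delay are hypothetical choices, and the way [80] selects them is not covered here.

```python
import numpy as np

def delay_embed(series, dim, delay):
    """Build vectors [x(t), x(t + delay), ..., x(t + (dim - 1) * delay)]."""
    series = np.asarray(series, dtype=float)
    n_vectors = len(series) - (dim - 1) * delay
    if n_vectors <= 0:
        raise ValueError("series too short for the chosen dim and delay")
    # Column i holds the series shifted by i * delay.
    columns = [series[i * delay : i * delay + n_vectors] for i in range(dim)]
    return np.stack(columns, axis=1)

series = [1, 2, 3, 4, 5, 6]                      # hypothetical traffic readings
embedded = delay_embed(series, dim=3, delay=2)   # rows: [1, 3, 5] and [2, 4, 6]
```

Each row of `embedded` can then serve as one input sample for a downstream model such as a DBN.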

Attention Mechanism Module for Improving Model Prediction Accuracy
In short-term traffic prediction, the attention mechanism module is usually used as an auxiliary module to improve the prediction accuracy of the model, and it is mainly applied in two ways. First, the attention mechanism is used on its own to strengthen the model's focus on the most relevant data, thereby improving prediction accuracy. Kong [43] designed an item-level attention to determine the contribution of the neurons in a DBN for more accurate traffic prediction. Li et al. [52] used an attention mechanism module in the last layer of the model to help the model focus on more important features and make the prediction more accurate.
Second, the attention mechanism is placed in the temporal or spatial characteristic extraction module of the model to better extract the temporal and spatial characteristics of traffic data, thereby improving prediction accuracy. Wu [85] used the attention mechanism module to score the correlation between past traffic data and future traffic data. Do et al. [14] designed a temporal attention module and a spatial attention module to better capture the temporal and spatial correlations. Zheng [93] introduced the attention mechanism into the Conv-LSTM module to automatically explore the importance of traffic data sequences at different times.
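A temporal attention module of the kind described above can be sketched as follows: each historical hidden state is scored against a query vector, the scores are normalized with a softmax, and the states are averaged with the resulting weights. This is a generic scaled dot-product formulation chosen for illustration, not the specific design of any surveyed model; the hidden states and the query are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the maximum for numerical stability
    return e / e.sum()

def temporal_attention(hidden_states, query):
    """Weight each time step by its similarity to the query."""
    scores = hidden_states @ query / np.sqrt(query.shape[0])  # one score per time step
    weights = softmax(scores)                                 # non-negative, sums to 1
    context = weights @ hidden_states                         # weighted average of states
    return context, weights

# Hypothetical hidden states for three past time steps (two features each).
hidden_states = np.array([[1.0, 0.0],
                          [0.0, 1.0],
                          [1.0, 1.0]])
query = np.array([1.0, 0.0])  # hypothetical representation of the prediction target

context, weights = temporal_attention(hidden_states, query)
```

Here the first and third time steps receive equal weights that are larger than that of the second, because they align better with the query; the weights thus indicate which past moments the model attends to.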

Challenges
Compared with traditional mathematical statistics models and machine learning models, deep learning models achieve higher prediction accuracy in short-term traffic prediction, but the development of deep learning models in this field still faces many challenges.

Lack of a Comprehensive Multi-scenario Baseline Dataset
In the short-term traffic prediction problem, although highly available datasets such as the PeMS dataset exist, there is still no comprehensive baseline dataset that covers more traffic scenes and includes more kinds of traffic data, such as floating car data and road network data, together with auxiliary data such as weather and POI data [13]. Without such data for complex scenarios, the deep learning models applied in this field lack the data they need as fuel, which poses a huge challenge to their development for short-term traffic prediction in these scenarios.

Improve the Interpretability of the Model
The interpretability of deep learning models also plays an important role. Lipton et al. [54] pointed out that research on the interpretability of deep learning models can improve the credibility of a model, help researchers better understand the causal relationship between training data and prediction results, and improve the transferability of the model across different datasets. Undoubtedly, these improvements are of great significance to the development of deep learning models in short-term traffic prediction. However, how to improve the interpretability of deep learning models and further explore their superiority and rationality remains a serious challenge.

Processing Multi-source Heterogeneous Data in the Future IoV Environment
According to [37], there are five communication modes in the IoV environment, namely vehicle-to-vehicle (V2V), vehicle-to-roadside (V2R), vehicle-to-infrastructure of cellular networks (V2I), vehicle-to-personal devices (V2P), and vehicle-to-sensors (V2S), and each mode has corresponding communication protocols, as shown in Table 4. Therefore, the traffic data collected in this environment comes from many sources, is large in scale and takes various forms. At present, the data used for short-term traffic prediction in the IoV environment is still of relatively few types. In the future, however, IoV will become more mature, and the data in this environment will become more transparent to researchers. Facing massive multi-source traffic data from users, vehicles and the road environment, how to build a suitable deep learning model that can fully exploit the value of these data for short-term traffic prediction will become a major challenge.

Table 4 The communication protocol or method of each communication mode
V2S: Ethernet/Wi-Fi/MOST

Development Trends
In general, the development trend of deep learning models [77,82] in the field of short-term traffic prediction is toward increasingly complex model structures. For example, models are gradually changing from single models to combined models, and from models that process one-dimensional traffic time series data to models that can process two-dimensional and three-dimensional traffic data or graph-structured traffic data [66]. We believe that this trend results from in-depth research on short-term traffic prediction and from the development and improvement of related software and hardware. Although this trend will bring huge challenges to the construction and training of models, we believe that it will continue as research deepens, and that more and more new deep learning models (such as those described in [9,16,2,89]) will be introduced into this research field.

Conclusion
This survey collected the recent literature on the use of deep learning models to solve the short-term traffic prediction problem. Based on the study of this literature, we introduced the commonly used datasets, the mainstream deep learning models and their applications in this field. We then discussed the development challenges and future trends of deep learning models in short-term traffic prediction.
In this survey, we draw the following main conclusions. First, CNN and RNN are the most commonly used deep learning models in short-term traffic prediction. CNN is good at capturing the spatial characteristics of traffic data, and RNN is good at capturing the temporal characteristics of traffic data. Second, the development of current and future deep learning models for short-term traffic prediction faces many challenges, such as the lack of comprehensive baseline datasets, the poor interpretability of models, and the difficulty of capturing the characteristics of multi-source heterogeneous traffic data. Finally, we find that building combined deep learning models with increasingly complex structures is the current development trend of deep learning models in this field.
Applying deep learning models to solve the short-term traffic prediction problem has become the mainstream scheme and research focus in this field. It is hoped that in the future, more and more deep learning models with interpretability, robustness and high prediction accuracy will be proposed to solve the short-term traffic prediction problem, further promote the development of this field, help relieve urban road congestion and facilitate the travel of urban residents.