Abnormal Human Behavior Detection in Videos: A Review

Modeling human behavior patterns for detecting abnormal events has become an important domain in recent years. Many efforts have been made to build smart video surveillance systems for scene analysis and for drawing correct semantic inferences from moving targets in video. Current approaches have shifted from rule-based to statistical-based methods, driven by the need for efficient recognition of high-level activities. This paper presents not only an update expanding previous related research, but also a study covering both behavior representation and event modeling. In particular, we provide a new perspective on event modeling, dividing the methods into the following subcategories: modeling normal events, prediction models, query models and deep hybrid models. Finally, we describe the available datasets and popular evaluation schemes used for abnormal behavior detection in intelligent video surveillance. Further research, e.g. on deep generative networks and weakly-supervised learning, will promote the development of abnormal human behavior detection; such work is clearly encouraged and dictated by applications of supervision and monitoring in private and public spaces. The main purpose of this paper is to survey recent available methods and present the literature in a way that brings the key challenges into notice.


Introduction
In modern intelligent video surveillance systems, research mainly focuses on abnormal detection, virtual reality and video stitching [8,100,102]. Among these, abnormal detection has important application scenarios and has attracted much attention within the past twenty years [9,82]. Abnormal detection in video is vital to ensure security in both indoor and outdoor spaces (e.g. campuses, waiting halls and shopping malls). As cameras have been widely installed, the task of supervising multiple monitors becomes much more difficult for security staff due to decreased concentration and fatigue. In addition, abnormal events are extremely rare and infrequent, which makes the supervision task even more difficult and challenging.
Anomaly detection in video parses temporal sequences of object observations to generate high-level descriptions of agent actions and multi-agent interactions. Detecting abnormal events requires modeling complex visual patterns, and some patterns can be learned with long-term temporal relationships and causal inference [79]. In fact, some previous reviews of intelligent video surveillance systems have been published on the subject of abnormal detection [14,130]. Our survey is related to them but differs in many ways. For example, [130] carried out an impressively broad survey on the creation of intelligent distributed automated surveillance systems. However, such surveys only focus on one perspective of anomaly detection and are not detailed or systematic enough. Furthermore, with the development of anomaly detection in video surveillance in recent years, deep generative models, including the variational auto-encoder (VAE) [69], generative adversarial networks (GANs) [49] and other methods, have become an important domain, and a more detailed analysis is necessary. We summarize our main contributions as follows:
1 Most of the existing studies focus either on a particular application domain or on specific contexts of human activity [2,28]. Our work aims to provide a comprehensive outline of the advanced research in abnormal human behavior detection, as well as several applications in which these techniques are used.
2 Recently, many novel methods for detecting abnormal behavior in videos with excellent performance (e.g. deep learning [60,136]) have been proposed. We survey this research and classify it into an organized framework for better understanding, facilitating the reader in viewing and retrieving the text.
Our work presents an extensive and structured review of anomaly detection technology in video surveillance. The review is structured into five sections.
In Section 2, the definition of anomalies in videos and related surveys are presented. Section 3 and its subsections illustrate the representative approaches. In Section 4, the popular datasets and performance evaluation of previous works are provided. Finally, the conclusions and comments for further research are presented in Section 5.

Definition
Anomalies are also known as abnormalities, discordants or outliers in the data mining and statistics literature. Depending on the nature of the input data (e.g. sequential data such as video, voice and protein sequences; non-sequential data such as images, age and other attributes), anomaly detection has been used for a diverse set of tasks (e.g. video surveillance, image analysis, healthcare, sensor networks). However, anomalies in video surveillance differ somewhat from those studied in data mining and statistics: video surveillance needs to take into account the surrounding environment and infer the type of event in a scene. For example, while it is "normal" for people to walk across a pedestrian walkway while the traffic light is green, the same motion activity is viewed as "abnormal" when the traffic light turns red. Essentially, detection needs to provide high-level semantic information, which involves specific contexts, scenes and temporal-spatial information [126,132]. Abnormal behavior detection can be viewed as a high-level image processing operation, in which logical information is extracted from the input video data. Fig. 1 shows some different kinds of anomalies in various contexts.
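The context dependence of the traffic-light example can be made concrete with a toy labelling rule (the function name and context flag below are purely illustrative, not from any cited system):

```python
def is_abnormal(action: str, traffic_light: str) -> bool:
    """Toy sketch: the same action is normal or abnormal depending on context."""
    if action == "cross_walkway":
        # Crossing the walkway is only flagged when the light is red
        return traffic_light == "red"
    return False
```

Real systems must infer such context (scene type, time, co-occurring events) from the video itself rather than receive it as a flag.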
Given this backdrop, the definition in [147] presents an abnormal event as one deviating from the normal model. Valantinas et al. [112] defined anomalies as events that are unusual and signify irregular behavior. To date, there is no universally accepted definition for abnormal event detection. Among many studies, abnormal behavior can be classified into the following categories:
1 One or more behaviors that are explicitly specified, such as designating falls as abnormal behavior [68].
2 Events that deviate qualitatively from what is considered to be normal; for example, if only walking is normal in a scene, then running, falling or loitering is regarded as an anomaly [29,152].
3 Events that happen with a low frequency (probability), i.e., they are rare in nature [121].

Related Surveys
Video surveillance, which involves capturing and processing visual data from a scene to detect objects over time and location for the purpose of recognizing interesting situations, has attracted growing research attention. Several previous surveys and conferences have been published (see Table 1). The frequency of publications on the topic of anomaly detection in video (timespan: 2000.1-2020.9) is shown in Fig. 2. Many studies explain how such technologies can help with social security concerns and the monitoring of public places. Some surveys have emphasized deep learning based methods [19,71,107].

Figure 1
Examples of anomaly event in various contexts


Table 1
Previous surveys on related topics
- Visual surveillance in dynamic behavior [56]
- Behavior analysis [73]
- Video event understanding [75]
- Action classification [103]
- Human activity recognition [66]
- Deep learning, video feature representation [26]
- Deep learning, human activity recognition [5,47]
- Fixed and moving cameras [16]
- Human behavior detection [98,99]

For example, Ben et al. [14] did a comprehensive review of abnormal behavior recognition, similar to ours, in which the methods were grouped into behavior representation and behavior modeling. The methods in crowd surveillance videos were surveyed in [2]. Some of the reviews represent problem- or application-specific work, e.g., fixed and moving cameras [16], 2D and 3D approaches [30], components of a surveillance system [129], and spatio-temporal interest points [80]. We provide a complete overview of state-of-the-art human behavior detection surveys.
Information Technology and Control 2021/3/50

Although many works have reviewed abnormal behavior detection for intelligent video surveillance, a comprehensive outline of the recent advances is still lacking. For instance, many researches focus on a particular application domain, e.g. dynamic behavior [56], crowded scenes [2], deep learning [26] and so on. However, methods based on spatio-temporal interest points or contextual information are insufficient to describe and detect ongoing human activities with complex structures. In this review, we concentrate on high-level activity recognition methodologies designed for the analysis of abnormal human behavior and discuss recent research trends in activity recognition. We hope that our survey bridges this gap.

We use a more detailed taxonomy and compare each approach category. For example, differences between handcrafted feature approaches and learned feature approaches are discussed in our review. We also compare the abilities of the event modeling and detection methods within each class, pointing out their advantages and disadvantages. Furthermore, we discuss the public datasets used by the systems and compare the different evaluation metrics and performance on these datasets, which some previous reviews have not focused on.

Representative Approaches
In this section, we provide a comprehensive outline of the representative approaches in abnormal behavior detection. The related research mainly involves two stages: feature extraction and quantization (ranging from handcrafted features to learned features), and event modeling and detection. The former is used to obtain an informative description of the target scene. For the latter, differently from other surveys, we organize the existing modeling approaches into four categories; this is the key point that we explain here. The section is organized following the structure described in Fig. 3.
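The two-stage structure can be sketched schematically; `extract` and `model_score` below stand in for any of the concrete feature extractors and event models surveyed in this section (all names are illustrative):

```python
import numpy as np
from typing import Callable, Sequence

def detect_abnormal(frames: Sequence[np.ndarray],
                    extract: Callable[[np.ndarray], np.ndarray],
                    model_score: Callable[[np.ndarray], float],
                    thresh: float) -> np.ndarray:
    """Stage 1: feature extraction and quantization; stage 2: event scoring."""
    feats = [extract(f) for f in frames]                # per-frame descriptors
    scores = np.array([model_score(x) for x in feats])  # abnormality scores
    return scores > thresh                              # True marks abnormal frames
```

Every method reviewed below is, at this level of abstraction, a particular choice of `extract` and `model_score`.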

Figure 3
Taxonomy of existing approaches for abnormal behavior detection

Feature Extraction and Quantization
The central issue at the feature extraction and quantization level is extracting effective and discriminative features to represent the space-time volumes [5]. According to the multifarious methods of feature selection, they range from handcrafted features to learned features. Handcrafted features imitate the qualities of human vision and are used to distinguish between interesting and non-interesting areas of a scene; in general, they carry specific physical meaning [145]. For example, the scale-invariant feature transform (SIFT) and the histogram of oriented gradients (HOG) reflect the motion variation and shape information of images, respectively [155]. Furthermore, a frame is 2-D data formulated by projecting a 3-D real-world scenario, and it covers the spatial information (e.g., shapes and locations) of video objects; a video is a sequence of such 2-D frames placed in temporal order. In order to better adapt to the high-dimensional characteristics of video data, scholars have extended these descriptors to the 3-D scale, i.e. 3-D SIFT and 3-D HOG. Learned features (e.g. deep learning) are a set of techniques that allow detection systems to automatically discover the representations needed for the detection or classification task from raw data [157]. Most of them combine multiple features to enhance the effect of feature extraction. In the application areas of interest, [8] integrated the Moravec corner point detector with the SIFT feature extractor. Deep learning and spiking neural networks are excellent methods in the machine learning field. Wu et al. [145] proposed a fast sparse coding network, and a two-stream neural network was used to extract spatial-temporal fusion features (STFF).
Spatial-temporal approaches represent features by analyzing the space-time volumes of video data. The most common strategy is constructing a 3-D space-time volume model representing the information in the videos. The most commonly used features for behavior representation are presented in Table 2.
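As an illustration of a handcrafted descriptor, the sketch below computes an unsigned HOG-style orientation histogram for a single image cell. It is a deliberate simplification of the full HOG pipeline (which adds cell grids, block grouping and block normalization):

```python
import numpy as np

def hog_cell(patch: np.ndarray, n_bins: int = 9) -> np.ndarray:
    """Unsigned orientation histogram of one cell (a minimal HOG sketch)."""
    gx = np.zeros_like(patch, dtype=float)
    gy = np.zeros_like(patch, dtype=float)
    gx[:, 1:-1] = patch[:, 2:] - patch[:, :-2]   # central differences, x
    gy[1:-1, :] = patch[2:, :] - patch[:-2, :]   # central differences, y
    mag = np.hypot(gx, gy)                       # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0 # unsigned orientation [0, 180)
    hist = np.zeros(n_bins)
    bin_width = 180.0 / n_bins
    idx = (ang // bin_width).astype(int) % n_bins
    for b in range(n_bins):
        hist[b] = mag[idx == b].sum()            # magnitude-weighted voting
    return hist / (np.linalg.norm(hist) + 1e-6)  # L2 normalization
```

A vertical intensity edge produces a horizontal gradient, so its votes land in the 0-degree bin; stacking such histograms over a grid of cells (and over time, for 3-D HOG) yields the behavior descriptors discussed above.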

Event Modeling and Detection
With the development of abnormal detection research, the methods of event modeling and detection have shifted from rule-based to statistical-based [160]. Rule-based methods designate certain behaviors, or a built model, as abnormal. They can be quite valid in situations where normal behaviors are well defined and constrained. However, in real-world video data, the number of different normal behavior categories can easily surpass what is considered suspicious [23]. Statistical-based methods are sufficient for learning the statistical properties of behavior patterns and are well suited to describing suspicious events; hence, statistical-based methods are preferred in our research. Fast and accurate abnormal event detection is greatly valuable in a large number of scenarios, and the core is recognizing the type of behavior performed by the target in the video. Accordingly, we divide the event modeling and detection approaches into several categories: modeling normal event, prediction model, query model, and deep hybrid model. The details are described as follows.
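The rule-based versus statistical-based contrast can be shown with a toy speed feature. The numbers, the hand-set limit and the 3-sigma test below are illustrative only, not taken from any cited work:

```python
import numpy as np

# Rule-based: flag any speed above a hand-set limit
def rule_score(speed: float) -> bool:
    return speed > 8.0

# Statistical-based: fit a Gaussian to observed normal speeds,
# then flag values far from the learned distribution
normal_speeds = np.array([1.2, 1.5, 0.9, 1.1, 1.4, 1.3, 1.0, 1.6])
mu, sd = normal_speeds.mean(), normal_speeds.std()

def stat_score(speed: float, k: float = 3.0) -> bool:
    return abs(speed - mu) > k * sd
```

A running person at 5.0 units escapes the hand-set rule entirely, but the statistical model flags it because it lies far outside the learned normal pattern; this is the advantage the survey attributes to statistical-based methods.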

Modeling Normal Event
In anomaly detection, modeling events from a normal training dataset is a generally adopted device. At this phase, features extracted from normal events in the training dataset are used to build a normal event model, which scholars treat as a pattern learning problem [117]. Specifically, this means finding suitable matches with a priori behavior patterns (templates), or learning statistical models of the behaviors in a video. The sketch of the common modeling-normal-event scheme is shown in Fig. 4. These methods can be categorized into reconstruction-, domain-, probabilistic- and distance-based methods. Some excellent algorithms are presented in Table 3.

Figure 4
The sketch of common modeling normal event in video sequences

Table 3
The common methods in modeling normal event

Reconstruction-based.
Reconstruction-based methods compute the deviation between test data and the normal patterns, which is similar to the outlier detection problem [18]. Calculating the deviation of one behavior pattern from the others can be done in different ways [146]. For example, Li et al. [79] used a joint feature representation and a hierarchy of mixtures of dynamic textures models; results show that the method achieves the anticipated goal when compared with the state of the art. For sudden illumination changes, Cermeno et al. [18] matched the labeled scenes with previously learned examples. Several previous works focus on finding a group of bases to represent normal data and recognizing data with high reconstruction error, e.g., sparse coding [64,74] and the auto-encoder (AE) [63].

Specifically, in sparse coding methods, for input video data X = {x_1, x_2, ..., x_n}, an over-complete dictionary D ∈ R^(d×k) is constructed using the normal data, where d ≪ k and k is the number of basis atoms in the dictionary. The optimization function is

min_{D,A} ||X − DA||_2^2 + λ ||A||_1, (1)

where A = {a_1, a_2, ..., a_n} ∈ R^(k×n) is the sparse representation of X. We can use a sparse linear combination of the basis atoms in dictionary D to reconstruct a test sample y ∈ R^d:

a* = argmin_a ||y − Da||_2^2 + λ ||a||_1, (2)

where a* ∈ R^k contains the reconstruction weights of data y, and the reconstruction cost of y is

e(y) = ||y − Da*||_2^2. (3)

Sparse coding methods are mainly divided into fixed base dictionaries (e.g. wavelets and the discrete cosine transform, DCT [35]) and learned dictionaries (e.g. generalized principal component analysis (PCA), the method of optimal directions (MOD) [37], and K-singular value decomposition (K-SVD) [64]). A fixed base dictionary cannot provide the best match to the structural features of images, so features are difficult to extract effectively under noisy conditions. The latter can generate different dictionaries for different types of signals and has strong adaptive ability. Li et al. [78] introduced trajectory sparse reconstruction analysis (SRA); the defect of this method is that its detection performance is influenced by the control point parameter. Luo et al. [86] built a temporally coherent sparse coding method. However, the dictionary is trained with only normal events and is generally over-complete, which cannot ensure the expected behavior on anomalies.

Thanks to deep learning, recent research is able to make the best of large-scale datasets and powerful computing resources. The deep auto-encoder (AE) is widely used: an encoding-decoding neural network is trained by minimizing the reconstruction error [85,142]. For an input x_i ∈ R^d and the corresponding output y_i, the network is trained to reconstruct the training data at its output and to minimize the reconstruction error e:

e = Σ_i ||x_i − y_i||_2^2. (4)

However, without additional constraints the learned mapping can degenerate to the identity [48,51]. Some works used denoising AEs to circumvent this limitation [94,136]. Others use deep architectures to learn a compressed representation of the training data by reducing the number of hidden units [60].
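The sparse-coding scoring step can be exercised with a small ISTA solver. The dictionary below is random rather than learned from normal events (e.g. by K-SVD), so this is only an illustration of computing a reconstruction-error score, not a reimplementation of [78] or [86]:

```python
import numpy as np

def sparse_code(y: np.ndarray, D: np.ndarray,
                lam: float = 0.01, n_iter: int = 500) -> np.ndarray:
    """ISTA sketch for a* = argmin_a 0.5*||y - D a||^2 + lam*||a||_1."""
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = a - D.T @ (D @ a - y) / L          # gradient step on the quadratic term
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)  # soft-thresholding
    return a

def reconstruction_cost(y: np.ndarray, D: np.ndarray, lam: float = 0.01) -> float:
    """Abnormality score: squared residual of the sparse reconstruction."""
    a = sparse_code(y, D, lam)
    return float(np.sum((y - D @ a) ** 2))
```

A sample drawn from the dictionary's span reconstructs with near-zero cost, while patterns poorly expressed by the learned atoms yield a high cost, which is exactly the anomaly criterion above.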
Such an operation may ignore the 2-D structure of video frames, and the resulting features lead to redundancy in the network parameters. To cope with this issue, the convolutional AE (CAE) architecture was proposed, in which the weights are shared among all spatial positions in the input [95]. The loss function is

L(x, y) = ||x − y||_2^2 + λ ||W||_2^2,

where λ is the parameter for the regularization term ||W||_2^2. Other related works use 3-D information to analyze temporal and spatio-temporal irregularities in videos [152]. Wang et al. [142] combined a deep AE network with a 3-D CNN to model the spatio-temporal information in videos. Most AE-based research is built to learn a spatio-temporal representation of videos [114]. However, spatio-temporal irregularities are difficult to analyze in video frames, as they are commonly not properly defined and do not occur frequently. Khan et al. [68] used an adversarial learning framework containing a spatio-temporal AE and a spatio-temporal convolution network. Yan et al. [153] worked on a two-stream recurrent VAE, in which each stream models the probabilistic distribution of the normal samples with a recurrent VAE; however, it failed in the pixel-level detection scheme. Deep learning based methods make a breakthrough in anomaly detection by employing deep features in reconstruction. Song et al. [124] introduced an AE combined with an attention model to build normal patterns. Note, however, that in a campus scene riding a bike is a novel behavior if it has not appeared in the normal model, yet it should not be treated as an anomaly. Rather than computing the deviation, Xu et al. [149] proposed an adaptive intra-frame classification network in which the one-class deviation problem is translated into a multi-class classification problem. Yong et al. [154] put forward a deep neural network (DNN), and Zhao et al. [159] proposed a spatio-temporal AE network.
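Since a linear auto-encoder trained with squared loss learns the same subspace as PCA, the reconstruction-error idea can be sketched in closed form. The data and dimensions below are synthetic; a real (convolutional) AE would replace the projector with a trained encoder-decoder:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "normal" training data lying near a 2-D subspace of R^8
W = rng.normal(size=(8, 2))
X = rng.normal(size=(500, 2)) @ W.T + 0.01 * rng.normal(size=(500, 8))

mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
P = Vt[:2].T @ Vt[:2]        # encode-decode round trip as a rank-2 projector

def recon_error(x: np.ndarray) -> float:
    """Abnormality score: squared norm of the part the model cannot rebuild."""
    r = (x - mu) - (x - mu) @ P
    return float(r @ r)
```

A test sample consistent with the normal subspace yields a near-zero score, while a sample with energy outside that subspace yields a large one, mirroring how AE reconstruction error is thresholded in the cited works.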
We note that the Markov random field (MRF), the Gaussian mixture model (GMM) and the hidden Markov model (HMM) are also widely used for anomaly detection in videos [61,96]. Take the GMM for example: normal data can be linked with at least one Gaussian component of the mixture, while abnormal data cannot belong to any component. It is usual to train a regression model using the training data; when new video data are mapped to the regression model, the reconstruction error is regarded as the abnormality score [39]. The procedure can commonly be divided into three steps [42]: 1. Select a Gaussian component G ~ Category(π); 2. Obtain a latent vector v ~ N(μ_G, σ_G²); 3. Calculate the reconstruction result x': p(v, G) = π_G N(v | μ_G, σ_G²), [μ; log σ²] = f(v; θ), x' ~ N(μ, σ²), where k is the number of components of the mixture, π is the prior probability and N(μ, σ²) is a Gaussian distribution parameterized by mean μ and covariance σ². It relies on observation variables and tries to retrieve those observation variables which are already fixed at the beginning. On the other hand, variants of the GMM, such as Dirichlet-based mixture GMMs [61] and the adaptive GMM [133], do not just depend on observations and capture longer interactions between observations. To alleviate the shortcomings of the GMM, a deep GMM is used in [42].
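A minimal sketch of the GMM view of anomaly detection, assuming the mixture parameters have already been fitted (e.g. by EM): a sample that no component explains, i.e. whose mixture density falls below a threshold, is declared abnormal. The 1-D setting and the threshold value are illustrative.

```python
import math

def gaussian_pdf(x, mu, var):
    # Univariate Gaussian density N(x | mu, var).
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gmm_is_abnormal(x, pis, mus, variances, tau=1e-3):
    # Abnormal if the mixture density sum_k pi_k * N(x | mu_k, var_k)
    # falls below tau, i.e. x cannot be linked to any Gaussian component.
    density = sum(p * gaussian_pdf(x, m, v)
                  for p, m, v in zip(pis, mus, variances))
    return density < tau
```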
Deep learning-based methods make a breakthrough in anomaly detection by employing deep features in reconstruction. Song et al. [124] introduced an AE combined with an attention model to build normal patterns. To handle video sequences containing images that were never seen before, Slavic et al. [123] introduced a VAE for videos. However, it is hard to guarantee that anomalous data yield a larger reconstruction error, because of the strong generalization ability of neural networks. Treating novel behaviors as anomalies is probably one-sided for practical surveillance applications: in a campus scene, riding a bike is a novel behavior if it has not appeared in the normal model, yet it should not be treated as an anomaly. Rather than computing the deviation, Xu et al. [149] proposed an adaptive intra-frame classification network in which the one-class deviation problem was translated into a multi-class classification problem. Yong et al. [154] put forward a deep neural network (DNN) and Zhao et al. [159] proposed a spatio-temporal AE network.
_ Domain-based. Domain-based methods commonly state a region of the normal videos, based on the distributed character of the video data, to describe the domain of the normal samples [72]. One favorite tool in domain-based methods is the support vector machine (SVM). It is a typical algorithm for forming a margin boundary: the ideal hyperplane represents the largest separation (or margin) between different classes [93]. The partitions between classes of normal activities have also been learned using kernel SVMs [104,120]. Most of these methods define a boundary for the normal samples according to the structure of the training data, which is used to represent the domain of the normal category.
Similarly, the kernel one-class SVM is an efficient tool for abnormal behavior recognition [137], and it can also be extended to a nonlinear kernel form [139]. What is remarkable here is the complexity corresponding to the calculation of the kernel functions.
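A radically simplified sketch of the domain-based idea: a hypersphere boundary fitted around the normal data in input space, standing in for a one-class SVM/SVDD (no kernel, no slack variables). The helper names are ours.

```python
import math

def fit_sphere(normal_points):
    # Centre = centroid of the normal data; radius = largest training
    # distance, so every normal training point lies inside the domain.
    d = len(normal_points[0])
    c = [sum(p[i] for p in normal_points) / len(normal_points)
         for i in range(d)]
    R = max(math.dist(p, c) for p in normal_points)
    return c, R

def inside_domain(x, c, R):
    # A test point is "normal" iff it falls inside the learned boundary.
    return math.dist(x, c) <= R
```

A kernelized version would replace the Euclidean distance with a distance in feature space, at the cost of the kernel evaluations noted above.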
Another approach considers the different depths of field in the same scene. Specifically, when an object is closer to the surveillance camera, greater movement will be detected, and when it is far from the camera, the detected movement will be smaller. This problem is illustrated in Fig. 5: the pedestrian in the red rectangle is moving at a constant speed, yet the detected speeds in Figures 5(a)-5(b) are different, which may lead to different motion patterns for the same object. To address this problem, several previous works built block-wise models and trained a normal-event model for each block [88,141]. For example, Cong et al. [27] proposed a novel feature descriptor named the multi-scale histogram of optical flow (HOF), which partitioned the frame into a few basic units. Although block-wise methods achieve excellent performance, they may lead to another problem: if the resolution of a frame is M×N and it is divided into n×n blocks with k pixels overlapping, there will be {(M−k)/(n−k)+1}×{(N−k)/(n−k)+1} features.
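The block count above can be coded directly (integer division here assumes the stride tiles the frame exactly; the function name is ours):

```python
def num_block_features(M, N, n, k):
    # Frame of resolution M x N, n x n blocks, k pixels of overlap:
    # {(M-k)/(n-k)+1} x {(N-k)/(n-k)+1} block features, per the text.
    step = n - k
    return ((M - k) // step + 1) * ((N - k) // step + 1)
```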
Training on such large numbers of features causes expensive time costs and wastes storage space. To overcome the limitations of the above methods, Shi et al. [117] built a coarse-scale, region-wise method in which blocks with similar depth were incorporated into one region and shared one normal-event model. However, the performance of those works is not satisfactory for real-world surveillance applications due to the limitations of few samples and computation.
_ Probabilistic-based. Probabilistic-based methods are commonly based on the amount of information that a test video should contain; they match videos by modeling a probability distribution, and an event that does not match the normal pattern is judged to be abnormal. For an input x ∈ R^d, let P_x(x) be the distribution of the normal data X. The hypothesis test is: H0: x complies with the probability distribution P_x(x); H1: x complies with a distribution outside P_x(x). When P_x(x) < α, H0 is rejected and H1 is accepted, where α is a standardized parameter for unknown distributions.
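The hypothesis test reduces to a density threshold once P_x is estimated. As a sketch, a crude 1-D histogram estimate stands in for a fitted density model; the bin width and function names are illustrative assumptions.

```python
def empirical_density(train, x, bin_width=1.0):
    # Crude histogram estimate of P_x(x) from 1-D normal training samples.
    hits = sum(1 for t in train if abs(t - x) <= bin_width / 2)
    return hits / (len(train) * bin_width)

def reject_h0(train, x, alpha=0.05):
    # Reject H0 (x is normal) when the estimated density P_x(x) < alpha.
    return empirical_density(train, x) < alpha
```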
Different probabilities of behaviors attract different attention in many fields of view [151]. To identify the importance of events from a probabilistic point of view, She et al. [115] proposed a semantic analysis tool, the message importance measure (MIM). The MIM is a generalization of Shannon information theory.

The MIM can be written as

MIM(p) = log Σ_i p(x_i) exp(w(1 − p(x_i))),  (6)

where p(x_i) is a discrete distribution for data x and w is an adjustable parameter. Motivated by the idea of probabilistic methods, Garcia et al. [46] used a GMM to add new behaviors appearing in the environment.
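Assuming the MIM form given in Eq. (6), it is a one-liner; rare events (small p(x_i)) receive the largest exponential weight, which is why the measure highlights low-probability behaviors.

```python
import math

def mim(p, w=1.0):
    # MIM(p) = log sum_i p(x_i) * exp(w * (1 - p(x_i))),
    # assuming the form of Eq. (6); w is the adjustable parameter.
    return math.log(sum(pi * math.exp(w * (1 - pi)) for pi in p))
```

Note that with w = 0 the measure reduces to log Σ_i p(x_i) = 0 for a proper distribution.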

Figure 5
The movement inconsistency of the same object in different locations

It is noted that probabilistic methods approximate the probability density of the normal samples and detect whether new data come from a similar distribution or not. Methods of judging the underlying data density branch out into parametric and non-parametric approaches.
The parametric approaches presume the normal samples are generated from an underlying parametric distribution, and the parameters of that distribution are estimated from the normal data. For example, Yamanaka et al. [151] adopted a binary auto-encoding feature for detection in a low-complexity probabilistic model. Hou et al. [55] introduced a Bayesian hierarchical method to achieve detection. For probabilistic models, anomalous data can be defined as data that lie in low-density regions of the domain of the input training distribution, as in the probabilistic topic method [70] and the hierarchical probabilistic model [4]. However, those methods cause larger errors when the data do not satisfy the assumed distribution. The non-parametric approaches suit such situations, with no need for distributional assumptions, and they have been extended to fit the complexity of video data. This ideology has been used in [156]. Some previous studies analyzed events in videos based on their trajectories [157,52]; Sadeghi et al. [109] extended this to an on-line method. Other research utilized contextual information [110,144]. Wang et al. [138] proposed a motion-information coding algorithm based on image descriptors. It is most likely that developing rules and probability distributions describing behaviors in a complex context would be a difficult task.
_ Distance-based.
Because normal events have few types and similar characteristics, while abnormal events are numerous, some scholars believe that under a certain feature space the distribution of normal events is compact and distinguishable from that of abnormal events. Distance-based methods rely on the assumption that the normal samples form a number of large, tight clusters while the abnormal samples do the opposite [17,87]. Detection is performed by matching the input data against them. To better model the data distribution of normal samples, Chang et al. [21] reduced the distance between video data and the hidden vectors. Ma et al. [87] used a trajectory distance metric based on a recurrent neural network (RNN) to measure similarities and detect anomalies from trajectory data. In most methods, the abnormal score is a measure of the distance of the data from the center of a sphere, and data points far from the center are regarded as anomalous, as in robust invariant distance measures [17] and the distance between cluster centers and a feature vector [11]. The above methods need no a priori knowledge of the data distribution and can work well with noisy features. Based on the motion trajectories of multiple pedestrians, Guo et al. [119] extracted both the distance and the relative speed between trajectories, and the detection results were based on their spatial relationship. Lin et al. [81] extended this to an online weighted clustering algorithm. Chang et al. [22] proposed a novel clustering-driven deep AE method, in which both the reconstruction error and the cluster distance were used to evaluate the anomaly.
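A minimal sketch of the distance-based assumption: given cluster centers fitted on normal data (e.g. by k-means), the anomaly score is the distance to the nearest center, and points beyond a threshold are flagged. Names and the threshold are illustrative.

```python
import math

def anomaly_score(x, centers):
    # Distance to the nearest normal-cluster centre; far points score high.
    return min(math.dist(x, c) for c in centers)

def is_anomalous(x, centers, tau):
    # Flag samples that lie outside every normal cluster's neighbourhood.
    return anomaly_score(x, centers) > tau
```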
In these methods, only motion or appearance features are used to perform clustering, which makes them unsuitable for complex scenes, such as those with disorganized motion directions and over-speed objects. In Table 4, we describe and compare the four categories of modeling-normal-event methods.

Prediction model
Prediction methods detect anomalous events by comparing them with their expectation. Learning future frame prediction in videos involves building an inner representation that simulates image evolution precisely; thus, in a way, it captures both content and dynamics. One of the typical representative works was published in [82]: the authors searched for the best match for the predicted data and determined how abnormal it is. The main contribution of that paper is combining the appearance constraint and the motion constraint with an intensity-gradient loss and an optical-flow loss, respectively (see Fig. 6).
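The prediction-based scoring skeleton can be sketched with a trivial linear-extrapolation predictor standing in for the learned network of [82]; only the scoring structure, not the predictor, reflects the cited method.

```python
import numpy as np

def predict_next(frames):
    # Toy stand-in for a learned predictor: linear extrapolation
    # from the last two frames.
    return 2 * frames[-1] - frames[-2]

def prediction_anomaly_score(frames, actual_next):
    # MSE between the predicted and observed next frame; a large error
    # means the event deviates from its expectation.
    return float(np.mean((actual_next - predict_next(frames)) ** 2))
```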
Another popular method is the long short-term memory (LSTM) AE model, which is similar to the future predictor model [128]. Their frameworks are shown in Fig. 7. The LSTM AE model consists of an encoder LSTM (reading the input feature vectors) and a decoder LSTM (outputting the prediction for the feature vectors in reverse order). The future predictor model is similar, but its decoder LSTM predicts the frames that come after the input feature vectors. For example, Villegas et al. [134] built a natural video sequence prediction method based on an encoder-decoder CNN and ConvLSTM.

Figure 6
The pipeline of the video frame prediction network

Figure 7
Overview of the LSTM AE model (a) and LSTM future predictor model (b)

Figure 8
The framework of the composite LSTM model [125]
Subsequently, Srivastava et al. [125] combined reconstructing the input and predicting the future to create a composite model. Its framework is shown in Fig. 8. However, the precision of object features decreases as the training time scale increases.
The above-mentioned methods mainly favor the appearance constraint and predict future frames directly. Unlike those works, some focus on predicting the transformations required for future frame prediction [24,131,135]. To get better results, Shin et al. [118] proposed a hybrid deep learning model consisting of a video feature extractor and an anomaly detector, whose main limitation was unseen data. In addition to the frequently used spatial constraints on intensity and gradient, Xia et al. [148] developed a feature prediction framework with a novel temporal attention mechanism. Such spatial and temporal constraints promote future frame prediction for normal events, and accordingly help recognize abnormal events that do not match the expectation. Similarly, Chen et al. [25] proposed a framework based on bidirectional prediction; they evaluated the deviation between the predicted frame and the corresponding ground truth to detect abnormal events.
However, the generalization ability of this method needs to be improved, and adaptive adjustment strategies for the hyper-parameters require further research [13,44]. Recently, a lot of great works are credited to the emergence of the generative adversarial network (GAN) [83,89,116]. Trained in a semi-supervised learning mode, GANs have shown excellent promise even with very few labeled data [116]. The basic idea of the GAN is that it is composed of a generator G (a decoder) and a discriminator D (a binary classifier). D and G are simultaneously optimized through the following two-player minimax game with objective function V(G, D):

min_G max_D V(G, D) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 − D(G(z)))].

The generator G learns a distribution over data x through a mapping G(z) of samples z, while the discriminator D is a standard CNN that maps a frame to a single scalar value D(·) [31]. A variant of the GAN is the adversarial AE [49,106]. It uses adversarial training to impose a prior on the latent code learned by the hidden layers of the auto-encoder, which also helps to effectively model the input distribution.
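The value V(G, D) can be estimated by Monte-Carlo averages of the discriminator's outputs on real and generated samples; at the equilibrium D(·) = 1/2 it equals −log 4. A sketch (function name is ours):

```python
import math

def gan_value(d_real, d_fake):
    # V(G, D) ~= mean(log D(x)) over real samples
    #          + mean(log(1 - D(G(z)))) over generated samples.
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term
```

A confident discriminator (D near 1 on real frames, near 0 on generated ones) drives the value up, which is what the maximizing player D seeks and the minimizing player G resists.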
Other works combined an attention mechanism using both "soft attention" and "hard-wired" attention [44], while Batchuluun et al. [13] combined a fuzzy method for behavior recognition. In Table 5, some popular prediction methods are presented.

Table 4
Advantages and disadvantages of modeling normal event methods

Reconstruction-based.
  Description: Computing the deviation between the test data and the normal patterns.
  Advantages: It can be done in different ways and is easier to apply to other scenes.
  Limitations: It is hard to ensure that anomalous data have a larger reconstruction error.

Domain-based.
  Description: It commonly states a region of the normal videos to describe the domain of the normal samples.
  Advantages: It eliminates the effect of depth of field on motion amplitude.
  Limitations: Large time consumption and wasted storage when modeling many normal events.

Probabilistic-based.
  Description: Judging the amount of information that a test video should contain.
  Advantages: It does not require clustering or prior assumptions, in contrast to existing solutions.
  Limitations: The threshold for anomaly detection is difficult to determine.

Distance-based.
  Description: The normal samples belong to one class while the abnormal samples are the opposite.
  Advantages: It is structurally easy to combine with other features.
  Limitations: It has poor robustness and weak scalability.

Query model
Abnormal patterns are the "interesting" objects that attract human observers' attention to a certain extent, and are always easy to recognize. Such salient events stand out because they are unlike the regular patterns in their context. Query methods compose the new video data from spatiotemporal patches extracted from previous data; thus, the regions in the new data which cannot be composed from the previous data are considered anomalies. A typical algorithm [15] presented a new graph-based Bayesian inference method to detect the patches and a probabilistic graphical model to achieve the inference-by-composition task. An area in the query frame is considered applicable if it has a large enough contiguous area of support in the video data. New valid frames can be inferred from the database even though they have never appeared before. The basic concept is shown in Fig. 9: for a query frame (a), we can infer the query from the database (b), with the corresponding area of support (c); finally, we can find an ensemble of patches in a more flexible and efficient form (d). Related works have been applied in class-based object recognition [41,43].
Unnatural events are boundless in real-world scenes, and it is almost impossible to gather all abnormal events and tackle the problem with a classification method [12]. Some works use statistical computations [41]. For example, spatial image saliency methods are used in [54], where the authors adopted probability estimates and multi-dimensional histograms to find anomalies. Methods that compose new data from previous patches are widely used in a variety of works. Besides, queries can be estimated using several methods [34], e.g., Markov networks [45] and spatio-temporal patches [143]. To decrease the computing cost, dimensionality reduction and nearest-neighbor search are applied, which achieve robust results with small-scale training data. In addition, Leibe et al. [77] built an implicit shape model that combines identification and segmentation in a probabilistic framework. Sivic et al. [122] added geometric constraints to non-class-based descriptors. In [158], the authors proposed an improved SVD method to match image pairs. Despite their effectiveness in abnormal detection, methods in this category inherit the difficulty of composing the rules and matches.
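As a minimal, hypothetical sketch of the composition idea (not the exact algorithm of [15]), a query region can be scored by its nearest-neighbor distance to a database of patches extracted from previous normal data: patches that are far from every stored patch cannot be "composed" from the past and are flagged as anomalous. All data here are toy values.

```python
import numpy as np

def patch_anomaly_scores(query_patches, normal_patches, k=1):
    """Score each query patch by its mean distance to the k nearest
    normal patches; a large score means the patch has no support in
    the database of previously seen normal data."""
    scores = []
    for q in query_patches:
        d = np.linalg.norm(normal_patches - q, axis=1)  # distances to all normal patches
        scores.append(np.sort(d)[:k].mean())            # mean of the k smallest
    return np.array(scores)

# toy example: normal patch descriptors cluster near the origin
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.1, size=(200, 8))
queries = np.vstack([rng.normal(0.0, 0.1, size=(3, 8)),  # normal-like queries
                     np.full((1, 8), 3.0)])              # one outlier patch
scores = patch_anomaly_scores(queries, normal)
print(scores[-1] > scores[:3].max())  # the outlier gets the highest score
```

In practice the patch descriptors would be spatiotemporal gradients or learned features, and the nearest-neighbor search would use an approximate index to keep the cost low, as the dimensionality-reduction discussion above suggests.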

Deep hybrid model
Deep learning has the advantage of exploring intricate relationships in multi-dimensional data, and it has consistently set new records in many fields, such as computer vision and information retrieval. Meanwhile, machine learning methods have had a long period of development and have accumulated much knowledge about human behavior detection that is universal in the field. At present, research has begun to focus on transferring machine learning methods to deep learning methods, namely deep hybrid models, which achieve better results.
The most common strategy is to use a deep neural network for feature extraction and then feed the features into classic machine learning algorithms [38]. Commonly used deep neural networks include autoencoder (AE) artificial neural networks [7], LSTM networks [38] and one-class neural networks [20]. Given that video data contain a small amount of labeled data and a large amount of unlabeled data, semi-supervised methods are applied to anomaly detection in intelligent video surveillance [57,121]. For example, Shin et al. [118] proposed a hybrid learning method in which the feature extractor is trained by a GAN and anomaly detection is improved by transferring the extractor. In another study, Du et al. [32] proposed a wireless vision sensor network. Hu et al. [58] put forward a spatial-temporal CNN, and the deep features were passed into a least-squares SVM to implement classification. Admittedly, in many cases, hybrid learning systems have the limitation that end-to-end learning for detecting abnormal behaviors is hard to achieve.
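The hybrid pipeline described above, deep features followed by a classical one-class decision, can be sketched schematically. Here a fixed random projection stands in for the deep feature extractor, and a Gaussian (Mahalanobis-distance) model of the normal features plays the role of the classical detector; all names, shapes, and numbers are illustrative, not any cited method.

```python
import numpy as np

rng = np.random.default_rng(1)
W = 0.1 * rng.normal(size=(16, 4))   # stand-in for a trained deep feature extractor
def extract(X):
    return np.tanh(X @ W)            # nonlinear "deep" feature map

normal_train = rng.normal(0.0, 1.0, size=(500, 16))
F = extract(normal_train)

# classical stage: Gaussian model of normal features, Mahalanobis anomaly score
mu = F.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(F, rowvar=False) + 1e-6 * np.eye(4))
def score(X):
    D = extract(X) - mu
    return np.einsum('ij,jk,ik->i', D, inv_cov, D)

tau = np.quantile(score(normal_train), 0.99)        # threshold from normal data only
test_normal = rng.normal(0.0, 1.0, size=(20, 16))
test_anomaly = rng.normal(6.0, 1.0, size=(20, 16))  # shifted, "abnormal" inputs
print(score(test_anomaly).mean() > score(test_normal).mean())
```

The key property is that the detector is fitted on normal data only, so the threshold tau does not require labeled anomalies, which matches the semi-supervised setting described above.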
Following the success of transfer learning in acquiring strong features from models pretrained on large-scale data, deep hybrid models have also adopted pre-trained networks for feature extraction with good results [108]. For example, the 19-layer Visual Geometry Group network (VGGNet-19) was used for transfer learning in [3,10]. One of the most common deep learning architectures is the CNN, because CNNs are quite effective at processing unstructured raw data [92,111]. Deep hybrid approaches will keep expanding the scale of the model to match the intricacy of the data, and therefore need large amounts of data to fit a more efficient model. Some deep hybrid model methods are presented in Table 6.

Performance Evaluation
In the preceding sections, representative approaches have been reviewed. Performance evaluation is critical for measuring the validity of a proposed method and for comparing it with other methods.

Datasets
A large number of abnormal behavior recognition methods have been proposed, showing that this is a hot topic, and there is a growing need for popular datasets for intelligent video surveillance systems. In this section, we review the commonly used datasets for abnormal event detection. An overview of all listed datasets is provided in Table 7, together with the websites of the data sources. Some related new datasets published in recent years are given in [6,33,36,105].

Evaluation Metrics
To better examine the strengths and weaknesses of anomaly event detection methods and their applicability to different scenes/tasks of interest, it is necessary to evaluate performance using suitable metrics. Evaluation of anomaly event detection methods can be carried out at different levels: pixel level, frame level, and object level [101]. The basic measures are binary decisions, as shown in Table 8.
_ Pixel level. Pixel-level evaluation considers each pixel individually: a detected anomalous frame is counted as anomalous only if at least 40% of the truly abnormal pixels are detected. The basic measures are TPR, FPR, TNR, and FNR (shown in Table 8).
_ Frame level. Frame-level evaluation marks a frame as an anomaly if at least one pixel is detected as anomalous. However, it ignores the spatial localization of anomalies, so it may miss a true anomaly or mislabel a normal pixel as an anomaly. To avoid such misses and mislabels, some related works use the pixel-level ground truth:
Frame is counted as true positive if TP / (TP + FN) ≥ α, (8)
Frame is counted as false positive if TP / (TP + FN) < α, (9)
where α is commonly set to 0.4, meaning the scheme requires at least 40% overlap between the prediction and the ground truth.
Information Technology and Control 2021/3/50 536
Another related evaluation metric is the dual-pixel criterion, which improves frame-level evaluation with localization by penalizing the number of FP: one frame is detected as an anomaly if at least α% of the ground-truth anomaly pixels are detected and at least β% of the predicted anomaly pixels are truly anomalous. Similarly, the intersection-over-union measure is widely used and also accounts for TP and FP: one frame is detected as an anomaly if the ratio of the intersection of predicted and ground-truth anomaly pixels to their union is equal to or greater than the threshold γ. The measures are as follows:
Frame is counted as true positive if TP / (TP + FP + FN) ≥ γ, (10)
Frame is counted as false positive if TP / (TP + FP + FN) < γ, (11)
where γ is commonly set to 0.5.
_ Object level. Object-level evaluation considers not only the frame level but also the spatial location: a detection is counted as correct if the intersection of the detected abnormal area and the true abnormal area, divided by their union, exceeds a threshold ν.
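For concreteness, the overlap rules of Eqs. (8)-(11) can be sketched with hypothetical helper functions (α = 0.4 and γ = 0.5 as in the text; masks are toy 4×4 examples):

```python
import numpy as np

def frame_decision(pred_mask, gt_mask, alpha=0.4):
    """Frame-level rule of Eqs. (8)-(9): a detected frame counts as a true
    positive only if at least alpha of the ground-truth anomalous pixels
    are covered by the prediction (assumes gt_mask has anomalous pixels)."""
    tp = np.logical_and(pred_mask, gt_mask).sum()
    fn = np.logical_and(~pred_mask, gt_mask).sum()
    return 'TP' if tp / (tp + fn) >= alpha else 'FP'

def frame_decision_iou(pred_mask, gt_mask, gamma=0.5):
    """IoU variant of Eqs. (10)-(11): TP requires intersection over union
    of predicted and ground-truth anomaly pixels to reach gamma."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return 'TP' if inter / union >= gamma else 'FP'

gt = np.zeros((4, 4), dtype=bool); gt[:2, :2] = True   # 4 truly anomalous pixels
good = np.zeros_like(gt); good[:2, :2] = True          # full overlap with gt
poor = np.zeros_like(gt); poor[3, 3] = True            # no overlap with gt
print(frame_decision(good, gt), frame_decision(poor, gt))          # prints: TP FP
print(frame_decision_iou(good, gt), frame_decision_iou(poor, gt))  # prints: TP FP
```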
To characterize the behavior of an anomaly detection algorithm in various situations, the common practice is to calculate the measures across all cases and plot performance curves from the resulting set, e.g., receiver operating characteristic (ROC) curves and the area under the curve (AUC).
The ROC is a curve of TPR versus FPR, which indicates how the number of correctly detected anomalies varies with the number of normal samples incorrectly detected as abnormal. However, it ignores differences in class sizes, which may lead to misjudgment.
The AUC summarizes the general trend of an ROC or PR curve. Similarly, some other metrics are used to evaluate performance, such as the equal error rate (EER), i.e., the error rate at the operating point where the FPR equals the miss rate, and the accuracy (A): A = (TP + TN) / (TP + TN + FP + FN).
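As a generic illustration (not tied to any cited paper), ROC, AUC, and EER can be computed from anomaly scores by sweeping a decision threshold; the scores and labels below are toy values.

```python
import numpy as np

def roc_auc_eer(scores, labels):
    """Sweep a threshold over the scores to trace the ROC curve, then
    compute AUC (trapezoid rule) and the EER point where FPR = 1 - TPR."""
    order = np.argsort(-scores)              # high score = more anomalous
    labels = labels[order]
    tpr = np.concatenate([[0.0], np.cumsum(labels) / labels.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - labels) / (1 - labels).sum()])
    auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)
    eer = fpr[np.argmin(np.abs(fpr - (1.0 - tpr)))]
    return auc, eer

# toy scores: anomalies (label 1) tend to receive higher scores
scores = np.array([0.9, 0.8, 0.75, 0.4, 0.3, 0.2])
labels = np.array([1, 1, 0, 1, 0, 0])
auc, eer = roc_auc_eer(scores, labels)
print(round(auc, 3))  # → 0.889
```

A perfect detector would give AUC = 1 and EER = 0; here one anomaly is ranked below a normal sample, so both metrics degrade accordingly.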
For an ideal anomaly detection method in video, the EER should be as small as possible; conversely, the higher the AUC, the better the performance. Samples of the performance evaluation results from previous notable papers are shown in Table 9.
Table 9. Summary of the performance evaluation results (Note: in the "Datasets" column, 1 is the Ped1 data, 2 is the Ped2 data, 3 is the Subway entry data, 4 is the Subway exit data, 5 is the UMN data, 6 is the Avenue data, and 7 is the ShanghaiTech data, corresponding to the datasets in Table 7.)

Conclusion and Comments for Further Research
Building an intelligent abnormal human behavior detection system for video surveillance is essential to address human fatigue and inattention when monitoring many surveillance scenes over an extended period of time. Indeed, detecting abnormal behavior in video surveillance enables human observers to focus on the scenes that are most likely to contain abnormal behaviors. These technologies can support security agents by monitoring normal behaviors and detecting abnormal behaviors early in large-scale scenes.
In this review, we discussed the various abnormal human behavior detection methods. For each category of anomaly detection techniques, we described the assumptions regarding the notions of normal and abnormal data, along with the advantages and disadvantages. First, we discussed the definition of, and related surveys on, abnormal event detection in videos. Then we provided a comprehensive overview of representative approaches covering feature extraction, event modeling and detection. Finally, we exhibited the most popular datasets and evaluation metrics used for abnormal human behavior detection in intelligent video surveillance. While notable success has been achieved in this interrelated research field, more work remains to be done, as indicated next.
All these factors make abnormal behavior detection a difficult task. Related approaches attempt to build computational action models that automatically identify whether a behavior is normal or not. In particular, suspicious behaviors may have several interpretations, depending on the context, the time, and the place of the event.
One possible solution is to use a large amount of training data covering as many scenarios as possible. For scene changes, an effective solution is to choose features that are robust to scene transformations and less sensitive to object appearance. To overcome the visual restrictions of a single camera, some works use multiple cameras to acquire different views. To handle large amounts of data, it has become a trend to use effective feature selection strategies and deep learning to work efficiently. Furthermore, thanks to its strong learning ability, deep learning can achieve the best detection results; it is the future direction of development and has broad prospects. Traditional machine learning methods have, through long development, explored a great deal of knowledge related to anomaly detection in video surveillance. This knowledge is universal in the field and needs to be transferred to deep learning methods to achieve better results. In a word, anomaly detection in video is still a hot area of research, and a future survey could extend and improve this one as more mature techniques are put forward.