A Survey on Regression-Based Crowd Counting Techniques

The traditional detect-and-count strategy cannot handle extremely crowded footage in computer vision-based counting tasks. In recent years, deep learning approaches have been widely explored to tackle this challenge. By regressing visual features to a density map, the crowd count can be estimated even under heavy occlusion. This survey reviews representative regression-based crowd counting techniques.


Introduction
Crowd counting techniques are essential to ensure public safety, such as preventing stampedes in a parade or optimizing the layout of a site. To estimate the number of people within a district, a common strategy is to count the number of mobile phones connected to the base station [1]. This strategy is generally effective, but it cannot reveal the local crowd density in certain high-risk areas, such as crossroads and plazas. To address this issue, images/videos captured with fixed cameras can be exploited by computer vision techniques to count the crowd in the footage.
In the early stage, pedestrian detection techniques were utilized for crowd counting. This strategy attempts to detect every pedestrian in the scene and accumulate the results to obtain the final count. The Histogram of Oriented Gradients (HOG) feature with a Support Vector Machine (SVM) is a common approach [54]. In this approach, a window slides through the entire footage to obtain image patches. For each patch, the HOG feature is extracted and fed to the SVM to classify whether the current patch contains a pedestrian, as illustrated in Figure 1(a). Conventional approaches were further improved with deep learning-based detectors, such as YOLO, for higher detection accuracy [11]. However, as the density of the crowd increases, heavy occlusion and insufficient information for a single pedestrian significantly impact the performance, as illustrated in Figure 1(b).
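The detect-and-count strategy described above can be sketched as follows. The classifier here is a toy stand-in (an assumption for illustration only); in the actual approach it would be HOG features fed to a trained SVM.

```python
import numpy as np

def sliding_window_count(image, window=(128, 64), stride=32, classify=None):
    """Count pedestrians by sliding a window over the image and
    accumulating positive classifications (detect-and-count strategy).
    `classify` maps a patch to True (pedestrian) or False (background)."""
    h, w = image.shape[:2]
    wh, ww = window
    count = 0
    for y in range(0, h - wh + 1, stride):
        for x in range(0, w - ww + 1, stride):
            patch = image[y:y + wh, x:x + ww]
            if classify(patch):
                count += 1
    return count

# Toy classifier: a patch "contains a pedestrian" if its mean intensity
# exceeds a threshold (in practice: HOG features + SVM decision).
toy = np.zeros((256, 256))
toy[64:192, 96:160] = 1.0            # one bright "pedestrian" region
count = sliding_window_count(toy, classify=lambda p: p.mean() > 0.9)
```

As the text notes, this window-by-window scheme degrades once pedestrians overlap heavily, since no single window then contains a cleanly separated person.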
Figure 1 (a) Crowd counting with the HOG and SVM approach in [9]. (b) A crowd with extremely high density, which cannot be properly handled by detection-based approaches. Therefore, an alternative strategy is required to handle crowd counting with extremely high density.
The regression-based approach addresses this issue by learning a mapping between extracted visual features and the estimated count. These approaches model density maps from ground-truth information as the regression target. In the training phase, extracted features are regressed towards the ground-truth/pseudo density map. In the inference phase, the trained model is exploited to predict the count from extracted features. Figure 2 illustrates the methodology of the regression-based approach. For each image, all pedestrians' heads are manually annotated with crosses as ground truth. A common approach to generate the ground-truth density map D_GT is to convolve the spatial annotation with a Gaussian kernel G, which can be expressed as Equation (1).
D_GT(x) = Σ_i δ(x − x_i) * G(x), (1)

where δ(x − x_i) represents the existence of a pedestrian at spatial position x_i: the value is 1 if a pedestrian exists, otherwise it is set to 0. It can be observed that spatial positions with higher crowd density correspond to larger magnitudes on the map. Simultaneously, the feature map is extracted with a front-end backbone network such as VGG16. Extracted features are fed to the back-end regression head to generate the regressed density map D_E. The D_E illustrated in Figure 2 is generally identical to D_GT, except that part of the background is wrongly recognized as pedestrians. The predicted count C_P can then be obtained with a simple linear relation C_P = k·ΣD_E, where the ratio k can be learnt from the ground truth. The selection of the loss is also crucial, since it significantly impacts prediction performance. The back-propagation process can adopt either a Local or a Global Loss in various techniques: the local loss measures the difference between the ground-truth and estimated density maps, and the global loss measures the difference between the actual and predicted counts. Generally, the regression-based approach successfully overcomes the heavy occlusion issue that cannot be well handled by detection-based techniques, which makes it a proper candidate for crowd counting.
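The construction of D_GT in Equation (1) can be sketched as follows; the kernel size and sigma are illustrative choices, not values prescribed by any cited approach.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """2-D Gaussian kernel G, normalized to sum to 1 so that each
    annotated head contributes exactly 1 to the density map."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return g / g.sum()

def density_map(shape, heads, size=15, sigma=4.0):
    """D_GT = sum_i delta(x - x_i) * G (Equation (1)): place a unit
    impulse at each annotated head and convolve it with G."""
    h, w = shape
    d = np.zeros((h, w))
    k = gaussian_kernel(size, sigma)
    r = size // 2
    for (y, x) in heads:
        # add the kernel centred on the head, clipped at the borders
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        x0, x1 = max(0, x - r), min(w, x + r + 1)
        d[y0:y1, x0:x1] += k[r - (y - y0): r + (y1 - y),
                             r - (x - x0): r + (x1 - x)]
    return d

heads = [(20, 20), (20, 30), (40, 60)]
d = density_map((80, 80), heads)
```

Because the kernel integrates to 1, the map integrates to the head count, which is what makes the linear relation between ΣD_E and C_P sensible.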
However, the above-mentioned architecture is only a primal procedure of the regression-based approach. Since the feature map is regressed to D_E, its quality directly impacts the loss calculation and the counting result. Thus, research has been conducted to optimize the feature extraction process with various tactics. (1) The first challenge is that pedestrians' heads often have different sizes within a footage due to perspective. To extract the most appropriate features with a CNN-based network, multiple kernels with various scales are adopted. In this circumstance, approaches [56,38,15] attempt to devise novel networks to model better feature maps. (2) As a matter of fact, the 'ground-truth' density map in Figure 2 is not the real ground truth but a generated one. Thus, the generating process of D_GT is optimized in some approaches. Furthermore, to entirely bypass the impact of an inaccurate D_GT, some approaches [41,26,46,18] attempt to regress features into head positions instead of a density map before calculating the loss.
(3) The regression head requires a tremendous amount of labelled data for training. However, the manual labelling of data for crowd counting is extremely exhausting. Therefore, the approach [29] attempts to achieve satisfying performance with a weakly-supervised technique and limited labelled data. (4) The vision transformer has advantages such as a global attention mechanism and weak supervision. Some approaches [16,44,18] exploit the transformer to encode the feature map into sequences instead of a density map, then predict the total count directly. (5) Features from sources other than images are expected to strengthen the counting result. For example, the approach [20] incorporates thermal information into CSRNet [15] to improve the accuracy.
2 The unbalanced prediction of the density map: By analyzing the density map D_E predicted by the regression network, it is not hard to find that crowd areas with ordinary density often get the most accurate prediction, while the area with the highest density tends to be over-estimated and the area with the lowest density tends to be under-estimated. Efforts [53,14,42] have been made to address this issue. For example, the strategy of the Learning to Scale (L2S) module [53] is to segment and rescale the most crowded area.
The rescaled area has a sparser density level and yields a more accurate prediction. The Attention Scaling Network (ASNet) [14] segments D_E according to density levels and applies scaling factors to each area to adjust the unbalanced prediction within D_E. The L2S and ASNet attempt to adjust density values in certain areas, which is a coarse approach. The Scale-Adaptive Selection Network (SASNet) [42] applies the so-called Pyramid Region Awareness Loss, which refines the adjustment to the pixel level and yields better predictions.
3 Modeling the loss function: A proper loss function can update network parameters more effectively in the back-propagation process and generate a better model. (1) The most common way is to adopt the L_1 or L_2 norm for either the global or the local loss. However, due to the uneven distribution of the crowd, these ordinary losses cannot well handle the area with the highest density. Therefore, some approaches implement a pyramid strategy in the loss calculation to improve performance. The ASNet iteratively divides D_E and calculates the loss on patches whose total density is lower than a certain threshold. The SASNet applies a similar strategy, but upgrades the approach to the pixel level. (2) The essence of the typical density map-based approach is mapping density values to a certain count. However, the pedestrian's head position can also be predicted according to the probability of each point on D_E. Therefore, the Bayesian loss is adopted by various approaches [26,46,18] and achieves sound results. The probability-based approach directly predicts head positions, which is an advantage the typical approach does not have.
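The Local/Global Loss distinction can be illustrated minimally as follows; using the squared L2 distance for the local loss is an arbitrary choice for the sketch.

```python
import numpy as np

def local_loss(d_est, d_gt):
    """Local loss: pixel-wise L2 distance between the estimated and
    ground-truth density maps."""
    return float(np.sum((d_est - d_gt) ** 2))

def global_loss(d_est, d_gt):
    """Global loss: difference between the predicted and actual counts,
    obtained by integrating each density map."""
    return float(abs(d_est.sum() - d_gt.sum()))

d_gt = np.zeros((4, 4)); d_gt[1, 1] = 1.0
d_est = np.zeros((4, 4)); d_est[2, 2] = 1.0   # right count, wrong place
```

On this example the global loss is zero (the counts agree) while the local loss is not, which is exactly why density-map approaches penalize spatial misplacement that a pure count loss cannot see.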
In this article, we review and analyze the most representative approaches and their innovations based on the three above-mentioned topics.

Single-Column Outperforms Multi-Column
The Switch-CNN [2] adopts a hybrid strategy of Crowd-CNN and MCNN. By setting a switch classifier before the multi-column backbone, divided image patches are fed to a single selected column according to the decision of the switch. Although the Switch-CNN is not an end-to-end network, it still outperforms the MCNN. This observation indicates that the multi-column structure does not always provide more 'valid' features than a single column.
Besides, the primary defect of the multi-column network structure is the low training efficiency of each column. Moreover, experimental results indicate that the actual behaviors of the columns are nearly identical, which fails to reveal the nature of different density levels as expected. One strategy to address this issue is replacing the MCNN's multiple columns with a single-column structure, to increase the efficiency of feature extraction.

The Inaccurate 'Ground Truth' Density Map
According to the architecture illustrated in Figure 3, the ground-truth density map D_GT is obtained by convolving the annotated head positions with the kernel G. Since D_GT is used to calculate the loss, it directly impacts the accuracy of the network. Ideally, D_GT should objectively describe the actual distribution of the crowd. However, when a footage contains pedestrians both far from and near to the camera, the number of pixels occupied by their heads varies. If the scale of the kernel G is fixed, the obtained D_GT will barely be optimal. Naturally, various scales of G can be applied to generate the density map for different head sizes, which leads to the issue to be addressed: the strategy to implement kernels with various scales.
A common approach is to select the variance of G for pedestrian x_i according to the average distance of his/her m nearest neighbors. Assuming the m nearest distances to x_i are {d_i^1, d_i^2, …, d_i^m}, the average distance can be expressed as Equation (2):

d̄_i = (1/m) Σ_{j=1}^{m} d_i^j. (2)

Therefore, the optimized D'_GT can be generated from the kernel G_σ with dynamic variance σ as Equation (3):

D'_GT(x) = Σ_i δ(x − x_i) * G_{σ_i}(x), with σ_i = β·d̄_i, (3)

where β is a scaling factor.
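The per-head variance of Equations (2)-(3) can be sketched as follows; the default value of the proportionality factor beta is an assumption (a common choice in the literature), not something specified in the text.

```python
import numpy as np

def adaptive_sigmas(heads, m=3, beta=0.3):
    """Per-head Gaussian variance: sigma_i = beta * d_bar_i, where
    d_bar_i (Equation (2)) is the average distance to the m nearest
    annotated neighbours of head i."""
    heads = np.asarray(heads, dtype=float)
    # pairwise distances between all annotated head positions
    diff = heads[:, None, :] - heads[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)            # exclude the head itself
    nearest = np.sort(dist, axis=1)[:, :m]    # m nearest distances
    d_bar = nearest.mean(axis=1)              # Equation (2)
    return beta * d_bar                       # sigma_i for Equation (3)

# A dense cluster gets small sigmas; the isolated head gets a large one,
# matching the intuition that nearby heads appear smaller in the image.
pts = [(0, 0), (0, 2), (2, 0), (50, 50)]
sig = adaptive_sigmas(pts, m=2)
```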
Ablation experiments indicate that using D'_GT yields better predictions, and further research explored ways to generate D'_GT with higher quality. The Point to Point Network (P2PNet) [41] proposed a novel strategy to eliminate the impact of the inaccurate ground-truth density map. Instead of regressing features toward D_GT, the P2PNet directly regresses the VGG16-extracted feature map toward the head points δ(x_i). A parallel classification branch is then applied to provide confidence scores for the predicted head points. Its loss comprises both regression and classification losses. This strategy enabled P2PNet to achieve the highest performance until 2022, to the best of our knowledge.

Weak/semi-supervised Solutions
The training of regression-based approaches usually requires a large amount of annotated data. However, the manual annotation of ground-truth data for crowd counting is extremely exhausting: usually thousands of people need to be labelled in a single image. For example, the benchmark ShanghaiTech A dataset [56] includes 482 images with, on average, more than 1000 pedestrians per image. Therefore, weakly/semi-supervised approaches [10,29,30,59] have been proposed to address this issue. The strategy of the weakly-supervised solutions is to adopt small-sample approaches such as the Transformer [10]. The approaches using transformers will be introduced in Section 2.5; we introduce the semi-supervised solution here.
The Mean Teacher [43] is a semi-supervised network based on Temporal Ensembling and the Π model, proposed in 2017. The Mean Teacher is composed of a double-route teacher/student network, as in typical semi-supervised approaches. Unlike others, it updates the parameters of the teacher network with an exponential moving average (EMA) of the student's parameters to enhance performance, instead of replicating them directly.
For the problem of crowd counting, labelled data are fed to the student network and unlabelled data are fed to the teacher network. Since the teacher's parameters are updated from the student's, once trained, both networks can be used to predict the density map. Thus, semi-supervised training is achieved. For further optimization, Semi [29] adopted the Mean Teacher as the baseline structure, together with binary segmentation.
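The EMA update that distinguishes the Mean Teacher from a plain copy can be sketched as follows; the smoothing factor alpha is illustrative.

```python
def ema_update(teacher, student, alpha=0.99):
    """Mean-teacher parameter update: each teacher weight becomes an
    exponential moving average of the corresponding student weight,
    rather than a direct replica."""
    return {k: alpha * teacher[k] + (1.0 - alpha) * student[k]
            for k in teacher}

teacher = {"w": 0.0}
student = {"w": 1.0}
for _ in range(3):      # the teacher drifts slowly toward the student
    teacher = ema_update(teacher, student, alpha=0.9)
```

The slow drift is the point of the design: the teacher averages over many student states, which smooths out the noise of individual training steps.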
Observations indicate that spatial information can be utilized to segment the crowd from the background [57], which improves counting performance. Therefore, Semi exploits binary segmentation to estimate the uncertain spatial regions from the regressed density map. The uncertainty map is obtained by calculating the entropy of the density map and filtering it with a threshold. With the uncertainty map, the density values in the background area are removed.
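A possible sketch of the entropy-based uncertainty map follows; the exact normalization and threshold used in Semi may differ, so treat this only as an illustration of the idea.

```python
import numpy as np

def uncertainty_mask(density, threshold=0.5):
    """Treat each pixel's density value as a foreground probability p,
    compute the binary entropy H = -p*log(p) - (1-p)*log(1-p), and mark
    pixels with entropy above `threshold` as uncertain."""
    p = np.clip(density, 1e-6, 1 - 1e-6)   # avoid log(0)
    h = -p * np.log(p) - (1 - p) * np.log(1 - p)
    return h > threshold

# Confident background (0.01) and confident crowd (0.99) have low
# entropy; ambiguous values near 0.5 are flagged as uncertain.
d = np.array([[0.01, 0.5], [0.99, 0.45]])
mask = uncertainty_mask(d)
```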

Replacing CNN with Transformer
The mechanism of CNN-based approaches for crowd counting determines that their receptive field is often local, owing to the limited scales of the convolutions. Despite being initially devised for natural language processing tasks, like the LSTM, the Transformer [45] proposed by Google has been adopted by various computer vision tasks [7,4,58]. Due to its attention mechanism, the Transformer possesses a global receptive field, which can directly estimate the total crowd number instead of accumulating local predictions. Specifically, it was first adopted by the Detection Transformer (DETR) [4], which utilizes a CNN for feature extraction and a Transformer for classification. The Vision Transformer [7] is the first to purely exploit the Transformer for image classification, by sequencing patches with the encoder instead of extracting convolutional features. The Transformer is also used to generate the pre-trained image model in the Segmentation Transformer (SETR) [58]. Detailed introductions to exploiting the Transformer for image processing can be found in [10,31,19]. Another feature of the Transformer, shared with other language processing networks such as the RNN, is that it can be weakly supervised: the Transformer can be pre-trained on a large unannotated dataset and then fine-tuned on a small annotated dataset. This weakly-supervised feature makes the Transformer another candidate for handling the crowd counting task with limited annotated data.
The TransCrowd [16] adopted the pure Transformer to achieve crowd counting. As illustrated in Figure 4, its general strategy is to encode the divided patches into vector sequences as the input and feed them to the encoder. Then, the encoded sequences, processed with either a regression token or global average pooling, are used to predict the count with an ordinary regression head instead of a decoder. Specifically, the image patches are first flattened into N sequences, represented as {x_i | i = 1, …, N}. Next, each x_i is mapped with a learnable matrix E into a latent D-dimensional embedding feature. Furthermore, the spatial information {p_i | i = 1, …, N} is integrated as well to generate the input Z_0 for the encoding process. The encoder comprises multiple layers of Multi-head Self-Attention (MSA) and Multilayer Perceptron (MLP). The output Z_l of the l-th layer can be expressed as Equation (5), where LN represents the layer normalization process:

Z'_l = MSA(LN(Z_{l−1})) + Z_{l−1}, Z_l = MLP(LN(Z'_l)) + Z'_l. (5)
The MSA computes m self-attention heads SA_m, which are combined with a reprojection matrix W_o, expressed as Equation (6):

MSA(Z) = [SA_1; SA_2; …; SA_m] W_o. (6)

Each SA_m is obtained with the typical Query (Q) / Key (K) / Value (V) paradigm of the classic transformer. The MLP uses 2 linear layers with the GELU [6] activation function to expand and shrink the embedding dimension of the feature.
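The multi-head combination with the reprojection matrix W_o can be sketched in NumPy as follows; the weight matrices are random stand-ins for learnt parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(Z, Wq, Wk, Wv, Wo, m):
    """MSA: run m self-attention heads SA_m on the token sequence Z
    and combine their concatenation with the reprojection matrix W_o."""
    n, d = Z.shape
    dh = d // m                                  # per-head dimension
    heads = []
    for i in range(m):
        s = slice(i * dh, (i + 1) * dh)
        Q, K, V = Z @ Wq[:, s], Z @ Wk[:, s], Z @ Wv[:, s]
        A = softmax(Q @ K.T / np.sqrt(dh))       # Q/K/V attention map
        heads.append(A @ V)                      # head output SA_i
    return np.concatenate(heads, axis=1) @ Wo    # reprojection with W_o

rng = np.random.default_rng(0)
n, d, m = 4, 8, 2
Z = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
out = multi_head_self_attention(Z, Wq, Wk, Wv, Wo, m)
```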
The obtained Z_l is further processed with either the Regression Token or Global Average Pooling before being sent to the regression head. The regression-token procedure attaches an additional token Z_0^t to Z_0; after the MSA and MLP layers, Z_l^t contains the global semantic crowd information. This strategy is adopted by BERT [5] as well. Alternatively, applying global average pooling to Z_l generates the pooled visual tokens Z_l^p. Since Z_l^p has more discriminative semantic patterns, it obtains better performance than using Z_0^t. Similarly, CCTrans [44] also adopted the transformer as the feature extraction backbone. Unlike TransCrowd, the patches x_i of the input image are flattened into a single sequence, and then a learnable projection is applied to obtain the input sequence Z_{l−1} for the l-th layer. For the encoding backbone, the Twins network [18] is adopted. The Twins can perceive both local and global receptive fields via alternated local and global attentions, namely the Spatially Separable Self-Attention (SSSA) module. Specifically, for the local attention, the Locally-grouped Self-Attention (LSA) and MLP are applied to LN(Z_{l−1}); for the global attention, the Global Sub-sampled Attention (GSA) and MLP are further applied to obtain the feature sequence Z_l for the regression head.
The Multifaceted Attention Network (MAN) [18] also attempts to incorporate local attention into the transformer. Unlike TransCrowd and CCTrans, the MAN first uses VGG19 to obtain the feature map. During the encoding phase, the MAN proposes a Learnable Region Attention (LRA) mechanism to optimize the local values within the final attention. After the region mask R_m is obtained with the LRA, the regional attention A_reg is computed by masking the attention map with R_m,
where • denotes the Hadamard product used for the masking. The global attention A_glb is expressed in the ordinary transformer way. Note that A_reg and A_glb share the same value vectors V.
The final attention A can be obtained as:

A = A_reg + A_glb. (10)

In summary, the core strategy of TransCrowd, CCTrans and MAN is to strengthen the global attention by involving local attention during the encoding phase. The TransCrowd divides the image into patches and models the self-attention SA of each patch as local attention; the sequence of local attentions is then fed to the regression head. The CCTrans encodes the image into a single sequence and perceives local attention before global attention. The MAN obtains the regional and global attentions separately, then integrates them into the final attention. Despite adopting different encoding procedures, all approaches achieve sound results in the experiments.
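The combination in Equation (10) can be sketched as follows. The region mask here is a fixed toy mask, whereas the MAN learns it with the LRA, so this only illustrates the masking-and-summing structure.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def combined_attention(Q, K, V, region_mask):
    """Final attention A = A_reg + A_glb: the regional part restricts
    the attention map with a region mask via the Hadamard product, the
    global part is ordinary attention, and both share the values V."""
    d = Q.shape[1]
    A = softmax(Q @ K.T / np.sqrt(d))    # ordinary attention map
    A_glb = A @ V                        # global attention output
    A_reg = (A * region_mask) @ V        # Hadamard-masked regional part
    return A_reg + A_glb                 # Equation (10)

rng = np.random.default_rng(1)
n, d = 4, 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
mask = np.eye(n)                         # toy mask: attend to self only
out = combined_attention(Q, K, V, mask)
```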

Features from Other Sources
All the above-mentioned approaches exploit only visual features extracted from images or video frames. Research [20,23,52] attempts to improve counting performance by utilizing features from other sources. Multi-modal techniques include multi-modal representation, translation, alignment, fusion and co-learning. Multi-modal fusion integrates various types of information to give a prediction. A typical application of multi-modal fusion in image processing is visual-audio recognition, which extracts visual and audio features to perform personal identification.
The Information Aggregation Distribution Module (IADM) [20] devised a multi-modal approach which incorporates thermal information with visual features. In the Information Aggregation Transfer phase, 3 branches of CSRNet are used to extract visual, thermal, and modality-shared features. The modality-shared feature describes the complementary information between the visual and thermal features. In the Information Distribution Transfer phase, the contextual information obtained from the modality-shared feature is used to refine the thermal and visual features for the regression of the density map.

The Problem of Unbalanced Density Estimation
The ASNet [14] observed a phenomenon whereby the sparse areas in the regressed density map D_E often yield a smaller predicted count than the ground truth, while the dense areas in D_E often yield a larger count. Therefore, the performance of existing regression-based approaches is significantly impacted on datasets with a wide density range, such as UCF-QNRF [13] and UCF_CC_50 [12]. The strategy to address this issue is rescaling the regions within the image to the same density level before/after the prediction. Two approaches have been proposed to handle this issue.

Rescaling Regions of the Image into Identical Density Level Before Prediction
The L2S [53] attempts to locate regions with high density, and rescales these regions until their density is comparable to that of the sparser regions. The partially rescaled image has a more evenly distributed density, and the prediction is expected to be more accurate. As illustrated in Figure 5 (the density map re-scaling process proposed in L2S [53]), an initial density map is first predicted with the regression-based network. Next, the dense region is selected by a threshold, and the L2S module is exploited to generate a scale factor. The dense region is then rescaled with this factor and fed to the network again to obtain the optimized local prediction. This approach achieved the best performance in the 2019 crowd counting game CV101.

Predicting the Density Map and Applying Factors to Regions with Different Density Level for the Magnitude Adjustment
The ASNet [14] devised a post-processing mechanism to optimize D_E: as illustrated in Figure 6, it segments D_E according to density levels and applies scaling factors to each area. The above-mentioned approaches attempt to align the density of regions within the image, which boosts the performance on certain datasets. However, when such an approach is applied in practice, the data for prediction can come from various sources and datasets. The experimental results of the Scale Distribution Alignment (SDA) [25] indicate that the performance of a state-of-the-art approach [26] suffered a substantial decrease when testing images from its own training dataset were simply rescaled. Images with different scales from other datasets can be expected to cause an even greater impact on prediction.
The strategy of SDA is to align the image scales of multiple datasets with learnt rescaling factors. In this approach, the Scale Distribution Network (SD-Net) is devised to estimate the scale distribution of each image. Next, images are divided into patches and aligned by re-scaling them with the optimal translation factor. The factor for each patch is calculated from its actual distribution and the Wasserstein barycenter of the estimated scale distributions. With this process, the scale distributions of 4 benchmark datasets are aligned to the same level. The ablation experiments show that the aligned datasets outperform the originals with main-stream approaches such as CSRNet and BL [26].

Basic Loss
The loss function evaluates the difference between the predicted and ground-truth counts, and is exploited to adjust the network weights in the back-propagation process. Therefore, an appropriate selection of the loss function can directly boost prediction accuracy. The straightforward way is to calculate the L_1 norm between the estimated density map D_E and the ground-truth density map D_GT, which can be expressed as ||D_E − D_GT||_1 or the following equation, where M is the total number of images for training:

L = (1/M) Σ_{i=1}^{M} ||D_E^i − D_GT^i||_1. (11)
Since transformer-related approaches encode features into sequences instead of density maps, D_E and D_GT are replaced with the predicted count C_P and the actual count C_GT. As stated in Section 1, we denote the density-map-generated loss as the Local Loss, and the count-generated loss as the Global Loss.
To improve the prediction performance, studies [44,14,42,26,46,18] have been conducted to further optimize the loss function. The Smooth L1 is a variant of the L1 norm, often adopted to handle the exploding gradient problem. It is a hybrid of L1 and L2, and can be expressed as Equation (13):

$\mathrm{Smooth}_{L_1}(x)=\begin{cases}|x|-0.5, & |x|>1\\ 0.5\,x^2, & |x|\le 1\end{cases} \qquad (13)$

CCTrans [44] uses the Smooth L1 as its loss and claims better performance than the ordinary L1.
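Equation (13) translates directly to code; the sketch below is element-wise, so it can be applied to a whole density-map residual at once:

```python
import numpy as np

def smooth_l1(x: np.ndarray) -> np.ndarray:
    """Smooth L1 per Equation (13): quadratic (L2-like) near zero,
    linear (L1-like) for |x| > 1, so gradients stay bounded."""
    ax = np.abs(x)
    return np.where(ax > 1.0, ax - 0.5, 0.5 * x ** 2)

vals = smooth_l1(np.array([0.0, 0.5, 2.0]))
```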

Pyramid Loss
As mentioned in Section 3, unbalanced density impacts the accuracy of the count prediction: regions with an identical density level still have different density values, and this issue also affects the loss calculation during training. Therefore, ASNet [14] devised the Adaptive Pyramid Loss (APL) to handle the unbalanced density within the predicted density map. By iteratively dividing any region whose count is larger than a threshold T into 4 subregions, the APL L_AP can measure the difference more accurately in extremely crowded regions. As Figure 7 illustrates, the density map k is divided into 4 subregions R_1~R_4 in the first iteration i_1, the loss l of each subregion is computed per Equation (12), and L_AP accumulates these subregion losses over all iterations.
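The iterative quadrant division behind APL can be sketched as follows; the threshold, the maximum depth, and the stopping rules are illustrative assumptions, not ASNet's published settings:

```python
import numpy as np

def adaptive_pyramid_regions(d_gt, threshold, max_depth=3):
    """Recursively split any region whose GT count exceeds `threshold`
    into 4 quadrants (the APL division scheme); returns the final
    list of (row0, row1, col0, col1) regions to compute losses on."""
    regions = []
    def split(r0, r1, c0, c1, depth):
        count = d_gt[r0:r1, c0:c1].sum()
        if depth >= max_depth or count <= threshold or r1 - r0 < 2 or c1 - c0 < 2:
            regions.append((r0, r1, c0, c1))
            return
        rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
        for a, b in [(r0, rm), (rm, r1)]:
            for c, d in [(c0, cm), (cm, c1)]:
                split(a, b, c, d, depth + 1)
    split(0, d_gt.shape[0], 0, d_gt.shape[1], 0)
    return regions

dense = np.ones((4, 4))            # total count 16
regs = adaptive_pyramid_regions(dense, threshold=4)
```

The per-region losses would then be averaged into L_AP, giving crowded regions finer-grained supervision.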


Similarly, SASNet devised the Pyramid Region Awareness Loss (PRAL) to handle the over-estimated values of the density map. This approach divides the predicted density map into 4 subregions and locates the most over-estimated subregion. Next, the located subregion is iteratively divided down to pixel level. All selected pixels are collected as a hard pixel set H. The PRAL L_PRAL can then be modelled as Equation (16), where γ is a weight term.

Bayesian Loss
The ground-truth density map is generated from the head annotations with a fixed/dynamic Gaussian kernel. Either way, the density map is not a real 'ground truth'. In Section 2.3, approaches attempt to model the most accurate ground-truth density map. On the other hand, some approaches aim to tackle this issue by skipping the density map altogether. The Bayesian Loss (BL) [26] exploits the probability of every spatial location belonging to each head annotation to calculate the Bayesian loss L_Bayes of the entire image. In this case, each pixel on the feature map is mapped to all head annotations with probabilities, and L_Bayes can be expressed as Equation (17):

$L_{Bayes} = \sum_{n=1}^{N} \left| 1 - E[c_n] \right| \qquad (17)$
where N is the total head annotation count within the image, E[•] is the expectation, and c_n denotes the count of pixels x_i belonging to the n-th annotation. Since each head location is annotated with a single pixel, the ground-truth expectation E[c_n] is 1.
Considering that background pixels can be far away from any pedestrian's head, they should not be mapped to any annotation. Therefore, BL improves the Bayesian loss into L_Bayes+ by adapting the expectation of background pixels.
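A minimal sketch of the Bayesian loss, assuming a Gaussian likelihood of bandwidth sigma around each annotated head and an L1 distance in Equation (17); `pixels` holds per-pixel coordinates and `d_est` the predicted density at those pixels (all names illustrative):

```python
import numpy as np

def bayesian_loss(d_est, pixels, heads, sigma=1.0):
    """BL sketch: posterior p(y_n | x_i) from Gaussian likelihoods
    around each head, E[c_n] = sum_i p(y_n|x_i) * D_E(x_i),
    loss = sum_n |1 - E[c_n]| (each head's GT count is 1)."""
    d2 = ((pixels[:, None, :] - heads[None, :, :]) ** 2).sum(-1)  # (P, N)
    lik = np.exp(-d2 / (2 * sigma ** 2))
    post = lik / lik.sum(axis=1, keepdims=True)                   # p(y_n | x_i)
    e_cn = (post * d_est[:, None]).sum(axis=0)                    # E[c_n]
    return float(np.abs(1.0 - e_cn).sum())

pixels = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
heads = np.array([[0.0, 0.0]])                 # single annotated head
d_est = np.array([1.0, 0.0, 0.0, 0.0])         # predicted count sums to 1
loss = bayesian_loss(d_est, pixels, heads)
```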

The BL does not consider the cost of mapping pixels to annotations, which is referred to as the Transport Cost in the Generalized Loss (GL) [46]. For example, for a crowd far from the camera, heads are more compact, so the transport cost should be higher to produce a higher loss value. The GL introduces a generalized loss based on a hybrid of multiple loss functions and the transport cost.
$L_C = \min_{P}\ \langle C, P \rangle - \epsilon H(P) + \tau \left\| P\mathbf{1} - \hat{a} \right\|_1 + \tau \left\| P^{\top}\mathbf{1} - b \right\|_1 \qquad (19)$

Here C is the transport cost; by minimizing ⟨C, P⟩, the predicted density is pushed toward the annotations. The entropic regularization term H(•) makes the density distribution sparser. GL later proved that L1 and L_Bayes are special and suboptimal cases of L_C.
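Equation (19) is an entropically regularized optimal-transport objective. A minimal Sinkhorn iteration for the balanced special case (fixed marginals a and b, entropic weight eps) can be sketched as follows; the unbalanced marginal-relaxation terms used by GL are omitted, and all names are illustrative:

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, iters=200):
    """Approximately solve min_P <C, P> - eps*H(P) subject to
    P @ 1 = a and P.T @ 1 = b, via Sinkhorn scaling."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]   # transport plan P

a = np.array([0.5, 0.5])
b = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0], [1.0, 0.0]])   # cheap diagonal transport
P = sinkhorn(a, b, C)
```

Minimizing ⟨C, P⟩ concentrates mass on low-cost matches, which is how GL pushes predicted density toward the annotations.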

The MAN attempts to address the over-estimation issue, as in [14,42], using the Bayesian loss. To suppress false positive predictions, MAN proposed the Instance Attention Loss L_IA, which adapts an instance attention mask m to prune L_Bayes values larger than a threshold δ. This means that if δ is larger than the calculated Bayesian loss of 80% of the annotated points, 20% of the points will be pruned in the back-propagation process. Therefore, the issue of over-estimated prediction can be handled when δ is properly set.
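The 80%/20% pruning rule can be sketched as a quantile mask over per-point Bayesian losses; `keep_ratio` and the function name are illustrative assumptions:

```python
import numpy as np

def instance_attention_mask(per_point_loss, keep_ratio=0.8):
    """MAN-style pruning: keep the `keep_ratio` fraction of annotated
    points with the smallest Bayesian loss; mask out the rest so they
    contribute no gradient (delta is the loss at that quantile)."""
    delta = np.quantile(per_point_loss, keep_ratio)
    return (per_point_loss <= delta).astype(float)

losses = np.array([0.1, 0.2, 0.3, 0.4, 5.0])   # one outlier point
mask = instance_attention_mask(losses, keep_ratio=0.8)
```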

Experiments and Discussions

Benchmarking Datasets

Standard datasets are used to compare the performance of different approaches. The following section introduces the mainstream datasets.
The Shanghai Tech (SHT) dataset [56] was proposed with the MCNN. It soon became the first benchmarking dataset and is adapted by most of the regression-based approaches. The SHT comprises subsets A and B, with pedestrians manually annotated in all image samples. Subset A contains 300 training and 182 testing samples collected from the Internet. Subset B contains 400 training and 316 testing samples collected from street cameras. The crowd density in subset A is significantly higher than in subset B; therefore, some approaches do not use subset B.

The JHU-Crowd++ dataset contains 4,372 images and 1.51 million annotations, divided into 2,772 training and 1,600 testing images. Its images are carefully collected to cover adverse weather conditions. The WorldExpo'10 [55] is a video dataset in which partial frames are annotated, with 199,923 pedestrians in total. Its training set contains 3,380 frames from 103 scenes, and its testing set contains 600 frames from 5 scenes.


Evaluation Metrics
The evaluation of an approach is straight-forward: whether the predicted count matches the actual crowd number. The Mean Absolute Error (MAE) and the (root) Mean Squared Error (MSE) are the standard metrics:

$MAE = \frac{1}{M}\sum_{i=1}^{M}\left|C_i^P - C_i^{GT}\right|, \qquad MSE = \sqrt{\frac{1}{M}\sum_{i=1}^{M}\left(C_i^P - C_i^{GT}\right)^2}$

where C_i^P is the predicted count, C_i^GT is the ground truth, and M is the total number of images in the dataset. Furthermore, the mean Normalized Absolute Error (NAE) is a recently proposed metric adapted by some research [49]. Additionally, the Grid Average Mean Absolute Error (GAME) [8] is proposed to measure the MAE within different regions. For any level l, the image is divided into 4^l non-overlapping regions, and the GAME at level l can be expressed as Equation (23):

$GAME(l) = \frac{1}{M}\sum_{i=1}^{M}\sum_{j=1}^{4^l}\left|C_{i,j}^P - C_{i,j}^{GT}\right| \qquad (23)$

Some metrics are devised as more general presentations of the basic metrics; for example, MAE and GAME are special cases of the Patch Mean Absolute Error (PMAE), and MSE is a special case of the Patch Mean Squared Error (PMSE). Image-quality metrics are also adapted in some cases, such as the Peak Signal to Noise Ratio (PSNR) and the Structural Similarity Index (SSIM). However, MAE and MSE remain dominant and are adapted by nearly all approaches. Table 1 lists the scales and annotation statistics of the above-mentioned datasets to provide a straight-forward comparison.
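The metrics above translate directly to code; the sketch below assumes map sides divisible by 2^l for the GAME split, and computes GAME for a single image (averaging over M images would wrap this):

```python
import numpy as np

def mae(cp, cg):
    return float(np.mean(np.abs(cp - cg)))

def mse(cp, cg):
    # root mean squared error, as conventionally reported in crowd counting
    return float(np.sqrt(np.mean((cp - cg) ** 2)))

def game(d_est, d_gt, level):
    """GAME(l) for one image: split the map into 4^l regions
    (2^l per axis) and sum per-region absolute count errors."""
    k = 2 ** level
    h, w = d_est.shape
    total = 0.0
    for i in range(k):
        for j in range(k):
            r = slice(i * h // k, (i + 1) * h // k)
            c = slice(j * w // k, (j + 1) * w // k)
            total += abs(d_est[r, c].sum() - d_gt[r, c].sum())
    return total

cp = np.array([10.0, 20.0]); cg = np.array([12.0, 16.0])
d_gt = np.zeros((4, 4)); d_gt[0, 0] = 1.0
d_est = np.zeros((4, 4)); d_est[3, 3] = 1.0
```

Note that GAME(0) equals the plain count error, while higher levels penalize spatially misplaced density even when the totals match.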

Table 1
Patterns of mainstream datasets for crowd counting

Aiming at crowd counting with multi-modal approaches, datasets with information from additional sources are also proposed. The RGBT-CC [20] dataset contains 2,030 RGB images with their corresponding thermal versions collected from an optical-thermal camera, with 138,389 annotations in total. One unique feature of RGBT-CC is that part of the thermal/RGB images are captured in darkness, which makes it a fine choice for evaluating performance under limited-brightness circumstances.

Performances and Discussions

Performances of the frequently cited approaches on the main-stream datasets are partially missing. Thus, the MAE and MSE of different approaches on SHT-A will be referred to as the primary factors for analysis. Furthermore, Figure 8 gives a perceptual illustration of the trend of performance development.

For most approaches proposed in recent years, the MAE and MSE on SHT-A stay under 60 and 100 respectively. The state-of-the-art approaches are P2PNet and SASNet, proposed in 2021. The Characteristic function Loss (ChfL) [36], proposed in 2022, has the best performance among Bayesian-loss-related approaches; it regresses features into points instead of a density map. Generally, approaches proposed in 2021 outperform those proposed in 2022. The reason is that the newest approaches focus more on exploring novel network structures, such as adapting transformers and the Bayesian loss. Despite currently lower performance, they have exhibited significant advantages and the potential to outperform the main-stream density-map-based approaches in the future.

Since P2PNet and SASNet have achieved the highest performance so far, it is necessary to take a closer inspection of their innovations.

P2PNet:
The strategy of P2PNet is a pseudo version of the point-regression approaches. Instead of calculating a confidence score for every pixel of the feature map as a pedestrian's head, P2PNet divides the extracted feature map with a grid of stride length s. Each cell of the grid is assigned as a potential head candidate/proposal with a confidence score, a strategy referred to as "one-to-one match".
The network of P2PNet has an ordinary front/back-end structure. Its front-end is a VGG16 for feature extraction. The extracted feature map is divided by s to generate proposals, which are fed to the back-end. The back-end comprises a regression head and a classification head: the regression head refines the point location of each proposal, while the classification head produces the confidence score that determines whether a proposal belongs to a pedestrian's head or the background. Finally, the loss is calculated between the predicted pedestrians and the ground truth.
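The grid-of-proposals step can be sketched as follows; placing one proposal at each cell center is an illustrative assumption, and the dual-head prediction on top of these proposals is omitted:

```python
import numpy as np

def grid_proposals(feat_h: int, feat_w: int, stride: int) -> np.ndarray:
    """One point proposal per grid cell (the 'one-to-one match'
    candidates), placed at cell centers in image coordinates;
    returns an array of shape (feat_h * feat_w, 2) as (x, y)."""
    ys, xs = np.meshgrid(np.arange(feat_h), np.arange(feat_w), indexing="ij")
    points = np.stack([(xs + 0.5) * stride, (ys + 0.5) * stride], axis=-1)
    return points.reshape(-1, 2)

props = grid_proposals(2, 2, stride=8)
```

Because the proposal density is fixed by s, a too-coarse stride caps how many heads can be matched in the densest region, which is exactly the sensitivity discussed below.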
P2PNet claims the "one-to-one match" can effectively address the unbalanced estimation problem which hampers prediction accuracy. However, the adaptiveness of P2PNet is limited in practice since s must be manually set, and an inaccurate selection of s directly impacts the prediction for the densest crowd within the footage. Thus, the outstanding performance of P2PNet in the experimental environment appears heavily tied to the selection of the hyperparameter s. If the selection of s were made self-adaptive, the performance of P2PNet would be more persuasive.

SASNet:
As the approach with the second-best MAE/MSE, the SASNet also has a standard front/back-end network structure. The VGG16 is adapted as the front-end network, whose features from strides {1, 2, 4, 8, 16} are selected as feature maps in 5 scales. Like P2PNet, the back-end network of SASNet also has dual heads: the map from each scale level is fed to a confidence head and a regression head respectively. The regression head predicts the density map; the confidence head generates a confidence map indicating whether the current scale level most properly describes the actual density. The confidence and density maps of all scales are further fused into the final density map. Furthermore, the Pyramid Region Awareness Loss (PRAL) is devised to handle the unbalanced density prediction, as explained in the Loss section above. The final loss is the summation of the density loss, the confidence loss and the PRAL. Overall, SASNet applies a conventional density-map regression strategy with a composite loss. The ablation experiment reveals 2 crucial factors behind its high performance. (1) As introduced above, the front-end network extracts feature maps in 5 scales. If only the averaged feature map is adapted for regression, the MAE on SHT-A is 57.48, at the same level as S3 [17] (57) and MAN (56.8). If the feature map with the highest confidence score is adapted, the MAE improves to 55.71. By applying a weighted average over all 5 maps with confidence scores, the final feature map is generated and adapted for regression, boosting the MAE to 54.75. This experiment proves that optimizing the scale selection of the feature map is still a feasible path to further improve performance, an optimization path that can be backtracked to the MCNN in 2017. Besides, by selecting pixels from different feature levels with minimum error and aggregating them into a "ground truth" feature map, the ideal MAE can be reduced to 46.19, which can be considered a short-term goal for the optimization of scale selection. (2) The SASNet adapts the PRAL to handle unbalanced prediction by pruning out the most over/under-estimated pixels and integrating their Euclidean distance loss into the final loss. This lifts the precision of the loss calculation for the pooled feature map to pixel level. As the ablation experiment shows, involving PRAL boosts the MAE from 54.75 to 53.59. Compared with PRAL, the unbalanced-density handling of L2S and ASNet is coarser since it is not at pixel level.
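The confidence-weighted fusion of per-scale density maps can be sketched as follows; the softmax normalization over scales is an assumption about how the confidence maps are combined, not SASNet's published formula:

```python
import numpy as np

def fuse_scales(density_maps, confidence_maps):
    """Confidence-weighted average of per-scale density maps: the
    strategy that moved SASNet's MAE from 55.71 (hard pick of the
    most confident scale) to 54.75 in the ablation discussed above."""
    conf = np.stack(confidence_maps)                 # (S, H, W)
    w = np.exp(conf)
    weights = w / w.sum(axis=0, keepdims=True)       # softmax over scales
    return (np.stack(density_maps) * weights).sum(axis=0)

d = [np.full((2, 2), 1.0), np.full((2, 2), 3.0)]
c = [np.zeros((2, 2)), np.zeros((2, 2))]             # equal confidence
fused = fuse_scales(d, c)
```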

Conclusion
This paper reviews the most representative regression-based crowd counting techniques and inspects their innovations in network architecture, the handling of unbalanced prediction, and the devising of loss functions. In conclusion, some observations and possible future trends are proposed.
Regression to points instead of density maps: Point-regression approaches can provide position information which conventional density-map-based techniques cannot. This strategy has been continually explored by Bayesian-loss-related approaches since 2019. Although still behind the state-of-the-art [41,42], the performance of point-regression approaches such as BL, GL and ChfL is steadily increasing. Considering that the high performance of the state-of-the-art relies heavily on the setting of hyperparameters in the feature extraction process, its practical feasibility still needs further evaluation, whereas point-regression approaches minimize this impact by focusing on the optimization of the regression head. Together with the capability of providing position information, point-regression approaches are promising candidates for practical application.
Replacing CNN with transformer: Originally a natural language processing network, the transformer shows impressive capability in computer vision. In recent years, multiple transformer-related approaches such as MAN and TransCrowd have been proposed and achieve sound results. By integrating global attention, the transformer-based approaches achieve generally high performance, though they still cannot match the state-of-the-art. Another defect is the deep ViT network used for encoding, which makes it harder for transformers to handle real-time tasks. However, semi-supervised transformers, such as Semiformer [51], can help maintain acceptable accuracy with limited training data.
Enhancements of CNN-based approaches: As conventional approaches, CNN-based techniques still hold a performance advantage over the others. To sustain this leading position, several optimizations are worth further investigation. First, although typical CNN-based approaches do not regress to points, the success of P2PNet proves that higher performance can be achieved by involving a point-regression strategy. Secondly, as a primitive problem, the selection of features at the proper scale can be further explored to address perspective in the footage; the SASNet exploits the weight-averaged feature map to lift performance, and theoretically the scale selection of the feature map could be modelled at pixel level to obtain the best result. Finally, as the issue of unbalanced prediction is now handled at pixel level in the most recent work, techniques can be continuously probed to further improve accuracy.

Funding
This paper is funded by the Key Research and Development Program of Shaanxi, grant number 2023-YBGY-026; the National Science Foundation of China, grant number 62071378; and the Key Projects of the Postgraduate Joint Cultivation Workstation of Xi'an University of Posts and Telecommunications, grant number YJGJ201902.

Figure 1 (a) Crowd counting with HOG and SVM approach in [9]. (b) Crowd with extremely high density which cannot be properly handled by the detection-based approaches


Figure 2 The architecture of the regression-based crowd counting approach

Figure 3 Structural evolution of front-end feature extraction networks at the early stage: (a) MCNN (b) CP-CNN (c) CSRNet


Figure 4 Structures of transformer-based approaches: (a) TransCrowd (b) CCTrans (c) MAN

Figure 6 Rescaling process of the density map in ASNet [14]

Figure 7 Adaptive Pyramid Loss




Table 2