YOLOv7-PD: Incorporating DE-ELAN and NWD-CIoU for an Advanced Pedestrian Detection Method

In the pedestrian detection task


Introduction
Pedestrian detection is a crucial task within the domain of computer vision and has a broad range of applications in autonomous driving [26,41]. It automatically identifies the location and dimensions of pedestrians within an image or video using computer vision methods. Pedestrians are among the main participants in the road traffic environment; unlike rigid objects, they have changeable shapes, varied postures, and diverse appearances, and the background environment is also diverse, which greatly increases the difficulty of pedestrian detection. At present, pedestrian detection faces challenges in the precise recognition and localization of small-scale and occluded targets [14,23,30,36]. Over the past few years, YOLOv7 [44] has received widespread acknowledgment for its outstanding detection speed and high-precision object detection. However, YOLOv7 still faces two problems in pedestrian detection. 1) Its excessively deep convolutional networks can accumulate too much background feature information, making it difficult for the model to precisely locate and detect pedestrians at small scales or under severe occlusion. 2) The sensitivity of IoU varies greatly with pedestrian size; for small target pedestrians, a slight positional deviation causes a significant reduction in IoU.
To address these issues, we present a pedestrian detection model referred to as YOLOv7-PD. We propose an improved E-ELAN module (DE-ELAN), which enhances feature extraction performance and contributes to a better capture of rich contextual information. We then propose a lightweight receptive field enhancement module (Light-REFM) to obtain fine-grained multi-scale information. Finally, we propose an improved regression loss function to boost detection accuracy for small target pedestrians. This paper makes an important contribution to solving the above problems, summarized as follows:

Related Work
In this section, we review work related to the content of this paper, including pedestrian detection, attention mechanisms, and feature pyramids.

Pedestrian Detection
Pedestrian detection determines whether a pedestrian target is present in a given static image or dynamic video [26]. If a pedestrian is present, a bounding box labels its specific location together with a confidence score [6,9]. With the vigorous advancement of artificial intelligence technology, pedestrian detection has found a broad spectrum of application scenarios in autonomous driving, human-computer interaction, intelligent video surveillance, and urban street views [12,29]. Early pedestrian detection methods relied on manual feature extraction. Dalal et al. [11] proposed the Histogram of Oriented Gradients (HOG) method, utilizing edge direction and intensity information to characterize the overall visual appearance of pedestrians. Generally speaking, manual feature extraction methods have certain advantages in pedestrian detection, but their extraction steps are cumbersome. In recent years, deep learning techniques have achieved significant breakthroughs in computer vision tasks [53,54,55,56], especially in pedestrian detection [38,39]. Deep-learning-based pedestrian detection algorithms have improved detection accuracy and can be categorized as either two-stage or one-stage approaches [52]. Two-stage approaches achieve high accuracy, but their detection speed is significantly slower. The Faster R-CNN proposed by Girshick et al. [18] utilizes a Region Proposal Network (RPN) to directly produce region proposals. The SA-FastRCNN proposed by Li et al. [25] combines two parallel large subnetworks into an integrated architecture, supplying confidence scores for different classes and bounding box regression for various object sizes. One-stage approaches have advantages in speed, simplicity, and real-time performance, rendering them appropriate for many computer vision applications, particularly those with high-speed processing requirements. Liu et al. [33] proposed SSD, which detects directly with a CNN. Wang et al. [44] proposed YOLOv7, whose backbone is primarily constructed from convolutional layers, E-ELAN modules, and MPConv modules. In particular, the E-ELAN module, building upon the original ELAN, modifies the computation block while preserving the transition-layer architecture of the initial ELAN design. It boosts the network's learning capacity by incorporating the ideas of expanding, shuffling, and merging cardinality without disrupting the existing gradient path.

Attention Mechanism
The attention mechanism is a crucial technique in deep learning that enables models to assign different weights to different parts of the input data, allowing them to concentrate on important information during processing and thereby improving performance. It is analogous to attentional focus in human perception, enabling adaptive concentration on different parts of sequences, images, or other types of data [2,37]. Hu et al. [22] proposed SENet, whose channel attention module, "Squeeze-and-Excitation" (SE), leverages the interdependencies between convolutional feature channels. Inspired by the SE block, Wang et al. [46] proposed the ECA block, a more efficient channel attention design that replaces the fully connected layer of SE with cost-effective 1D convolutions, greatly reducing the number of parameters. Woo et al. [47] proposed CBAM, a lightweight attention module combining a channel attention module with a spatial attention module. Misra et al. [35] proposed Triplet Attention, a three-branch attention module conditioned on features rotated along three different dimensions. The attention mechanism helps the model concentrate on pivotal regions in the image, enhancing the performance of object detection and localization tasks; by incorporating attention within convolutional layers, the network can autonomously learn to selectively emphasize critical objects or areas.

Feature Pyramid
Scale variation in complex road scenes significantly affects the accuracy of pedestrian detection. The feature pyramid is a multi-scale feature representation method used in computer vision tasks. Its main goal is to process information at different scales and capture both local details and the global context of an object in an image [27]. Inspired by SPP [19], Chen et al. [5] proposed the ASPP module, which employs multiple parallel dilated convolution layers with varying sampling rates; features gathered at the various rates are processed in separate branches and merged to produce the final result. By varying the dilation rate, the module creates convolution kernels with different receptive fields, aiming to capture multi-scale object information. Liu et al. [31] proposed the RFB module, which simulates a range of observations similar to human vision to augment the network's feature extraction efficiency. It combines different receptive fields by using different convolutional kernels and step sizes, uses 1 × 1 convolutions for dimensionality reduction, and ultimately constructs a hybrid superposition of varying receptive fields.
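As a concrete illustration of the dilation idea behind ASPP and RFB, the region an input kernel covers grows with the dilation rate as k_eff = k + (k − 1)(d − 1). A minimal sketch (the dilation rates below are illustrative examples, not the values used in either module):

```python
# Effective kernel size of a dilated convolution:
# k_eff = k + (k - 1) * (d - 1), where k is the kernel size and d the dilation rate.

def effective_kernel_size(k: int, d: int) -> int:
    """Span (in input pixels, per side) covered by a k x k kernel with dilation d."""
    return k + (k - 1) * (d - 1)

for d in (1, 2, 4, 8):
    print(f"3x3 kernel, dilation {d}: covers {effective_kernel_size(3, d)} pixels per side")
```

This is why stacking parallel branches with different dilation rates yields receptive fields of several sizes at the same feature-map resolution.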

The Proposed Method
Due to differences in the shape, size, and other attributes of pedestrians in complex road environments, as well as possible occlusion, the original YOLOv7 network may not be able to accurately locate and detect pedestrians. Therefore, we propose the YOLOv7-PD network.

Figure 1
The overall architecture of YOLOv7-PD

The network first enhances the backbone of YOLOv7 with the proposed DE-ELAN module, based on Omni-dimensional Dynamic Convolution, which optimizes the original E-ELAN module. This boosts the network's capacity to learn pedestrian targets of different shapes, sizes, and occlusion levels and to capture richer feature information. Second, in the Head network, we propose the lightweight receptive field enhancement module (Light-REFM), which enhances the precision of multi-scale pedestrian detection and identification. Third, the Normalized Wasserstein Distance (NWD) loss is integrated into the regression loss function and combined with the CIoU method to boost the effectiveness of YOLOv7 on the small target pedestrian detection task. Figure 1 illustrates the overall architecture of YOLOv7-PD.

E-ELAN Module Based on Omni-dimensional Dynamic Convolution
In pedestrian detection, the posture, shape, and size of pedestrians vary with shooting angle and distance. By incorporating omni-dimensional dynamic convolutions into the backbone's E-ELAN module and gradually introducing attention mechanisms along the various dimensions of the convolution operation (spatial position, input channels, output filters, and convolution kernels), the convolution process can better adapt to differences across these aspects of the input data and thus more effectively capture rich contextual information for pedestrians of different scales. We therefore designed an improved E-ELAN module (DE-ELAN) that replaces the last CBS module of the E-ELAN in YOLOv7's feature extraction network with an improved OCBS module. The OCBS module, based on Omni-dimensional Dynamic Convolution, is displayed in Figure 2.

Figure 2
The structure of OCBS

In the OCBS module, we introduce ODConv (Omni-dimensional Dynamic Convolution) [24], which uses a parallel strategy to apply a multi-dimensional attention mechanism along four dimensions of the kernel space, learning more flexible and complementary attention. It simultaneously accounts for dynamics across the spatial, input channel, and output channel dimensions to capture rich contextual information. This multi-dimensional manipulation improves the model's capability to analyze data, enhances its perception of different features, and contributes to better performance on complex tasks. The OCBS convolution can be described as:

$y = \left( \beta_{k1} \odot \beta_{s1} \odot \beta_{c1} \odot \beta_{f1} \odot W_1 + \cdots + \beta_{kn} \odot \beta_{sn} \odot \beta_{cn} \odot \beta_{fn} \odot W_n \right) * x$

where $\beta_{ki} \in \mathbb{R}$ represents the attention scalar for the convolutional kernel $W_i$; $\beta_{si} \in \mathbb{R}^{k \times k}$, $\beta_{ci} \in \mathbb{R}^{c_{in}}$, and $\beta_{fi} \in \mathbb{R}^{c_{out}}$ represent the attentions applied along the spatial, input channel, and output channel dimensions of $W_i$, respectively; $\odot$ denotes multiplication along the corresponding dimension of the kernel space; and $\beta_{si}$, $\beta_{ci}$, $\beta_{fi}$, and $\beta_{ki}$ are computed by a multi-head attention module $\pi_i(x)$. Traditional convolution still plays a role in pedestrian detection, especially for pedestrians of moderate size and relatively normal posture. However, its adaptability is limited when dealing with pedestrians with irregular shapes and significant posture variations. We therefore designed the DE-ELAN module to boost the ability to capture details of small-sized pedestrians.
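The aggregation above can be sketched in miniature. The following toy (hypothetical shapes and hand-picked attention values, with no actual convolution or attention network) only shows how the four attention factors jointly reweight a bank of candidate kernels:

```python
# Toy sketch of ODConv-style kernel aggregation (simplified, illustrative values):
# the aggregated kernel is sum_i beta_k[i] * beta_f ⊙ beta_c ⊙ beta_s ⊙ W_i,
# with beta_k a scalar per kernel, beta_s over the k x k spatial positions,
# beta_c over input channels, and beta_f over output filters.

def aggregate_kernels(kernels, beta_k, beta_s, beta_c, beta_f):
    n = len(kernels)
    c_out, c_in, k = len(kernels[0]), len(kernels[0][0]), len(kernels[0][0][0])
    W = [[[[0.0] * k for _ in range(k)] for _ in range(c_in)] for _ in range(c_out)]
    for i in range(n):
        for o in range(c_out):
            for c in range(c_in):
                for y in range(k):
                    for x in range(k):
                        W[o][c][y][x] += (beta_k[i] * beta_f[i][o] * beta_c[i][c]
                                          * beta_s[i][y][x] * kernels[i][o][c][y][x])
    return W

# Two candidate kernels of shape (c_out=1, c_in=1, 2, 2), all weights 1.0.
kernels = [[[[[1.0, 1.0], [1.0, 1.0]]]] for _ in range(2)]
beta_k = [0.5, 0.5]                       # per-kernel attention scalars
beta_s = [[[1.0, 1.0], [1.0, 1.0]]] * 2   # spatial attention
beta_c = [[1.0]] * 2                      # input-channel attention
beta_f = [[2.0]] * 2                      # output-filter attention
W = aggregate_kernels(kernels, beta_k, beta_s, beta_c, beta_f)
print(W[0][0])  # each position: 0.5*2.0*1.0*1.0*1.0 summed over 2 kernels = 2.0
```

In the real module these four attentions are predicted from the input by $\pi_i(x)$, so the effective kernel adapts per sample; here they are fixed constants purely to make the arithmetic visible.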
The DE-ELAN module is depicted in Figure 3.

Figure 3
The structure of DE-ELAN

Lightweight-Receptive Field Enhancement Module
Different sizes of receptive fields imply different abilities to capture long-range dependencies, cover more of the surrounding area, and capture richer contextual information. We therefore propose a lightweight receptive field enhancement module, Light-REFM, which captures the spatial details of the input feature maps at multiple scales through parallel dilated convolutions, yielding feature maps enriched with contextual information. The composition of the module is displayed in Figure 4.
Zhang et al. [48] proposed group convolution as a technique to extract features at various scales. Processing channels in groups can capture multi-scale information, but parameter efficiency may suffer as the convolution kernel dimensions grow. The erode operation is often used to improve detection results by removing small false-detection areas caused by noise or background interference. However, it may eliminate important feature details when part of a pedestrian is not clearly visible or resembles the background, and for small target pedestrians it may reduce their effective scale, limiting the model's flexibility and accuracy in handling pedestrians of different scales.
Dilated convolution keeps the size of the feature map while enlarging the receptive field, which is beneficial for preserving detailed features. We therefore employ dilated convolutions with different dilation rates to capture feature information at different scales.

Figure 4
The structure of Light-REFM

In the Light-REFM module, the input feature map f is first partitioned into four parts along the channel dimension, following the arrangement order of the feature channels, yielding [F_0, F_1, F_2, F_3]. The number of channels in each group is C' = C/4, where C should be a multiple of 4. Each group's feature map is $F_j \in \mathbb{R}^{C' \times H \times W}$, where j = 0, 1, 2, 3. The first group F_0 retains the original features without additional dilated convolution, ensuring that the network preserves fine low-level details and the original information. To the remaining three groups F_1, F_2, and F_3 we apply dilated convolutions with different dilation rates to extract features at different scales, expanding the receptive field and enabling the network to capture features of varying scales more effectively. Finally, the outputs of the four branches are concatenated, merging feature information from different scales and resolutions. This gives the model a stronger representational capacity to capture multi-scale information and contextual relationships while keeping the number of parameters, and hence the computational burden, under control. The entire multi-scale feature map F is:

$F = \mathrm{Concat}\left( F_0, D_{r_1}(F_1), D_{r_2}(F_2), D_{r_3}(F_3) \right)$

where $D_{r_i}$ denotes a dilated convolution with dilation rate $r_i$, $F \in \mathbb{R}^{C \times H \times W}$ denotes the resulting multi-scale feature map, and Concat denotes the act of concatenation.
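A minimal one-dimensional sketch of this splitting scheme follows, assuming placeholder dilation rates (1, 3, 5) and an identity kernel, since the module's learned weights and exact rates are not specified here:

```python
# 1-D sketch of Light-REFM's channel splitting (illustrative only; the paper's
# actual dilation rates and learned convolution weights are placeholders here).

def dilated_conv1d(signal, weights, dilation):
    """'Same'-length 1-D dilated convolution with zero padding at the borders."""
    k = len(weights)
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j in range(k):
            idx = i + (j - k // 2) * dilation
            if 0 <= idx < len(signal):
                acc += weights[j] * signal[idx]
        out.append(acc)
    return out

def light_refm_1d(channels, dilations=(1, 3, 5)):
    """channels: list of C feature vectors; C must be a multiple of 4."""
    C = len(channels)
    assert C % 4 == 0, "C must be a multiple of 4"
    g = C // 4
    groups = [channels[i * g:(i + 1) * g] for i in range(4)]
    out = list(groups[0])            # F0: identity branch keeps original details
    identity = [0.0, 1.0, 0.0]       # placeholder 3-tap kernel
    for d, grp in zip(dilations, groups[1:]):
        out += [dilated_conv1d(ch, identity, d) for ch in grp]
    return out                        # concatenation along the channel dimension

feats = [[float(i + j) for j in range(6)] for i in range(4)]  # C=4, length-6 maps
merged = light_refm_1d(feats)
print(len(merged), merged[0])
```

Because the identity kernel is used, every branch passes its group through unchanged; with learned kernels, the three dilated branches would instead respond to structure at three different scales while the output keeps the same channel count and spatial length as the input.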

Regression Loss Function Design
In YOLOv7, the CIoU method is used as an evaluation metric to compute the overlap region between the predicted box and the ground truth box. Compared to the traditional IoU (Intersection over Union) measurement, CIoU takes into account the normalized center-point distance and the width and height of the bounding box, making it more robust with respect to bounding box position and size [57]. However, pedestrians appear at very different target sizes. For small target pedestrians, the bounding box is relatively small, and even a small positional deviation can cause a notable change in the IoU value. Although CIoU loss improves the stability of IoU loss by considering the distance between box centers and the aspect ratio, these improvements may still not completely overcome the instability of IoU: for small targets, a slight offset of the bounding box can lead to a sharp increase in the loss value, making the model overly sensitive to the position of small targets and thereby affecting training stability and final performance. In addition, the sensitivity of CIoU loss to size changes is particularly pronounced for small targets, which may cause the model to overemphasize size variation while ignoring other important cues such as appearance and contextual information.

The NWD is a novel approach for small object detection [45] that uses a metric based on the Wasserstein distance to measure bounding box similarity, substituting for the conventional IoU measurement. The method first models bounding boxes as two-dimensional Gaussian distributions, and then uses the Wasserstein distance between the corresponding Gaussians to measure their similarity. This distance remains meaningful even when the boxes do not overlap, and it is insensitive to object scale, making it well suited to measuring small objects.

We integrate the NWD into the regression loss function alongside the CIoU method to harness the strengths of both in pedestrian detection. CIoU excels at measuring spatial location and overlap, making it particularly effective for medium and large-scale pedestrians, where precise bounding box alignment is crucial. NWD, on the other hand, is remarkably robust for small-scale pedestrians: it captures the distributional characteristics of the target without being overly sensitive to scale, recognizing subtle distributional nuances of small targets that conventional methods often overlook. By combining CIoU's precision in spatial alignment with NWD's sensitivity to distributional attributes, we boost the model's robustness and accuracy across multiple scales, allowing a more comprehensive and nuanced capture of pedestrian position and features and significantly improving detection performance, especially when target sizes vary widely. The regression loss function is formulated as:

$L_{reg} = \gamma L_{NWD} + (1 - \gamma) L_{CIoU}$

where $L_{reg}$ is the regression loss function and $\gamma$ is a weighting factor used to balance the contributions of NWD and CIoU in the regression loss. The value of $\gamma$ lies between 0 and 1; it has been verified that the optimal result is achieved when $\gamma = 0.7$. The CIoU loss function is specified as:

$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$

where IoU assesses the level of overlap between the predicted bounding box $b$ and the ground truth bounding box $b^{gt}$; $\rho^2(b, b^{gt})$ is the squared Euclidean distance between the center points of the predicted box and the ground truth box; $c$ denotes the diagonal length of the smallest enclosing region that can encompass both boxes; $\alpha$ is the trade-off parameter that weighs the importance of the aspect-ratio term in the loss function; and $v$ is the function that quantifies the aspect ratio consistency:

$v = \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2, \qquad \alpha = \frac{v}{(1 - IoU) + v} \qquad (5)$

In Equation (5), $w$ and $h$ represent the width and height of the predicted box, and $w^{gt}$ and $h^{gt}$ represent the width and height of the ground truth box. The NWD loss function is specified as:

$L_{NWD} = 1 - \exp\left( -\frac{\sqrt{W_2^2(N_a, N_b)}}{C} \right)$

where $C$ is a constant that has a strong correlation with the dataset, and $W_2^2(N_a, N_b)$ is the distance metric between the two-dimensional Gaussian distributions $N_a$ and $N_b$ modeled from the predicted and ground truth boxes.
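The combined loss can be sketched as follows, assuming that $\gamma$ weights the NWD term and using a placeholder value C = 1 for the dataset-dependent NWD constant:

```python
import math

# Sketch of the combined regression loss (assumptions: gamma weights the NWD
# term, and C = 1.0 stands in for the dataset-dependent NWD constant).
# Boxes are given as (cx, cy, w, h).

def ciou_loss(box_p, box_g):
    (cx1, cy1, w1, h1), (cx2, cy2, w2, h2) = box_p, box_g
    # Intersection over union
    ix = max(0.0, min(cx1 + w1/2, cx2 + w2/2) - max(cx1 - w1/2, cx2 - w2/2))
    iy = max(0.0, min(cy1 + h1/2, cy2 + h2/2) - max(cy1 - h1/2, cy2 - h2/2))
    inter = ix * iy
    union = w1 * h1 + w2 * h2 - inter
    iou = inter / union
    # Squared center distance over squared enclosing-box diagonal
    rho2 = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
    cw = max(cx1 + w1/2, cx2 + w2/2) - min(cx1 - w1/2, cx2 - w2/2)
    ch = max(cy1 + h1/2, cy2 + h2/2) - min(cy1 - h1/2, cy2 - h2/2)
    c2 = cw ** 2 + ch ** 2
    # Aspect-ratio term (Equation 5) and trade-off parameter alpha
    v = (4 / math.pi ** 2) * (math.atan(w2 / h2) - math.atan(w1 / h1)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0
    return 1 - iou + rho2 / c2 + alpha * v

def nwd_loss(box_p, box_g, C=1.0):
    # Boxes modeled as 2-D Gaussians; W2^2 distance between (cx, cy, w/2, h/2).
    a = (box_p[0], box_p[1], box_p[2] / 2, box_p[3] / 2)
    b = (box_g[0], box_g[1], box_g[2] / 2, box_g[3] / 2)
    w2_sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - math.exp(-math.sqrt(w2_sq) / C)

def reg_loss(box_p, box_g, gamma=0.7):
    return gamma * nwd_loss(box_p, box_g) + (1 - gamma) * ciou_loss(box_p, box_g)

print(round(reg_loss((5, 5, 2, 2), (5, 5, 2, 2)), 6))  # identical boxes -> 0.0
```

Note that `nwd_loss` still returns a finite, informative value for two boxes with zero overlap, whereas the IoU term alone would saturate; this is precisely the property that stabilizes regression for small targets.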

Regression Loss Function Design
In YOLOv7, the CIoU method is used as an evaluation metric to compute the overlap region between the predicted box and the ground truth box.
Compared to the traditional IoU (Intersection over Union) measurement method, the CIoU method takes into account the normalized differences in the center-point distance, width, and height of the bounding box, making it more robust regarding the bounding box position and size [57].However, in pedestrian detection, there are pedestrians with different target sizes.For small target pedestrians, the bounding box is relatively small, and even a small positional deviation can lead to a notable change in the IoU value.Although CIoU loss enhances the stability of IoU loss by considering the distance between the centers of bounding boxes and the aspect ratio, these improvements may still not be enough to completely overcome the instability of the IoU.For small targets, a slight offset in the bounding box can lead to a significant increase in the loss value, making the model overly sensitive to the position of small targets, thereby affecting the stability of training and the final performance of the model.In addition, sensitivity to size changes in CIoU loss is particularly prominent on small targets.This may lead the model to be overly sensitive to size changes when dealing with small targets while ignoring other important features such as appearance and contextual information.The NWD is a novel approach for small object detection [45], which uses a novel metric based on Wasserstein distance to calculate bounding box similarity, substituting the conventional IoU measurement method.First, this method models bounding boxes by utilizing twodimensional Gaussian distributions.
Then, Wasserstein distances are used to calculate the resemblance between the corresponding Gaussian distributions.Because this distance can calculate the similarity of distributions even when overlaps are ignored.Furthermore, this method is insensitive to multi-scale objects, making it well-suited for measuring small objects.
We integrate the NWD into the regression loss function alongside the CIoU method to harness the strengths of both in pedestrian detection.CIoU excels in measuring spatial location and overlaps, making it particularly effective for medium to largescale pedestrians where precision in bounding box alignment is crucial.On the other hand, NWD demonstrates remarkable robustness in handling small-scale pedestrians by effectively capturing the distributional characteristics of the target without being overly sensitive to scale.Its strength lies in recognizing the subtle distributional nuances of small targets, often overlooked by conventional methods.Therefore, by combining CIoU's precision in spatial alignment with NWD's sensitivity to distributional attributes, we significantly boost the model's robustness and accuracy across multiple scales.This fusion allows for a more comprehensive and nuanced capture of the position and features of pedestrians, significantly boosting detection performance, especially in scenarios where targets vary widely in size.The formulation of this regression loss function is expressed as: where Lreg is the regression loss function, γ is a weighting factor used to equilibrium the contribution of NWD and CIoU in the regression loss.The value of γ is usually between 0 and 1.It has been verified that the optimal result is achieved when γ = 0.7.The CIoU loss function is specified as: where IoU is used to assess the level of overlap between the predicted bounding box b and the ground truth bounding box bgt.ρ 2 (b,bgt) calculates the Euclidean distance between the center point of the predicted box and the ground truth box.c denotes the diagonal distance of the smallest enclosed area that can encompass both the prediction box and the ground truth box.α serves as the trade-off parameter, which employs the importance of the center distance in the loss function.v serves as the function that quantifies the aspect ratio metric.
In Equation ( 5), w and h represent the height and width of the prediction box.wgt and hgt represent the height and width of the ground truth box.The NWD loss function is specified as: where c is a constant that has a strong correlation with the dataset, W2 2(a, b) is a distance metric. ( where IoU is used to assess the level of overlap between the predicted bounding box b and the ground truth bounding box b gt .ρ 2 (b,b gt ) calculates the Euclidean distance between the center point of the predicted box and the ground truth box.c denotes the diagonal distance of the smallest enclosed area that can encompass both the prediction box and the ground truth box.α serves as the trade-off parameter, which employs the importance of the center distance in the loss function.v serves as the function that quantifies the aspect ratio metric.
capture multi-scale information and contextual relationships.Simultaneously, it controls the number of parameters, avoiding excessive computational burden.The entire multi-scale feature map F is: where Fϵ R C`×H×W denotes the resulting multi-scale feature map, and contact denotes the act of concatenation.

Regression Loss Function Design
In YOLOv7, the CIoU method is used as an evaluation metric to compute the overlap region between the predicted box and the ground truth box.
Compared to the traditional IoU (Intersection over Union) measurement method, the CIoU method takes into account the normalized differences in the center-point distance, width, and height of the bounding box, making it more robust regarding the bounding box position and size [57].However, in pedestrian detection, there are pedestrians with different target sizes.For small target pedestrians, the bounding box is relatively small, and even a small positional deviation can lead to a notable change in the IoU value.Although CIoU loss enhances the stability of IoU loss by considering the distance between the centers of bounding boxes and the aspect ratio, these improvements may still not be enough to completely overcome the instability of the IoU.For small targets, a slight offset in the bounding box can lead to a significant increase in the loss value, making the model overly sensitive to the position of small targets, thereby affecting the stability of training and the final performance of the model.In addition, sensitivity to size changes in CIoU loss is particularly prominent on small targets.This may lead the model to be overly sensitive to size changes when dealing with small targets while ignoring other important features such as appearance and contextual information.The NWD is a novel approach for small object detection [45], which uses a novel metric based on Wasserstein distance to calculate bounding box similarity, substituting the conventional IoU measurement method.First, this method models bounding boxes by utilizing twodimensional Gaussian distributions.
This design enables Light-REFM to capture multi-scale information and contextual relationships while controlling the number of parameters and avoiding excessive computational burden. The entire multi-scale feature map F ∈ R^(C'×H×W) is obtained by concatenating the outputs of the individual branches.

Regression Loss Function Design
In YOLOv7, the CIoU method is used as an evaluation metric to compute the overlap region between the predicted box and the ground truth box.
Compared to the traditional IoU (Intersection over Union) measurement, the CIoU method takes into account the normalized center-point distance and the width and height differences of the bounding box, making it more robust to bounding-box position and size [57]. However, pedestrian detection involves targets of widely varying sizes. For small pedestrian targets, the bounding box is relatively small, and even a slight positional deviation can cause a notable change in the IoU value. Although CIoU loss improves the stability of IoU loss by considering the distance between box centers and the aspect ratio, these improvements may still not completely overcome the instability of IoU. For small targets, a slight offset of the bounding box can lead to a significant increase in the loss value, making the model overly sensitive to the position of small targets and thereby affecting training stability and final performance. In addition, the sensitivity of CIoU loss to size changes is particularly prominent on small targets, which may lead the model to over-focus on size while ignoring other important cues such as appearance and contextual information. The NWD is a novel approach for small object detection [45] that measures bounding-box similarity with a metric based on the Wasserstein distance, substituting the conventional IoU measurement. First, this method models each bounding box as a two-dimensional Gaussian distribution.
Then, the Wasserstein distance is used to calculate the resemblance between the corresponding Gaussian distributions. This distance can measure distributional similarity even when the boxes do not overlap. Furthermore, it is insensitive to object scale, making it well suited for measuring small objects.
We integrate NWD into the regression loss function alongside the CIoU method to harness the strengths of both in pedestrian detection. CIoU excels at measuring spatial location and overlap, making it particularly effective for medium- to large-scale pedestrians, where precision in bounding-box alignment is crucial. NWD, on the other hand, demonstrates remarkable robustness on small-scale pedestrians by capturing the distributional characteristics of the target without being overly sensitive to scale; its strength lies in recognizing the subtle distributional nuances of small targets that conventional methods often overlook. By combining CIoU's precision in spatial alignment with NWD's sensitivity to distributional attributes, we boost the model's robustness and accuracy across multiple scales, allowing a more comprehensive capture of pedestrian position and features, especially in scenarios where targets vary widely in size. The regression loss function is formulated as:

L_reg = γ · L_CIoU + (1 − γ) · L_NWD

where L_reg is the regression loss and γ is a weighting factor, with a value between 0 and 1, used to balance the contributions of CIoU and NWD. It has been verified that the optimal result is achieved when γ = 0.7. The CIoU loss function is specified as:

L_CIoU = 1 − IoU + ρ²(b, b_gt)/c² + αv

where IoU assesses the level of overlap between the predicted bounding box b and the ground-truth bounding box b_gt, ρ²(b, b_gt) is the squared Euclidean distance between the center points of the predicted and ground-truth boxes, and c denotes the diagonal length of the smallest enclosing box that can encompass both. α = v / ((1 − IoU) + v) is the trade-off parameter that weights the aspect-ratio term, and v quantifies the aspect-ratio consistency:

v = (4/π²) · (arctan(w_gt/h_gt) − arctan(w/h))²      (5)
In Equation (5), w and h represent the width and height of the prediction box, and w_gt and h_gt represent the width and height of the ground-truth box. The NWD loss function is specified as:

L_NWD = 1 − exp(−sqrt(W₂²(N_a, N_b)) / C)

where C is a constant that is strongly correlated with the dataset, and W₂²(N_a, N_b) is the squared Wasserstein distance between N_a and N_b, the two-dimensional Gaussian distributions modeled from the bounding boxes A(cx_a, cy_a, w_a, h_a) and B(cx_b, cy_b, w_b, h_b).
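To make the combined loss concrete, the following is a minimal sketch of the NWD-CIoU regression loss for axis-aligned boxes given as (cx, cy, w, h). The Gaussian modeling N((cx, cy), diag(w²/4, h²/4)) follows the NWD formulation; the constant C is dataset-dependent as noted above, so the default value used here (C = 12.8) is an illustrative assumption, not the paper's setting.

```python
import math

def ciou_loss(b, bgt):
    """CIoU loss for boxes given as (cx, cy, w, h)."""
    (cx, cy, w, h), (cxg, cyg, wg, hg) = b, bgt
    # convert to corner coordinates
    x1, y1, x2, y2 = cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
    xg1, yg1, xg2, yg2 = cxg - wg / 2, cyg - hg / 2, cxg + wg / 2, cyg + hg / 2
    # intersection over union
    iw = max(0.0, min(x2, xg2) - max(x1, xg1))
    ih = max(0.0, min(y2, yg2) - max(y1, yg1))
    inter = iw * ih
    union = w * h + wg * hg - inter
    iou = inter / union
    # squared center distance and enclosing-box diagonal
    rho2 = (cx - cxg) ** 2 + (cy - cyg) ** 2
    cw = max(x2, xg2) - min(x1, xg1)
    ch = max(y2, yg2) - min(y1, yg1)
    c2 = cw ** 2 + ch ** 2
    # aspect-ratio consistency term (Equation 5) and its trade-off weight
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v

def nwd_loss(b, bgt, C=12.8):
    """NWD loss: boxes modeled as 2-D Gaussians N((cx, cy), diag(w^2/4, h^2/4)).
    C is dataset-dependent; 12.8 is only an illustrative default."""
    (cx, cy, w, h), (cxg, cyg, wg, hg) = b, bgt
    # closed-form squared Wasserstein distance between the two Gaussians
    w2 = (cx - cxg) ** 2 + (cy - cyg) ** 2 + ((w - wg) / 2) ** 2 + ((h - hg) / 2) ** 2
    return 1 - math.exp(-math.sqrt(w2) / C)

def reg_loss(b, bgt, gamma=0.7, C=12.8):
    """Combined regression loss L_reg = gamma * L_CIoU + (1 - gamma) * L_NWD."""
    return gamma * ciou_loss(b, bgt) + (1 - gamma) * nwd_loss(b, bgt, C)
```

For a perfectly aligned prediction the loss is zero, and it grows smoothly with center offset even when the boxes no longer overlap, which is exactly the property that makes the NWD term useful for small targets.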

Experiments
In this section, we train, validate, and test our model on three datasets: CityPersons [50], Caltech [13], and CrowdHuman [40]. We conduct ablation experiments to evaluate the performance of the proposed method. Finally, we compare our method with state-of-the-art pedestrian detection methods.

Dataset and Evaluation Indicators
The CityPersons dataset is a subset of the Cityscapes dataset [10], containing annotations of pedestrian objects in images captured near roads by onboard vehicle cameras. It includes street scenes from 27 different cities, featuring diverse pedestrian samples, and is partitioned into a training set of 2975 images, a validation set of 500 images, and a test set of 1525 images. The CityPersons dataset is displayed in Figure 5.
The CrowdHuman dataset is a pedestrian detection dataset released by Megvii (Face++) and contains images obtained mostly from Google searches. Compared to other datasets, CrowdHuman is considerably denser, with a significantly higher average number of objects per image.
It comprises 15,000 training images, 4,370 validation images, and 5,000 testing images. The training and validation images contain 470,000 instances in total, an average of 23 individuals per image; the test images have no accompanying annotations. The CrowdHuman dataset is displayed in Figure 6.
The Caltech dataset is one of the datasets used for pedestrian detection tasks. It consists of 30 Hz video at 640x480 resolution captured in various scenes, including urban streets and campuses. The training set comprises 42,782 images, and the test set comprises 4,024 images. Figure 7 illustrates the Caltech dataset.

We use the log-average miss rate based on false positives per image (FPPI) to evaluate the proposed method. It is determined by computing the geometric mean of the miss rates at 9 FPPI thresholds evenly spaced in the logarithmic range from 0.01 to 1; the resulting value is referred to as MR⁻². A lower value indicates superior model performance. When assessing model accuracy, the paper employs Precision (P), Recall (R), the Average Precision (AP) index, AP@0.5 (average precision at IoU = 0.5), and AP@0.5:0.95 (average precision over IoU thresholds from 0.5 to 0.95 in steps of 0.05), computed as:

P = TP / (TP + FP)
R = TP / (TP + FN)
AP = ∫₀¹ P(R) dR

where TP refers to True Positives, FP refers to False Positives, and FN refers to False Negatives.
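As a concrete illustration, MR⁻² can be computed from a miss-rate/FPPI curve as follows. This is a minimal sketch assuming the curve is supplied as parallel lists sorted by increasing FPPI; reference implementations may differ in how they interpolate between curve points.

```python
import math

def log_average_miss_rate(fppi, miss_rate):
    """MR^-2: geometric mean of miss rates sampled at 9 FPPI reference
    points evenly spaced in log space over [1e-2, 1e0].

    fppi and miss_rate are parallel lists sorted by increasing FPPI.
    """
    refs = [10 ** (-2 + 0.25 * i) for i in range(9)]  # 9 points, 1e-2 .. 1e0
    samples = []
    for r in refs:
        # take the miss rate at the largest FPPI not exceeding the reference;
        # if the curve starts above r, conservatively count a full miss (1.0)
        eligible = [m for f, m in zip(fppi, miss_rate) if f <= r]
        samples.append(eligible[-1] if eligible else 1.0)
    # geometric mean, computed in log space for numerical stability
    return math.exp(sum(math.log(max(s, 1e-12)) for s in samples) / len(samples))
```

A flat curve with a constant miss rate of 0.2 yields MR⁻² = 0.2, which is a quick sanity check for the geometric averaging.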

Experimental Setup
The training setup in this paper is as follows: the initial learning rate is set to 10⁻² and adjusted via cosine annealing decay, optimization uses the SGD optimizer, and the batch size is configured as 32. The momentum is 0.937 and the weight decay is 0.0005. During data preprocessing, random cropping is applied to the original images to obtain fixed-size inputs, followed by scaling and padding.
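For reference, the cosine-annealing schedule described above can be sketched as follows. This illustrates only the decay curve itself; the actual training additionally uses SGD with momentum 0.937, weight decay 0.0005, and batch size 32, and training frameworks typically prepend warm-up steps.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_init=1e-2, lr_min=0.0):
    """Learning rate decayed from lr_init to lr_min along a half cosine."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_init - lr_min) * (1 + math.cos(math.pi * progress))
```

The rate starts at 10⁻², passes through half its initial value at the midpoint of training, and reaches lr_min at the final step.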

Ablation Studies
In this section, to assess the influence and effectiveness of each added component on the overall model, we conduct ablation experiments on the CityPersons dataset and evaluate performance using MR⁻². The ablation results for each component are displayed in Table 1.
To evaluate the impact of combining NWD with CIoU on pedestrian detection performance, and to find the optimal way of combining them, we first tried using NWD alone as the loss function in place of the traditional IoU. However, this change did not improve performance. We therefore retained CIoU and instead tuned the weight ratio between the CIoU loss and NWD. To this end, we designed six sets of comparison experiments and recorded the results in Table 2.
From Table 2, it is evident that different ratio relationships have meaningful effects on the detection performance of YOLOv7.The best detection results are achieved when the weight ratio of CIoU Loss to NWD is set to 0.7 and 0.3, respectively.
To verify the superiority of combining NWD and CIoU as the regression loss function, we compared several prevalent loss functions against our method on the YOLOv7 model. The experimental results are presented in Table 3 and clearly illustrate that integrating NWD and CIoU into the regression loss substantially enhances the model's ability, effectively proving the validity of our method. Furthermore, to better evaluate the impact of the added components on overall model accuracy, Figure 8 shows the Precision, Recall, P-R curve, mAP@0.5, mAP@0.5:0.95, box loss, and object loss obtained by training on the CityPersons dataset. Compared to YOLOv7, YOLOv7-PD improves Precision by 5.51%, Recall by 7.71%, mAP@0.5 by 7.01%, and mAP@0.5:0.95 by 6.88%. YOLOv7-PD's box loss is consistently lower than that of the other three versions, indicating that YOLOv7-PD is more precise in locating object bounding boxes.
During training, YOLOv7-PD's object loss decreases relatively quickly and remains low. A lower box loss typically suggests that the model performs well in predicting the size and position of targets, while a lower object loss indicates that YOLOv7-PD is more accurate in determining the presence of objects in images.
In Figure 9, we verify the effectiveness of YOLOv7-PD using Grad-CAM. The heatmaps are computed from the penultimate convolutional layer; to the left of each input image, the ground-truth labels are displayed, and the heatmap distribution reflects the spatial distribution of the areas of interest within the network. Comparing the visualizations of YOLOv7 and YOLOv7-PD, the results indicate that YOLOv7-PD places more emphasis on the center of a pedestrian's torso. Furthermore, for smaller pedestrian targets, its focus typically covers the entire body region. Compared to YOLOv7, our method exhibits superior performance in learning small pedestrian targets.

Comparison with the State-of-the-Art
In this section, we compare the performance of the YOLOv7-PD model against several representative models on the Caltech, CityPersons, and CrowdHuman datasets, using consistent parameters, environment, and data-processing techniques to compute the respective evaluation metrics.
To validate the detection capability of YOLOv7-PD on pedestrian targets of varying scales, we chose several outstanding pedestrian detection models for comparison on subsets of the Caltech dataset partitioned by scale. These include MSCM-ANet [34], which designs multi-scale convolution modules to extract features at varying scales; RPN+BF [49], which uses a Region Proposal Network to propose regions of interest that are then processed by the backbone network to extract features for subsequent detection or classification; TA-CNN [42], which reduces the variance between datasets through a multi-task deep model; MHN [4], which proposes a multi-branch, high-level semantic network that uses cross-layer connections to add context to a relatively small receptive-field branch and incorporates dilated convolutions to increase the resolution of the output feature map; the Coupled Network [32], which uses a gated multi-layer feature-extraction subnetwork and deformable region-of-interest pooling to handle occlusion in pedestrian detection; MS-CNN [3], a multi-scale neural network for fast multi-scale target detection; and SA-FastRCNN [25], which addresses multi-scale detection by jointly training two networks for large and small pedestrian targets. As displayed in Figure 10, on the Reasonable subset of Caltech, YOLOv7-PD achieves an MR of 15%, 2% lower than the second-best model. On the subsets with pedestrian heights of (30, 80) and (80, inf), the MR values are 18% and 3% respectively, at the forefront compared to other methods. On the subset with pedestrian heights of (0, 30), the MR is 21%, 1% lower than the sub-optimal model, demonstrating excellent performance. This achievement is mainly attributed to our design of the Light-REFM module, which finely captures multi-scale information by building a pyramid
structure and employing dilated convolutions of different sizes. This structure enables the model to effectively recognize and process targets at different scales and is particularly good at detecting small targets and partially occluded pedestrians. In addition, we adopt a refined regression loss function that combines NWD with CIoU; this innovation not only further enhances the model's localization accuracy but is also particularly effective for the accurate detection and feature capture of small targets, ensuring efficient performance in complex scenes.
To evaluate the detection capability of YOLOv7-PD on pedestrian targets with different occlusion levels, we partitioned the CityPersons dataset into four subsets by occlusion level. Reasonable subset: pedestrians with visibility from 65% to 100%, i.e., mostly visible individuals with possible partial occlusion. Heavy subset: pedestrians with visibility below 65%, representing significant occlusion or very challenging detection conditions. Partial subset: pedestrians with visibility between 65% and 90%, indicating moderate occlusion. Bare subset: pedestrians who are almost entirely visible, with visibility between 95% and 100%. Using the same parameters, we compared YOLOv7-PD with the above methods; the results are presented in Table 4. Our method achieves an MR⁻² of 6.7% on the bare subset, 22.5% on the partial-occlusion subset, and 36.2% on the heavy-occlusion subset. These results indicate that YOLOv7-PD is highly effective at detecting occluded pedestrian objects. In Table 5, we report AP and FPS for YOLOv7-PD and several single-stage models to assess their effectiveness in pedestrian detection. Compared to other single-stage methods, YOLOv7-PD attains a higher AP of 69.3% at 96 FPS. Our method shows significant advantages across performance metrics, demonstrating excellent detection accuracy while sustaining a high processing speed.
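The subset partitioning above can be sketched as a simple visibility-ratio bucketing. This is a minimal illustration using the ranges stated above; note that by these definitions the Reasonable subset overlaps the Partial and Bare subsets, so a pedestrian can belong to more than one subset.

```python
def occlusion_subsets(visible_area, full_area):
    """Assign a pedestrian to CityPersons-style occlusion subsets based on
    its visibility ratio (visible box area / full box area).

    Ranges follow the subset definitions in the text:
      bare: [0.95, 1.0], partial: [0.65, 0.90],
      heavy: < 0.65, reasonable: [0.65, 1.0].
    """
    v = visible_area / full_area
    subsets = []
    if v >= 0.95:
        subsets.append("bare")
    if 0.65 <= v <= 0.90:
        subsets.append("partial")
    if v < 0.65:
        subsets.append("heavy")
    if v >= 0.65:
        subsets.append("reasonable")
    return subsets
```

For example, a pedestrian whose visible box covers 70% of the full box falls in both the partial and reasonable subsets, while one at 50% visibility falls only in the heavy subset.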
To validate the detection capability of YOLOv7-PD on occluded objects, we chose the CrowdHuman dataset, known for its abundance of occluded objects, and compared YOLOv7-PD against advanced methods including RetinaNet [28], YOLOX [16], YOLO-CPD [15], GossipNet [20], RelationNet [21], and MIP [8]. Using MR⁻² and AP for evaluation, as depicted in Table 6, our method achieves an AP of 83.36% and an MR⁻² of 41.2%, both reaching the advanced level among comparable detection methods.
YOLOv7-PD achieves this result mainly due to the adoption of the improved DE-ELAN module, which utilizes ODConv and four complementary attentional mechanisms to effectively capture rich contextual information and enhance feature extraction.This enhanced feature extraction is essential for accurately detecting complex image details, especially in crowded or complex scenes.When there are extensive inter-class occlusions in the detection, the miss rate (MR -2 ) becomes more crucial for evaluating the model.
In Figure 11, the detection results of YOLOv7 and YOLOv7-PD on the CityPersons dataset are displayed. We showcase extreme scenarios with small-scale objects and a high degree of occlusion. YOLOv7 struggles to detect small pedestrian targets and pedestrians with a significant portion of their bodies occluded, resulting in missed detections and false alarms. In contrast, YOLOv7-PD demonstrates exceptional detection capability even for heavily occluded and small-sized pedestrians.

Conclusion
In this paper, we propose three improvement strategies to enhance the performance of YOLOv7 in pedestrian detection. First, we propose the ODConv-based module DE-ELAN, which considers dynamics in the spatial dimension, input channels, and output channels. This module enhances the model's feature-extraction capability, particularly for small-scale and occluded pedestrians, by effectively suppressing interference noise from background features and strengthening critical feature information.

Figure 4
Figure 4 The structure of Light-REFM

Figure 5
Figure 5 Partial data set display of Citypersons

Figure 8
Figure 8 The results obtained by training on the CityPersons dataset

Figure 10
Figure 10 The MR and FPPI curves of the state-of-the-art methods and the proposed YOLOv7-PD on the Caltech test set, presented in four different evaluation settings: a. Reasonable, b. Pedestrian height (0, 31), c. Pedestrian height [31, 81), d. Pedestrian height (80, inf)

Figure 11
Figure 11 The detection results of YOLOv7 and YOLOv7-PD using the CityPersons dataset

Table 2
Outcomes of the various weight ratios between CIoU Loss and NWD on the CityPersons dataset [50]

Table 3
Comparison of results using various loss functions on YOLOv7

Table 4
Comparisons of YOLOv7-PD and other state-of-the-art methods on four Citypersons subsets

Table 5
Comparison between YOLOv7-PD and other one-stage methods on Citypersons dataset

Table 6
Comparing various crowded detection methods using the validation set of CrowdHuman