Helmet Detection Based on Context Enhancement Pyramid Under Surveillance Images

Helmet detection is of great significance for the automated management of industrial safety. To address the insufficient ability of existing object detection methods to detect small helmet objects in surveillance images, this paper proposes a helmet detection method based on a context enhancement pyramid to realize the automatic detection of helmet objects. The method adds a high-resolution detection layer to YOLOv5 to help the network localize small-scale helmet objects. The proposed context enhancement pyramid reduces the semantic differences between features at different scales and generates rich contextual features to enhance the network's discriminative learning ability for small helmet object features. In addition, the proposed multi-scale attention module improves the feature fusion effect in the pyramid network, further capturing multi-scale features and expanding the receptive field to enhance detection precision for helmet objects in surveillance images. Experimental analysis shows that the proposed method achieves good detection performance compared to existing object detection methods on both the Safety Helmet Wearing Dataset (SHWD) and a customized dataset.


Introduction
Due to the rapid development of convolutional neural networks, deep learning based methods have not only achieved remarkable results in the field of computer vision [15,18], but have also been rapidly developed and widely applied in atmospheric monitoring [6,44], wireless transmission [45-47], and health assistance [16,19,26,32]. In the field of industrial safety monitoring, using deep learning based object detection methods to automate helmet detection has become one of the urgent problems to be solved in current industrial safety management.
As we know, in industrial production the helmet can effectively prevent head injuries to workers [39], thus protecting workers' lives [30]. However, supervision of whether workers wear helmets still relies on manual on-site inspection [5]. This is very inefficient for complex construction sites with large numbers of workers and can easily lead to safety accidents. With the development of deep learning based object detection, some scholars have begun to utilize advanced object detection methods, with various improvements, to realize automatic helmet detection [8-9, 12, 20]. However, in specific scenes such as surveillance images, the helmet object has a low imaging resolution and is susceptible to interference from the complex background as well as changes in target color. As a result, existing deep learning based object detection methods lack discriminative learning ability for the features of small-scale helmet objects, which impairs the model's ability to distinguish small helmet objects in complex backgrounds and leads to insufficient detection performance on small helmet objects.
In order to improve the detection precision of small helmet objects in surveillance images, this paper proposes a context enhancement pyramid based helmet detection method. The method uses YOLOv5 [34] as a baseline and adds a high-resolution detection layer on top of it to help the detection network improve the spatial localization of helmet objects. Meanwhile, the proposed context enhancement pyramid structure is utilized to enhance the network's discriminative learning ability for helmet object features. In addition, a multi-scale attention module is used to further refine and capture multi-scale context to help the detection network predict helmet objects. Experimental results show that the proposed method achieves good detection performance compared to mainstream object detection methods on both a publicly available helmet dataset and a custom dataset. The main contributions of this paper are as follows:
(1) Add a high-resolution detection layer to the YOLOv5 network to reduce the feature loss of small helmet objects during downsampling.
(2) Propose a context enhancement pyramid to interactively fuse image features from shallow and deep layers to generate context-rich features, reduce the semantic differences between features at different scales, and improve the discriminative learning ability of the network.
(3) Propose a multi-scale attention module to further capture multi-scale features and expand the receptive field, improving the network's detection precision for small helmet objects.
The remainder of this paper is organized as follows.
Section 2 presents work related to helmet detection. Section 3 describes the baseline YOLOv5 method and the proposed method. Section 4 gives the experimental results and analysis. Section 5 concludes the paper.

Related Work
In recent years, thanks to the rapid development of deep learning in various fields [48], computer vision technology has been widely applied. Deep learning based object detection is now widely used in intelligent manufacturing [25] and industrial automation [17,35,49] due to its excellent detection performance. In industrial production safety monitoring, with the increasing prevalence of video surveillance systems, using object detection to automatically check whether workers are wearing helmets [24] has become increasingly important. Early helmet detection mostly relied on manually designed features [41], but these traditional methods are complicated and inefficient. In recent years, many scholars have instead used advanced object detection methods to automate helmet detection.
Wang et al. [38] propose a real-time helmet wear detection method that introduces the cross-stage partial network CSP [37] and a spatial pyramid pooling structure into the YOLOv3 [29] backbone to improve the learning ability of the network, combining top-down and bottom-up feature fusion strategies to improve helmet detection at construction sites. Yang et al. [42] enhanced the multi-scale feature extraction capability of the YOLOv4 [1] backbone and introduced a channel attention mechanism to dynamically focus on the channel features of helmet objects, improving detection performance for small helmet objects. Fang et al. [7] added an attention mechanism to the YOLOv5 backbone so that the network attends to the region of interest, combined it with the BiFPN [31] structure to better fuse the fine-grained features of the helmet object, and improved detection accuracy with transfer learning. Chen et al. [3] reduced model complexity by introducing a lightweight Ghost module into the backbone and neck of YOLOv5-S and combined it with a BiFPN structure to reconfigure the network for helmet detection. Gao et al. [10] propose a real-time helmet detection method based on YOLOX [11] that strengthens feature extraction by adopting recursive gated convolution and a BiFPN structure, and further improves detection precision by training with the SIoU loss function. Lin et al. [22] added CBAM and super-resolution modules to YOLOX to extract foreground features and optimize image features, and added a detection head for small helmet objects to further improve accuracy. Yu et al. [43] propose an improved helmet detection model based on YOLOv4 that significantly reduces computation by replacing the traditional 3×3 convolutions with depthwise-separable convolutions, combined with multi-scale prediction for real-time helmet detection. Although the above methods improve helmet detection precision to a certain extent and effectively increase detection speed, they do not offer effective solutions for the automated detection of small helmet objects in surveillance images. Detecting small helmet objects in surveillance scenes therefore remains challenging for current mainstream detection methods.

Overview of the YOLOv5
According to the depth and width of the network, the YOLOv5 method comes in four versions: YOLOv5-S, YOLOv5-M, YOLOv5-L, and YOLOv5-X. YOLOv5 continues the network structure of YOLOv4, which consists of a backbone network, a neck structure, and a prediction network. Focus, CSP, and SPP [13] modules are used in the backbone for image feature extraction. The CSP structure enhances the learning capacity of the backbone through cross-stage connectivity, while the SPP module performs multi-scale feature mapping to fuse feature information from different scales. The neck adopts the PANet [23] structure, building a pyramid network with top-down and bottom-up paths that further enhances the network's ability to extract image features. The prediction network detects objects of different sizes from three scales of features extracted by the neck, and computes the object class, confidence score, and predicted bounding box.
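A minimal PyTorch sketch of the SPP idea used in the backbone is shown below; the pooling kernel sizes (5, 9, 13) match common YOLO configurations, while the channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling as used in YOLO-style backbones:
    parallel max-pooling at several kernel sizes (stride 1, padded so
    spatial size is preserved), concatenated with the input so features
    from different scales are fused, then squeezed by a 1x1 conv."""
    def __init__(self, channels, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in pool_sizes
        )
        # 1x1 conv to squeeze the concatenated maps back to `channels`
        self.fuse = nn.Conv2d(channels * (len(pool_sizes) + 1), channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([x] + [p(x) for p in self.pools], dim=1))

x = torch.randn(1, 64, 20, 20)
y = SPP(64)(x)  # spatial size preserved; channels squeezed back to 64
```

Because every pooling branch keeps the spatial resolution, the module fuses multi-scale context without changing the feature map size.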

Network Framework of the Proposed Method
The network structure of the proposed context enhancement pyramid based helmet detection method for surveillance images is shown in Figure 1. The method extends the YOLOv5 network with the proposed context enhancement pyramid and multi-scale attention module. In addition, to accurately localize small-scale helmet objects in surveillance images, a high-resolution detection layer is added to the YOLOv5 network. The context enhancement pyramid reduces the semantic differences between scales and generates rich contextual features to improve the network's discriminative learning ability for helmet objects. The multi-scale attention module further refines the features, establishes multi-scale mappings, and expands the receptive field to improve the network's detection precision for helmets.

High-resolution Detection Layer
The backbone network of YOLOv5 performs multiple down-sampling steps when extracting helmet object features, which tends to gradually lose the spatial location information of helmet objects in the surveillance image and thus reduces the network's ability to localize small helmet objects accurately. Therefore, this paper adds a high-resolution detection layer P2 to the original 3-level detection layers of YOLOv5, as shown in Figure 1, to improve the localization ability of the detection network for helmet objects. Specifically, the shallow image features N2 extracted from the backbone are extended, and the proposed context enhancement pyramid network is utilized to obtain the high-resolution detection layer P2. This enables P2 to retain richer spatial detail of the helmet objects, making the network more sensitive to small-scale helmet objects.
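A minimal sketch of how such a stride-4 P2 prediction layer could be wired is given below; the channel counts, fusion layout, and output size (anchors × outputs) are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class P2Head(nn.Module):
    """Sketch of an extra high-resolution detection layer: a stride-8
    neck feature is upsampled x2 and fused with the stride-4 backbone
    feature N2, then passed to a 1x1 prediction conv. Channel counts
    and num_outputs (3 anchors x 6 outputs) are illustrative."""
    def __init__(self, c_neck=128, c_n2=64, num_outputs=3 * 6):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse = nn.Conv2d(c_neck + c_n2, c_n2, kernel_size=3, padding=1)
        self.pred = nn.Conv2d(c_n2, num_outputs, kernel_size=1)

    def forward(self, neck_p3, backbone_n2):
        p2 = torch.cat([self.up(neck_p3), backbone_n2], dim=1)
        return self.pred(self.fuse(p2))

n2 = torch.randn(1, 64, 160, 160)  # stride-4 feature of a 640x640 input
p3 = torch.randn(1, 128, 80, 80)   # stride-8 neck feature
out = P2Head()(p3, n2)             # stride-4 prediction map
```

The key point is that predictions are made on the 160×160 (stride-4) grid, where small helmets still occupy several cells.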

Context Enhancement Pyramid
The YOLOv5 method adopts the PANet (Figure 2(b)) structure to construct its pyramid network. Although this improves the transfer of small-object feature information over the FPN [21] (Figure 2(a)) structure to a certain extent, the semantic differences between features of different scales are still ignored. The deeper features in the PANet structure must undergo many up-sampling steps to be fused with the shallowest features, and a large amount of abstract semantics is gradually diluted, leaving the shallow features short of semantic information. Meanwhile, the deep features lack sufficient contextual information around the object, preventing precise localization of small-scale objects and leaving the detection network with insufficient discriminative learning ability for helmet objects.
Therefore, inspired by the literature [5,14], this paper proposes the context enhancement pyramid, as shown in Figure 2(c). This structure transfers shallow image features to deeper feature layers and transfers the rich semantics contained in deeper image features to shallower layers, generating rich contextual semantics, reducing the semantic differences between features of different scales in the pyramid network, and guiding the feature construction of the pyramid so as to improve the network's discriminative learning ability for helmet objects. The semantic refinement module is used to eliminate redundant context in the deep image features and refine the helmet object feature information.
(1) Context enhancement pyramid: The context enhancement pyramid injects shallow image features, which contain abundant spatial detail of helmet objects, directly into the deeper feature layers, so that features of smaller-scale helmet objects are not easily lost. Moreover, deep image features containing rich semantics are up-sampled across stages and transferred to the shallow layers to make up for their lack of semantic information, generating rich contextual features that reduce the semantic differences between scales and improve the network's discriminative learning ability for helmet objects. Specifically, to interactively fuse the shallow features with the deeper ones, this paper adds a bottom-up extended path and two top-down cross-stage paths to the PANet structure, so that both the deeper and the shallower feature layers can obtain the required spatial information of the target location as well as sufficient abstract semantics. The fusion uses Concat to preserve as many contextual features as possible.
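The cross-stage Concat fusion described above can be sketched as follows, assuming feature maps at three consecutive pyramid strides; the channel counts are illustrative, and the ×4 nearest-neighbor up-sampling follows the cross-stage paths:

```python
import torch
import torch.nn.functional as F

def cross_stage_fuse(shallow, mid, deep):
    """Concat-based fusion sketch for the context enhancement pyramid:
    the mid-level feature is upsampled x2 and the deep feature x4
    (nearest neighbor, as in the cross-stage paths), then all three are
    concatenated so both spatial detail and abstract semantics reach
    the shallow layer."""
    mid_up = F.interpolate(mid, scale_factor=2, mode="nearest")
    deep_up = F.interpolate(deep, scale_factor=4, mode="nearest")
    return torch.cat([shallow, mid_up, deep_up], dim=1)

shallow = torch.randn(1, 64, 80, 80)
mid = torch.randn(1, 128, 40, 40)
deep = torch.randn(1, 256, 20, 20)
fused = cross_stage_fuse(shallow, mid, deep)  # (1, 448, 80, 80)
```

In the full network a convolution block would follow the Concat to squeeze the channel count back down before the next pyramid stage.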

The cross-stage up-sampling operation Upsample×4 in these extended paths denotes a 4× up-sampling using nearest-neighbor interpolation.
(2) Semantic refinement module (SRM): To prevent the rich semantics originally in the deep image features from being interfered with by redundant background information from the shallow features, which would cause feature confusion and reduce the network's ability to learn the semantic features of the helmet target, this paper proposes a semantic refinement module based on the literature [2], as shown in Figure 3. This module establishes long-range dependencies on the original deep image features and generates spatial attention weights to refine the helmet object information in the image features, improving the network's discriminative learning ability for the helmet object. The SPP module is used to fuse multi-scale spatial features and mine richer small-scale helmet object feature information.
Specifically, the fused feature M5 is first mapped at multiple scales by the SPP module to obtain the feature R5. The deep features N5 extracted by the backbone are transformed by two 1×1 convolutional layers to obtain Q and K, respectively, and Q and K are matrix-multiplied to obtain the similarity matrix (with N = H × W). Max pooling and average pooling are then applied, the pooled maps are fused with a 7×7 convolution, and a Sigmoid yields the spatial attention weights W = σ(f7×7([MaxPool; AvgPool])), which weight the feature R5. The weighted features are then output after element-wise summation with feature R5 and feature M5, respectively, where σ is the Sigmoid function and f7×7 denotes a convolution with kernel size 7 × 7.
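The spatial-attention step of the SRM (channel-wise max and average pooling, a 7×7 convolution, and a Sigmoid) can be sketched as follows; the long-range Q/K similarity branch is omitted, and the exact summation layout with R5 and M5 is a simplifying assumption:

```python
import torch
import torch.nn as nn

class SpatialRefine(nn.Module):
    """Sketch of the SRM spatial-attention step: channel-wise max and
    average pooling are concatenated, fused by a 7x7 convolution, and a
    Sigmoid yields per-pixel weights W = sigma(f7x7([MaxPool; AvgPool]))
    that reweight the SPP output R5 before summation with R5 and M5.
    The Q/K long-range similarity branch is not modeled here."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, r5, m5):
        pooled = torch.cat(
            [r5.max(dim=1, keepdim=True).values,   # channel-wise max pool
             r5.mean(dim=1, keepdim=True)],        # channel-wise avg pool
            dim=1)
        w = torch.sigmoid(self.conv(pooled))       # spatial weights, 1 channel
        return w * r5 + r5 + m5                    # weighted, summed with R5 and M5

r5 = torch.randn(1, 256, 20, 20)
m5 = torch.randn(1, 256, 20, 20)
out = SpatialRefine()(r5, m5)
```

The single-channel weight map broadcasts across all channels of R5, so the refinement is purely spatial.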

Multi-scale Attention Module (MSAM)
In order to improve the effect of feature fusion in the pyramid network and to prevent the region of the image features where the helmet object is located from being affected by redundant information, we propose the multi-scale attention module, as shown in Figure 4(a). This module refines the fused features during the construction of the pyramid network, while capturing the multi-scale information inside image features of different scales and further expanding the receptive field, so as to improve the feature fusion effect in the pyramid network and increase the network's detection precision for helmet objects.
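The text specifies MSAM only at the level of multi-scale capture and receptive-field expansion; one common way to realize such a module is with parallel dilated convolutions, sketched below under that assumption (channel counts and dilation rates are illustrative, not the paper's exact design):

```python
import torch
import torch.nn as nn

class MultiScaleBranch(nn.Module):
    """Illustrative multi-scale module: parallel 3x3 convolutions with
    increasing dilation rates capture context at several receptive
    fields; the branches are concatenated and squeezed by a 1x1 conv.
    This is a generic realization, not the paper's exact MSAM layout."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations
        )
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

x = torch.randn(1, 128, 40, 40)
y = MultiScaleBranch(128)(x)  # same shape, larger effective receptive field
```

Setting padding equal to the dilation rate keeps each branch's output at the input resolution, so the branches can be concatenated directly.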

Multi-scale Attention Module (MSAM)
In order to improve the effect of feature fusion in the pyramid network and avoid the region where the helmet object is located in the image features from being affected by redundant information, we proposed the multi-scale attention module, as shown in Figure 4(a).This module is able to refine the fused features during the construction of the pyramid network, while capturing the multi-scale information inside the image features of different scales and further expanding the receptive field as a way to improve the feature fusion effect in the pyramid network and increase the network's detection precision for helmet objects.Where σ is a Sigmoid function, while 7 7  f × denotes a convolutional operation with a convolutional kernel size of 7 × 7.

Multi-scale Attention Module (MSAM)
High-Resolution Detection Layer
In order to improve the localization of small-scale helmet objects, the pyramid network is utilized to obtain the high-resolution detection layer P2, which enables the detection layer P2 to retain richer spatial detail information of the helmet objects as much as possible, making the network more sensitive in dealing with small-scale helmet objects.
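As a sketch of why the added layer matters: assuming the usual YOLOv5 convention of detection layers P3–P5 at strides 8/16/32 and a 640 × 640 input (both assumptions for illustration), the P2 layer at stride 4 keeps four times as many spatial positions per axis as P3:

```python
def detection_layer_sizes(input_size, strides):
    # spatial resolution of each detection layer for a square input;
    # layer names P2, P3, ... are assigned in order of the given strides
    return {f"P{i}": input_size // s
            for i, s in zip(range(2, 2 + len(strides)), strides)}

# YOLOv5 predicts on P3-P5 (strides 8/16/32); the added P2 layer at
# stride 4 preserves finer spatial detail for small helmet objects
sizes = detection_layer_sizes(640, [4, 8, 16, 32])
```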

Context Enhancement Pyramid
The YOLOv5 method adopts the idea of the PANet network structure (Figure 2(b)) to construct its pyramid network. Although this improves, to a certain extent, the transfer of small-object feature information relative to the FPN [21] structure (Figure 2(a)), the semantic differences that exist between features of different scales are still ignored. The deeper features in the PANet structure need to undergo many up-sampling steps in order to be fused with the shallowest features, and a large amount of abundant abstract semantics is gradually diluted, which in turn causes the shallow features to lack semantic information. Meanwhile, the deep features lack sufficient contextual information around the object, resulting in the inability to precisely localize small-scale objects, which leads to an insufficient discriminative learning ability of the detection network for helmet objects.
Therefore, inspired by the literature [5,14], we propose the context enhancement pyramid, as shown in Figure 2(c). This structure transfers shallow image features to the deeper feature layers and transfers the rich semantics contained in the deep image features to the shallower feature layers, so as to generate rich contextual semantics, reduce the semantic differences between features of different scales in the pyramid network, and then guide the feature construction process of the pyramid network, in order to improve the detection of helmet objects. Among its components, the semantic refinement module is used to eliminate the redundant contexts in the deep image features and refine the helmet object feature information.
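The bidirectional transfer described above can be sketched as follows (a simplified NumPy illustration; the channel counts, nearest-neighbour up-sampling, and stride-2 subsampling are placeholder assumptions, not the paper's exact layers):

```python
import numpy as np

def upsample2x(x):
    # nearest-neighbour 2x spatial up-sampling of a (C, H, W) feature map,
    # standing in for the up-sampling applied to the deep semantic features
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2x(x):
    # stride-2 subsampling, standing in for the bottom-up path that
    # injects shallow spatial detail directly into deeper layers
    return x[:, ::2, ::2]

def concat_fuse(shallow, deep):
    # Concat-style fusion: the deep feature is up-sampled to the shallow
    # resolution and concatenated along the channel dimension, so both
    # spatial detail and abstract semantics are preserved
    return np.concatenate([shallow, upsample2x(deep)], axis=0)

# toy features: shallow (64, 8, 8) and deep (128, 4, 4)
shallow = np.random.rand(64, 8, 8)
deep = np.random.rand(128, 4, 4)
fused = concat_fuse(shallow, deep)   # (192, 8, 8) shallow-level fusion
injected = downsample2x(shallow)     # (64, 4, 4), fed to the deep layer
```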
(1) Context enhancement pyramid: The context enhancement pyramid is able to inject shallow image features, which contain a large amount of helmet object spatial detail, directly into the deeper feature layers, so that smaller-scale helmet object features are not easily lost. Moreover, deep image features containing rich semantics are up-sampled across stages and transferred to the shallow feature layers to make up for their lack of semantic information, generating rich contextual features that reduce the semantic differences between features of different scales and improve the network's discriminative learning ability for helmet objects. Specifically, in order to fuse the shallow features with the deeper ones interactively, this paper adds a bottom-up extended path and two top-down extended cross-stage paths to the PANet network structure, so that both the deeper and shallower feature layers can effectively obtain the required spatial information of the object location as well as sufficient abstract semantics. The fusion approach uses Concat to preserve as many contextual features as possible.

(2) Semantic refinement module (SRM): To avoid the rich semantics originally contained in the deep image features being interfered with by redundant background information from the shallow features, which would cause feature confusion and reduce the network's ability to learn the semantic features of the helmet objects, this paper proposes a semantic refinement module based on the literature [2], as shown in Figure 3. This module establishes long-range dependencies on the original deep image features and generates spatial attention weights to refine the helmet object information in the image features, thus improving the network's discriminative learning ability for the helmet objects. The SPP module is used to fuse multi-scale spatial features to mine richer small-scale helmet object feature information.

Specifically, for the fused feature M5, the SPP module is first utilized for multi-scale mapping to obtain the feature R5. Then, the deep feature N5 ∈ ℝ^(C×H×W) extracted by the backbone is transformed by two 1×1 convolutional layers into Q and K, respectively, and Q and K are matrix-multiplied to obtain the similarity matrix S ∈ ℝ^(N×N), where N = H × W. S is reshaped to E ∈ ℝ^(N×H×W), and E_max and E_avg are obtained using max pooling and average pooling; they are fused using a 7×7 convolution, and a Sigmoid yields the spatial attention weights used to weight the feature R5, whose result is output after element-wise summation with features R5 and M5, respectively. This can be expressed as

A = σ(f^(7×7)([E_max; E_avg])), O5 = A ⊗ R5 + R5 + M5,

where σ is the Sigmoid function, f^(7×7) denotes a convolutional operation with a kernel size of 7 × 7, [·;·] denotes concatenation, and ⊗ denotes element-wise multiplication.
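A minimal NumPy sketch of the final weighting step may help; it is a simplification that starts from already-pooled maps E_max and E_avg, replaces the concatenation with a summation, and omits the SPP and the Q/K similarity branch entirely:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, k):
    # naive 'same'-padded 2-D convolution of a single-channel map x
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def srm_weighting(M5, R5, E_max, E_avg, kernel7):
    # fuse the pooled maps with a 7x7 convolution (summation stands in
    # for the concatenation here), squash to (0, 1) with Sigmoid,
    # weight R5 spatially, then element-sum with R5 and M5
    A = sigmoid(conv2d_same(E_max + E_avg, kernel7))
    return A[None] * R5 + R5 + M5

M5 = np.random.rand(32, 8, 8)
R5 = np.random.rand(32, 8, 8)
E_max = np.random.rand(8, 8)
E_avg = np.random.rand(8, 8)
out = srm_weighting(M5, R5, E_max, E_avg, np.full((7, 7), 1.0 / 49))
```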

Multi-scale Attention Module (MSAM)
In order to improve the effect of feature fusion in the pyramid network and to prevent the region where the helmet object is located in the image features from being affected by redundant information, we propose the multi-scale attention module, as shown in Figure 4(a). This module refines the fused features during the construction of the pyramid network, while capturing the multi-scale information inside the image features of different scales and further expanding the receptive field, so as to improve the feature fusion effect in the pyramid network and increase the network's detection precision for helmet objects.
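One way a small kernel's receptive field can be expanded without extra parameters is dilation, which the module's branches rely on; a toy 1-D NumPy sketch (the kernel values and zero padding are illustrative assumptions):

```python
import numpy as np

def dilated_conv1d(x, k, d):
    # 'same'-padded 1-D convolution with dilation rate d: a length-3
    # kernel covers 2*d + 1 input positions
    r = (len(k) - 1) * d // 2
    xp = np.pad(x, (r, r))
    out = np.zeros(len(x))
    for i in range(len(x)):
        for j, w in enumerate(k):
            out[i] += w * xp[i + j * d]
    return out

# an impulse reveals the receptive field: with dilation 1 the response
# touches 3 adjacent positions, with dilation 3 it spans 7 positions
x = np.zeros(11); x[5] = 1.0
k = np.ones(3)
rf1 = np.nonzero(dilated_conv1d(x, k, 1))[0]   # [4 5 6]
rf3 = np.nonzero(dilated_conv1d(x, k, 3))[0]   # [2 5 8]
```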
Specifically, the multi-scale attention module first refines the features from both the channel and spatial dimensions using CBAM (Figure 4(b)) [40], applying channel attention and spatial attention respectively, so as to filter out the large amount of noise and redundant information contained in the image features and highlight the key helmet object features, which can be described as

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))), M_s(F) = σ(f^(7×7)([AvgPool(F); MaxPool(F)])),

where σ is the Sigmoid function, MLP is the multi-layer perceptron, f^(7×7) is the 7×7 convolution operation, and MaxPool and AvgPool denote the maximum pooling and average pooling processes.
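As a rough sketch, the channel-attention half of CBAM can be illustrated in NumPy as follows (the channel count, reduction ratio, and random MLP weights are placeholder assumptions; a real implementation learns these weights):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    # F: (C, H, W). Global average- and max-pooled channel descriptors
    # pass through a shared two-layer MLP (ReLU hidden layer), are
    # summed, and squashed by Sigmoid into per-channel weights M_c.
    avg = F.mean(axis=(1, 2))                  # (C,)
    mx = F.max(axis=(1, 2))                    # (C,)
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)
    Mc = sigmoid(mlp(avg) + mlp(mx))           # (C,) weights in (0, 1)
    return Mc[:, None, None] * F               # channel-wise reweighting

C, r = 16, 4                                   # channels, reduction ratio
rng = np.random.default_rng(0)
W1 = rng.standard_normal((C // r, C))          # shared MLP weights
W2 = rng.standard_normal((C, C // r))
F = rng.random((C, 8, 8))
refined = channel_attention(F, W1, W2)
```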

Then, for the image features refined by CBAM, adaptive global average pooling is used to aggregate the global semantic information, while the input features are mapped using a 1×1 convolution as well as 3×3 convolutions with different dilation rates, and the output features are subjected to an element-wise summation operation in order to avoid the loss of helmet object feature information. Finally, the output features from the four branches are fused using Concat and a 1×1 convolution, and one long residual edge is used to maintain the integrity of the information inside the image features.

Custom Dataset
To further evaluate the effectiveness of the proposed method in detecting helmet objects in surveillance image scenarios, this paper expands the SHWD with 3000 additional surveillance images. The expanded surveillance images are all from real industrial production environments and are manually labeled.
SHWD
The publicly available Safety Helmet Wearing Dataset (SHWD) [27] has 7581 images containing a total of two categories of labels, hat and person. The person labels are derived from the SCUT-HEAD [28] dataset to simulate unworn helmet objects. In total, there are 9047 hat labels and 111514 person labels. In this paper, we divide the dataset into training and test sets according to a ratio of 8:2, and 10% of the training set is set aside as the validation set.
There are 5457 images in the training set, 607 images in the validation set, and 1517 images in the test set.
The numbers of hat and person labels contained in the training set, validation set, and test set are shown in Table 2.
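The reported split sizes are consistent with an 8:2 split followed by a 10% validation slice; a small sketch (the ceiling rounding is an assumption that happens to reproduce the reported counts):

```python
import math

def split_counts(total, test_frac=0.2, val_frac=0.1):
    # carve off the test split first, then take a validation slice
    # out of the remaining training images; ceiling rounding is an
    # assumption that reproduces the counts reported for SHWD
    test = math.ceil(total * test_frac)
    trainval = total - test
    val = math.ceil(trainval * val_frac)
    return trainval - val, val, test

train, val, test = split_counts(7581)   # SHWD has 7581 images
```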
In addition, since the expanded surveillance images are all from real construction environments, this paper applies mosaic masking in the displayed detection result images to obscure textual information such as the construction location and time shown in the upper left corner of the surveillance images.

Evaluation Metrics
This paper uses average precision (AP), mean average precision (mAP) and frames per second (FPS) as evaluation metrics. AP is denoted as the area under the precision-recall curve, and mAP is denoted as the average of the AP over the two object categories. Precision (P) and recall (R) are defined as

P = TP / (TP + FP), R = TP / (TP + FN),

where TP denotes the number of samples that were detected as correct and were actually positive samples, FP denotes the number of samples that were detected as correct but were actually negative samples, and FN denotes the number of samples that were categorized as negative samples but were actually positive. mAP is calculated using an IoU threshold of 0.5.
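The metrics above can be sketched as follows (a simplified all-point area rule for AP; practical detectors often use interpolated variants):

```python
def precision_recall(tp, fp, fn):
    # precision = TP / (TP + FP), recall = TP / (TP + FN)
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(precisions, recalls):
    # area under the precision-recall curve (simple all-point
    # rectangle rule over increasing recall values)
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * precisions[i]
    return ap

def mean_average_precision(aps):
    # mAP: average of the per-category APs (here: hat, person)
    return sum(aps) / len(aps)

p, r = precision_recall(tp=80, fp=20, fn=20)         # 0.8, 0.8
ap_flat = average_precision([1.0, 1.0], [0.0, 1.0])  # 1.0
```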


Experiments on SHWD
To verify the detection effect of the proposed method for helmet objects, the proposed method is compared with different methods on the SHWD, and the experimental results are shown in Table 3. From Table 3, the proposed method shows good detection results on SHWD. The proposed Ours-S method improves the mAP by 4.36% and 3.14% compared to YOLOv3 and YOLOv4 with an input size of 416 × 416, respectively, and the mAP of Ours-S is improved by 7.11% compared to the CenterNet [50] method with an input size of 512 × 512. Also, the mAP of Ours-S is improved by 1.22% and 0.7% compared to the baseline YOLOv5-S and the advanced YOLOX-S method, respectively. For the Ours-M, Ours-L, and Ours-X methods, the mAP is improved by 1.88%, 1.38%, and 1.45% compared to the baselines YOLOv5-M, YOLOv5-L, and YOLOv5-X, respectively. In addition, Ours-X increases the mAP by 4.12% and 2.68% compared to the YOLOv3 and YOLOv4 methods with an input size of 608 × 608, respectively, by 7.96% compared to the FCOS [33] method, and by 2.87% and 1.87% compared to the advanced YOLOX-X and YOLOv7 [36] methods, respectively. Figure 6(a) shows the training and validation loss curves and the mAP variation curves of the proposed method on SHWD; it can be seen that adopting a more powerful baseline network makes the training and validation losses converge faster and the mAP value increase continuously. Figure 7 shows the detection results of the Ours-X method on the test set.

Experiments on Custom Dataset
To further verify the proposed method's detection effect for small-scale helmet objects under surveillance images, the proposed method is compared with different detection methods on the custom dataset in this paper, and the experimental results are shown in Table 4.
From Table 4, it can be seen that the proposed method achieves good detection results on the custom dataset compared with the other detection methods. Figure 8 demonstrates some of the detection results of Ours-X on the test set, and it can be seen that the proposed method is able to better detect the distant small-scale helmet objects in surveillance images across different construction operation environments.

Ablation Analysis
(1) Impact of different modules: In order to evaluate the effects of the context enhancement pyramid (CEP), the multi-scale attention module (MSAM), and the added high-resolution detection layer on the detection performance of the baseline YOLOv5, we conduct ablation experiments on the custom dataset using YOLOv5-S as the baseline; the results are shown in Table 5, where P2 denotes the added high-resolution detection layer and SRM denotes the semantic refinement module.
From Table 5, it can be seen that adding the high-resolution detection layer P2 to the baseline improves the mAP and the AP of helmet by 1.94% and 2.29%, respectively, indicating that P2 can enhance the network's localization of helmet objects. Adding the CEP containing SRM to the baseline improves the mAP and the AP of helmet by 2.44% and 3.09%, respectively, indicating that CEP can utilize the generated rich context to enhance the network's discriminative learning ability for helmet objects. However, when the CEP without SRM is added, it only enhances the mAP and the AP of helmet by 1.86% and 2.76%, respectively, indicating that without the help of SRM, a large amount of redundant contextual information semantically interferes with the deep image features and reduces the network's ability to learn the helmet objects. When MSAM is added to the baseline, the mAP and the AP of helmet improve by 1.71% and 2.59%, respectively, indicating that MSAM can improve the feature fusion effect during the construction of the pyramid network and increase the detection precision of the network for helmet objects.
In addition, when the CEP containing SRM is added on top of P2, the mAP improves by 3.08%, indicating that the context enhancement pyramid combined with the high-resolution detection layer P2 can further improve the detection of helmet objects. When MSAM is added to P2, the mAP improves by 2.73%, which illustrates the contribution of MSAM to improving the pyramid network construction. Adding the CEP containing SRM together with MSAM to the baseline only improves the mAP and the helmet AP by 1.94% and 2.49%, respectively, indicating that the network cannot adequately localize small-scale helmet objects in the absence of a high-resolution detection layer. In comparison, the proposed method improves the mAP and the AP of helmet by 3.6% and 5.08%, respectively, which further illustrates its effectiveness.

(2) Impact of pre-trained backbone networks: In addition, in order to further evaluate the impact of loading the pre-trained backbone network during training on the experimental results, this paper analyzes the proposed method experimentally on the custom dataset, and the results are shown in Table 6.
From Table 6, it can be seen that the proposed Ours-S, Ours-M, Ours-L, and Ours-X methods improve the mAP values by 8.29%, 9.4%, 11.19%, and 10.2%, respectively, when the pre-trained backbone network is loaded, compared to training without it. Meanwhile, the AP values for the helmet objects increase by 12.84%, 14.65%, 17.03%, and 15.84%, respectively. Therefore, it can be inferred that loading the pre-trained backbone network for transfer learning during training significantly improves the detection performance of the proposed method for helmets.
(3) Impact of training strategies: From Table 7, it can be seen that when the proposed method uses neither the Mosaic data augmentation nor the cosine annealing learning rate strategy, the mAP only reaches 79.35% and the AP of helmet only reaches 74.93%, which indicates that the network does not fully learn the complex feature information of the helmet objects in the dataset during training and does not converge to the optimal solution. When only Mosaic data augmentation is used for training, the mAP reaches 81.66%, which indicates that the Mosaic data augmentation strategy helps the network improve its learning of the helmet objects. When only the cosine annealing learning rate is used, the mAP reaches 83.28%, which indicates that the cosine annealing learning rate helps the network find a better solution during training and improves the training effect and detection performance of the network. In contrast, the proposed method uses both the Mosaic data augmentation and the cosine annealing learning rate training strategy, which raises the mAP to 84.99% and further improves the network's detection performance for helmet objects.

Conclusion
This paper proposes a helmet detection method based on a context enhancement pyramid under surveillance images to realize the automatic detection of helmets in industrial production processes. The method helps the network accurately localize helmet objects by adding a high-resolution detection layer to the YOLOv5 network. Meanwhile, the proposed context enhancement pyramid interactively fuses image features from both shallow and deep layers to enhance the network's discriminative learning ability for helmet object features. In addition, the proposed multi-scale attention module improves the feature fusion effect during the construction of the pyramid network, which further improves the network's detection precision for helmet objects. Experimental results show that the proposed method has good detection performance for helmet objects in surveillance image scenarios compared with mainstream object detection methods. Future work will consider automated helmet detection tasks in more complex environments.

Figure 1
Figure 1 Network framework of the proposed method

Figure 5
Figure 5 Example of labeling

Figure 6
Figure 6 Training and validation loss curves and mAP change curves of the proposed method on SHWD and custom dataset

Figure 7
Figure 7 The detection effect of Ours-X in the SHWD test set

Table 1
Comparison of the number of images in the training set, validation set, and test set in SHWD and Custom dataset

Table 2
Comparison of the number of labels included in the training set, validation set, and test set in SHWD and Custom dataset


Table 3
Different methods for comparison of results on SHWD


Table 4
Different methods for comparing results on custom dataset


Table 5
The results of the ablation experiments

Table 6
Comparison of experimental results on whether to load a pre-trained backbone on custom dataset

Table 7
Comparison of experimental results with hyper-parameters settings on custom dataset