Elderly Fall Detection Algorithm Based on Improved YOLOv5s

Indoor fall detection for the elderly can speed up treatment after a fall, but many existing detection methods are inconvenient to use, have high misjudgement rates, or run slowly. Deep learning methods can effectively address these problems, and YOLOv5s is a deep learning algorithm capable of real-time fall detection. To achieve a more lightweight model with higher detection accuracy, this paper proposes a fall detection algorithm for the elderly based on improved YOLOv5s, called YOLOv5s-GCC. Firstly, the original Conv and C3 structures in the backbone are replaced by GhostConv and C3GhostV2 structures to make the model lightweight, which reduces computation and improves accuracy. Secondly, the lightweight upsampling operator CARAFE is introduced to expand the receptive field for feature fusion and reduce the loss of feature information during upsampling. Finally, the deepest C3 in the neck is integrated with the CBAM attention mechanism, because the deepest neck layer receives the richest feature information, and CBAM improves the efficiency with which the algorithm extracts important information from the feature map. Experimental results show that on the hybrid open-source fall dataset, the mAP@0.5 of YOLOv5s-GCC increases by 1.2% to 0.935, FLOPs decrease by 29.1%, and Params are reduced by 27.5%, giving it obvious advantages over similar object detection algorithms.


Introduction
According to the seventh national census of the National Bureau of Statistics of China, China's population surpassed 1.4 billion in 2020, with an average annual growth rate of 0.53%. Among this population, individuals aged 60 and above exceed 260 million, constituting approximately 18.7% of the national total [2]. With economic and social progress, ensuring the well-being of empty nesters has become increasingly imperative. Data from China's disease surveillance system reveal that falls remain a predominant cause of injury among elderly individuals nationwide. In fact, over 20% of such incidents result in severe injuries for older adults in China; even healthy seniors face a 17% probability of becoming seriously injured due to falls [4]. Consequently, timely detection and notification of falls play a pivotal role in safeguarding the safety of empty nesters.
At present, real-time fall detection methods are mainly divided into three types: wearable fall detection, environmental fall detection, and computer vision fall detection. The main drawback of the first type is that the elderly may forget to wear the device in daily life; the main drawback of the second is the high cost of deploying environmental equipment; and both methods generally suffer from high misjudgement rates.
Fall detection based on computer vision is convenient to use, has a low misjudgement rate and good real-time performance, and suits many scenarios. Its feature extraction algorithms can be divided into three types: threshold analysis, detection based on machine learning, and detection based on deep learning [28].

Related Work
Since Krizhevsky et al. [11] proposed the AlexNet model in 2012, deep learning models using convolutional neural networks (CNNs) for feature extraction have gradually become popular and have been widely applied to fall detection. Object detection algorithms based on deep learning can be divided into one-stage and two-stage approaches. Two-stage algorithms include R-CNN [6], Fast R-CNN [5], Faster R-CNN [18], and R-FCN [3], but their processing pipelines are relatively cumbersome and their detection speed is slow. Representative one-stage algorithms are SSD [13] and the YOLO series [1], [15][16][17], which detect quickly, perform well in real time, and still maintain good accuracy. For example, Wang et al. [24] used the YOLOv3 algorithm and introduced anchor point parameters to detect human falls. Panigrahi et al. [21] proposed an improved lightweight MS-ML-SNYOLOv3 network that obtains better detection results by enlarging the receptive field. Li et al. [12] improved the YOLOv5 network by embedding the SE (Squeeze-and-Excitation [10]) channel attention mechanism: SE applies global average pooling to the channel information of the input feature map, normalizes the compressed information, and multiplies it back onto the input feature map, enhancing the model's ability to capture information about the object of interest and improving detection accuracy. Shen et al. [20] proposed a reparameterized backbone network in which the Conv module is replaced by DBBConv and DBBC3 modules, introduced a new feature enhancement module (FEM) to strengthen the feature representation and feature fusion of the region of interest (ROI), and added the FEM to the feature pyramid network (FPN) to improve detection accuracy. Yang et al. [25] proposed MSF-YOLO to fuse multi-scale image features: compared with the original ResNet unit, the single convolution scale is expanded to four convolution scales, and the features under each receptive field are fused to obtain rich hierarchical information from the image. Zhao et al. [27] proposed a novel attention module, SDI, based on coordinate attention and aliasing attention, which enhances the feature extraction ability for detection targets, and built a novel convolutional neural network model for fall detection in open space, named YOLO-Fall. The above methods achieve good results in fall detection, but problems remain, such as large model volume, high parameter counts, weak feature fusion ability, and insufficiently effective attention mechanisms. An algorithm with a large number of parameters is difficult to deploy on mobile devices for real-time fall detection.
In view of the above problems, this paper proposes an elderly fall detection algorithm based on improved YOLOv5s, which balances fast detection with high accuracy and meets the conditions for deployment on hardware platforms. The main contributions are as follows: (1) Design a lightweight backbone network: following the GhostNetV2 design, replace the bottleneck structure of the original backbone C3 module with the GhostNetV2 bottleneck to build the C3GhostV2 module, and use it together with GhostConv. (2) Introduce the lightweight upsampling operator CARAFE to expand the receptive field during feature fusion and reduce the loss of feature information in upsampling. (3) Integrate the CBAM attention mechanism into the deepest C3 module of the neck to improve the extraction of important information from the feature map.

YOLOv5 Detection Algorithm
YOLOv5 is a classic algorithm of the YOLO series, comprising four models: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. All models are composed of four parts: input, backbone network, neck network, and output detection end. The input receives image data and preprocesses it to meet the model's input requirements, for example by resizing the image to the required dimensions and normalizing it. The backbone network is the core of the model and extracts the feature information of the input image. The neck network, located between the backbone and the detection end, further extracts and integrates the features produced by the backbone to better fit the specific object detection task. The output detection end is the last part of the model; based on the features passed by the neck and the requirements of the detection task, it outputs the object's category, location, confidence, and other information, completing detection. The C3 module improves the model's recognition accuracy in complex scenes through finer-grained feature processing. In addition, its optimized convolutional layer design improves the parameter efficiency of the network, minimizing the parameter count and computational complexity while maintaining or improving feature extraction ability. In the YOLOv5-6.0 version, the SPP structure is replaced by the SPPF structure, which cascades multiple small pooling kernels to fuse feature maps of different receptive fields at a faster running speed. The neck network adopts the FPN+PANet feature pyramid: FPN transfers rich semantic information from deep to shallow layers for top-down feature fusion, and PANet transfers stronger positional information from shallow to deep layers for bottom-up fusion.
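The SPPF structure described above can be sketched in PyTorch. This is a simplified illustration, not the exact YOLOv5 source: the real implementation wraps each convolution in a `Conv` block with batch normalization and SiLU activation, which is omitted here.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Simplified SPPF: three cascaded 5x5 max-pools replace SPP's parallel
    5/9/13 pools, fusing several receptive fields at lower cost."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1, 1)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, 1, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)    # 5x5 receptive field
        y2 = self.pool(y1)   # equivalent to a 9x9 pool
        y3 = self.pool(y2)   # equivalent to a 13x13 pool
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```

Because each pool has stride 1 and matching padding, the spatial size is preserved; only the receptive field grows with each cascade.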
The output detection end contains three detection layers of different sizes, corresponding to the three feature maps of different sizes in the neck network. CIoU loss is used as the loss function to measure the difference between predicted and ground-truth boxes, and non-maximum suppression (NMS) is introduced to filter out repetitive predicted boxes, improving detection efficiency. Because YOLOv5s has the smallest convolution depth and feature map width in the YOLOv5 family, with the least computation and fewest parameters, it is suitable for deployment on mobile devices, in line with the research direction of this paper; the YOLOv5s-6.0 version is therefore selected as the object of improvement. The model structure of YOLOv5s-v6.0 is shown in Figure 1.
To improve detection performance, this paper designs the object detection model YOLOv5s-GCC, with YOLOv5s as the baseline. The improved model is shown in Figure 2.

Lightweight Backbone Design Based on GhostNetV2
Traditional feature networks suffer from redundant feature information and large parameter counts. To address these problems, the current mainstream lightweight networks mainly include MobileNet [9], ShuffleNet [26], and GhostNet [7]. MobileNet splits the standard convolution into depthwise and pointwise convolutions; MobileNetV2 [19] introduces inverted residuals, and MobileNetV3 [8] adds the SE attention mechanism module. ShuffleNet uses a channel shuffle operation to help information flow across feature channels, and ShuffleNetV2 [14] proposes a channel split operation to reduce memory access cost. Results on the ImageNet dataset show that GhostNetV2 leads other networks in classification accuracy, parameter count, and detection speed [22]. Therefore, this paper introduces GhostNetV2 into YOLOv5s and proposes the C3GhostV2 module, which has clear advantages over the traditional C3 module and is used together with GhostConv from GhostNet.
GhostNet is a lightweight network designed by Huawei's Noah's Ark Lab in 2020. It makes full use of limited computing resources to exploit redundant feature information, maintaining the performance of the network model while reducing the number of model parameters. GhostConv, the convolution module in GhostNet, generates sufficient feature information at minimal cost through a series of simple linear operations, improving the network's ability to mine the original information, and can replace ordinary convolution. However, GhostNet has limitations: the depthwise and pointwise convolutions in its linear transformation exchange no information with other pixels, so its ability to capture spatial information is weak. To remedy this shortcoming, Huawei's Noah's Ark Lab released the new lightweight network GhostNetV2 in 2022, which adds a decoupled fully connected (DFC) attention mechanism in parallel with the Ghost convolution, obtaining better accuracy while keeping the network lightweight. Compared with traditional convolution, for a given input feature X ∈ R^(H×W×C) (H, W, and C are the height, width, and number of channels of the feature map), the Ghost module divides the output channels into two parts. The first part is a conventional convolution that strictly controls the number of output layers to generate part of the feature map; the second part generates the remaining feature maps through linear transformations with low computational cost; finally, the two parts are concatenated to form the final feature map, eliminating redundancy and yielding a better lightweight model. In deep learning models, many feature maps are similar and therefore redundant; by reducing this redundancy, the Ghost module reduces both the amount of computation and the number of parameters.
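A minimal PyTorch sketch of the Ghost module idea follows. This is an illustrative re-implementation, not GhostNet's exact code: the 5×5 depthwise convolution stands in for the "cheap linear operation", and the half/half channel split is the common default.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Sketch of the Ghost module: a regular convolution produces half of the
    output channels (intrinsic features); a cheap depthwise convolution
    generates the remaining ghost features from them; the two halves are
    then concatenated."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_hidden = c_out // 2
        # primary convolution: strictly limited number of output channels
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_hidden, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_hidden), nn.ReLU(inplace=True))
        # cheap linear operation: 5x5 depthwise convolution
        self.cheap = nn.Sequential(
            nn.Conv2d(c_hidden, c_hidden, 5, 1, 2, groups=c_hidden, bias=False),
            nn.BatchNorm2d(c_hidden), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)
```

The depthwise branch costs roughly 1/c_in of a full convolution, which is where the parameter and FLOP savings come from.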

Structure of GhostNetV2 Bottleneck and C3GhostV2
In GhostNetV2, the decoupled fully connected attention mechanism (DFC) is introduced: in a low-rank feature map, horizontal and vertical fully connected layers realize an attention map with a global receptive field. The input feature X is sent to two branches: the Ghost branch, which produces the output feature Y, and the DFC branch, which produces the attention matrix A. Finally, the two branches are multiplied elementwise. Formula 1 illustrates this process.

Y = Ghost(X), A = DFC(X), O = Y ⊙ A, (1)

where ⊙ denotes elementwise multiplication.
DFC enables the network to better focus on and process spatial information by introducing fully connected attention layers between feature maps. This approach allows the model to capture spatial relationships on a global scale, rather than only local regions, helping it understand and process the spatial structure and content of images more effectively.

GhostNetV2 parallelizes network computation by grouping channels, which adapts to input data of different sizes with less computing overhead. In addition, low-rank decomposition reduces the number of redundant parameters while preserving model accuracy. In the backbone network, we use the GhostConv and C3GhostV2 structures to remain lightweight without losing accuracy.
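The DFC branch can be sketched as follows. This is an illustrative approximation of the mechanism described above: attention is computed on a downsampled map, with horizontal then vertical depthwise convolutions acting as decoupled fully connected layers; the kernel size and pooling factor shown here are assumptions for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFCAttention(nn.Module):
    """Sketch of GhostNetV2's decoupled fully connected (DFC) attention:
    a horizontal (1 x k) and a vertical (k x 1) depthwise convolution give a
    cross-shaped, effectively global receptive field at low cost."""
    def __init__(self, channels, k=5):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, 1, bias=False)
        self.horizontal = nn.Conv2d(channels, channels, (1, k),
                                    padding=(0, k // 2), groups=channels, bias=False)
        self.vertical = nn.Conv2d(channels, channels, (k, 1),
                                  padding=(k // 2, 0), groups=channels, bias=False)

    def forward(self, x):
        h, w = x.shape[-2:]
        a = F.avg_pool2d(x, kernel_size=2)            # compute attention at half resolution
        a = self.vertical(self.horizontal(self.proj(a)))
        a = torch.sigmoid(a)                          # attention matrix A in [0, 1]
        return F.interpolate(a, size=(h, w), mode='nearest')

# Combined with the Ghost branch output Y, the block output is the
# elementwise product: O = Y * DFCAttention(C)(X), as in Formula 1.
```

Computing the attention at half resolution and upsampling it back is what keeps the extra cost small.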

Upsampling Operator CARAFE
YOLOv5 uses nearest neighbor interpolation for upsampling by default, which determines the upsampling kernel solely from the spatial position of each pixel. It ignores the semantic information of the feature map, which affects the localization and recognition of targets, cannot achieve the optimal detection effect, and has a small receptive field. To address these problems, this paper adopts the lightweight and efficient upsampling operator CARAFE [23], which stays lightweight with few parameters and little computation. Compared with nearest neighbor interpolation, it has three main advantages: 1. a large receptive field, able to aggregate context information; 2. good content awareness, dynamically generating adaptive kernels rather than applying one fixed kernel to all samples (as deconvolution does), supporting instance-specific processing; 3. light weight and fast computation. CARAFE introduces little computational overhead and can be easily integrated into modern network architectures.
CARAFE consists of two main modules: the upsampling kernel prediction module and the feature reorganization module. Assume the upsampling rate is σ and the input feature map has shape H×W×C. In the kernel prediction module, the input feature map X first undergoes channel compression through an ordinary convolution, producing a compressed map Y of size H×W×C_m, which reduces the network's computation.
Let the upsampling kernel size be k_up×k_up. Combining the input size and the upsampling rate σ, the predicted upsampling kernels are obtained through a convolution operation, with overall size σH×σW×k_up×k_up. Finally, softmax normalizes each sampling kernel so that its weights sum to 1. For any target position l′(i′, j′) in the output map X′, there is a corresponding position l(i, j) in the original feature map, with the mapping i = ⌊i′/σ⌋, j = ⌊j′/σ⌋. Let N(X_l, k) denote the k×k neighborhood of X centered at l. The kernel prediction module ψ predicts the position kernel W_l′ for each position l′ from X_l. Formula 2 illustrates this process.

W_l′ = ψ(N(X_l, k_encoder)). (2)
In the feature reorganization module, for the compressed map Y, a k_up×k_up region centered at the mapped position is taken and convolved with the predicted upsampling kernel; finally, the output feature map X′ of size σH×σW×C is obtained. Formula 3 illustrates this process.

X′_l′ = Σ_{n=−r}^{r} Σ_{m=−r}^{r} W_l′(n, m) · X_(i+n, j+m), (3)

where X′_l′ is the value at position l′(i′, j′) of the new feature map X′, W_l′ is the predicted convolution kernel, and X is the original feature map; the summation runs over the region from −r to r centered at (i, j), with r = ⌊k_up/2⌋.
In summary, CARAFE offers a more advanced upsampling method than the default nearest neighbor interpolation in YOLOv5s. Its main advantages are the ability to generate smoother feature maps with richer semantic information, which helps improve the detection accuracy of small objects and the precision of boundary localization. Additionally, CARAFE employs content-aware weighting, allowing the reconstructed feature maps to reflect the detailed characteristics of the input data more accurately.
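To make the two stages concrete, here is a simplified PyTorch sketch of CARAFE. It is an illustrative re-implementation, not the authors' code (the official CARAFE ships an optimized CUDA kernel); the compressed width `c_mid` and the nearest-neighbour replication used to map each input neighbourhood to its σ² output positions are implementation choices for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Sketch of CARAFE: a kernel-prediction module generates a content-aware
    k_up x k_up kernel for every output position, and a reassembly step applies
    it to the matching k_up x k_up input neighbourhood."""
    def __init__(self, c, c_mid=64, scale=2, k_up=5, k_encoder=3):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(c, c_mid, 1)        # channel compression C -> C_m
        self.encoder = nn.Conv2d(c_mid, scale * scale * k_up * k_up,
                                 k_encoder, padding=k_encoder // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        s, k = self.scale, self.k_up
        # 1. kernel prediction: one k x k kernel per upsampled position
        kernels = F.pixel_shuffle(self.encoder(self.compress(x)), s)  # b, k*k, sH, sW
        kernels = F.softmax(kernels, dim=1)           # each kernel's weights sum to 1
        # 2. feature reassembly: gather the k x k neighbourhood of every input pixel
        patches = F.unfold(x, k, padding=k // 2)      # b, c*k*k, h*w
        patches = patches.view(b, c * k * k, h, w)
        # replicate each neighbourhood to its s*s output positions (l = floor(l'/s))
        patches = F.interpolate(patches, scale_factor=s,
                                mode='nearest').view(b, c, k * k, s * h, s * w)
        # weighted sum over the k*k window at every output position (Formula 3)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)
```

With the paper's setting k_encoder = 3, k_up = 5, each 2× upsampled pixel is a softmax-weighted blend of a 5×5 input neighbourhood rather than a copy of its nearest neighbour.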

C3CBAM Attention Mechanism
The attention mechanism module can effectively improve network efficiency by selecting informative features, estimating the importance of different information, weakening useless information, and strengthening important information. Factors such as image occlusion, complex backgrounds, and the small proportion of the image occupied by distant targets seriously affect detection accuracy in real scenes. To address these problems, we introduce an attention mechanism. By weighting important features, attention mechanisms can identify and emphasize target objects even when they are partially occluded or blended with the surrounding environment, and can effectively distinguish foreground targets from background noise, significantly improving the quality and reliability of detection results. Therefore, applying attention mechanisms in object detection models such as YOLOv5s is crucial to improving performance in complex visual environments.

CBAM (Convolutional Block Attention Module) contains two submodules, CAM (Channel Attention Module) and SAM (Spatial Attention Module), providing channel and spatial attention respectively. The channel attention module keeps the channel dimension, compresses the spatial dimension, and focuses on the meaningful information in the input image. The spatial attention module keeps the spatial dimension, compresses the channel dimension, and focuses on the location information of the target.

Structure of CBAM attention and C3CBAM module
In this paper, the CBAM module is introduced into the C3 module to form the C3CBAM module, which improves the network's ability to extract features of the detection target.
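A compact PyTorch sketch of CBAM's two submodules follows, using the standard formulation; the reduction ratio 16 and the 7×7 spatial kernel are the usual defaults, assumed here rather than taken from the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM: compress the spatial dimension (global average + max pooling),
    then reweight each channel."""
    def __init__(self, c, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(c, c // reduction, 1, bias=False), nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, 1, bias=False))

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return x * torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """SAM: compress the channel dimension, then reweight each spatial position."""
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention."""
    def __init__(self, c, reduction=16, k=7):
        super().__init__()
        self.cam = ChannelAttention(c, reduction)
        self.sam = SpatialAttention(k)

    def forward(self, x):
        return self.sam(self.cam(x))
```

In a C3CBAM-style module, a block like this would be applied inside the C3 bottleneck so the refined features feed the subsequent convolutions.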

Dataset and Evaluation Index
The experimental dataset is a mixture of three public datasets: the UR Fall Detection Dataset, the Fall Detection Dataset (2017 IAPR MVA Conference), and the Multiple Cameras Fall Dataset, with a total of 3502 images.
The Multiple Cameras Fall Dataset: This dataset contains 24 scenes recorded using 8 IP cameras. The dataset focuses on using a multi-camera system to improve the accuracy of fall detection. By collecting data from different angles, it aims to address the perspective limitations that a single camera may encounter. This setup helps to generate a more comprehensive view that can provide more details about fall events, allowing fall detection algorithms to work more accurately in complex environments.
COCO Dataset: This is a widely used large-scale image dataset containing more than 200,000 images, dedicated to computer vision tasks such as object detection, human keypoint detection, and image description. The dataset was released by the Microsoft team to promote research and development of scene understanding technology. COCO provides images of objects in their natural environment, emphasizing the interaction between different objects and the overall understanding of the scene.
By combining these datasets, the strength and robustness of the fall detection system can be significantly improved. Key advantages include: 1. These datasets contain data collected from different environments, which improves the generalization ability of the model in various scenarios. In particular, the Multiple Cameras Fall Dataset increases the perspective diversity of the data by providing multiple camera views.

Experimental Environment
The experimental computer's processor is an Intel(R) Core(TM) i7-12650H CPU @ 2.30 GHz, the GPU is an NVIDIA A100 80GB PCIe, the operating system is Ubuntu 20.04.2, the deep learning framework is PyTorch 2.0.1, the Python version is 3.8.0, and the CUDA version is 11.7.
The model training parameters are set as follows: batch size 32, 100 epochs, input image size (imgsz) 640, initial learning rate 0.01, weight decay coefficient 0.0005, and momentum 0.937. SGD is used as the optimizer for iterative training.
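For reference, the optimizer settings above map directly onto PyTorch. The model used here is a hypothetical stand-in, since the full YOLOv5s-GCC definition is outside this snippet.

```python
import torch

# Hypothetical stand-in module; the paper trains YOLOv5s-GCC.
model = torch.nn.Conv2d(3, 16, 3)

# SGD with the hyperparameters listed above:
# lr 0.01, momentum 0.937, weight decay 0.0005.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=0.0005)
```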

Experimental Evaluation Index
In this paper, average precision (AP) is used as the evaluation index for each class, and mean average precision (mAP) is used to evaluate the performance of the entire network model. AP is the area under the precision-recall (PR) curve, i.e. the average of the precision values at different recall points. mAP@0.5 is the mean of the AP values of all classes at an IoU threshold of 0.5. mAP@0.5:0.95 is the mean mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05, which accounts for both the precision P and the recall R of object detection. Precision is the proportion of results predicted as positive that are correct, and recall is the proportion of all positive samples that are correctly predicted as positive. The number of parameters (Params) of the model measures its complexity: the fewer the parameters, the lighter the model. The FLOPs index measures computational cost: the lower the FLOPs, the more computationally efficient the model. The corresponding formulas are:

P = P_T / (P_T + P_F), (4)

R = P_T / (P_T + N_F), (5)

AP = ∫_0^1 P(R) dR, (6)

mAP = (1/m) Σ_{i=1}^{m} AP_i, (7)

where P_T is the number of positive samples that are correctly predicted, P_F is the number of samples that are wrongly predicted as positive, N_F is the number of positive samples that are wrongly predicted as negative, and m is the number of detected classes.
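The precision, recall, and mAP definitions can be checked with a small Python sketch (using P_T, P_F, and N_F as defined above; per-class AP values here are illustrative inputs, not results from the paper):

```python
def precision_recall(p_t, p_f, n_f):
    """Precision and recall from the counts above: p_t = correctly predicted
    positives (TP), p_f = wrongly predicted positives (FP),
    n_f = wrongly predicted negatives (FN)."""
    precision = p_t / (p_t + p_f)
    recall = p_t / (p_t + n_f)
    return precision, recall

def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class AP values (Formula 7)."""
    return sum(ap_per_class) / len(ap_per_class)

# Example with made-up counts: 80 TP, 20 FP, 40 FN.
p, r = precision_recall(80, 20, 40)
```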

Experimental Analysis of Lightweight Improvement
In this paper, the original YOLOv5s model is improved by replacing the ordinary convolution and C3 modules with GhostConv and C3GhostV2 modules. To verify whether different replacement schemes effectively reduce the model's parameter count, and to explore their impact on accuracy, four sets of experiments are designed: YOLOv5s, YOLOv5s-all-Ghost, YOLOv5s-backbone-Ghost, and YOLOv5s-neck-Ghost. The model in which all Conv and C3 modules are replaced by GhostConv and C3GhostV2 is named YOLOv5s-all-Ghost; the model in which only those in the backbone are replaced is named YOLOv5s-backbone-Ghost; and the model in which only those in the neck are replaced is named YOLOv5s-neck-Ghost. The comparison results are shown in Table 1.
As can be seen from Table 1, after the model replaces GhostConv and C3GhostV2 modules completely, the parameter count and FLOPs of the improved model are reduced by 42.74% and 43.04% respectively, the detection accuracy is improved by 0.3%, and the generalization ability of the model is good. After only the backbone replaces GhostConv and C3GhostV2 modules, the parameter count and FLOPs are reduced by 24.93% and 29.75% respectively, the detection accuracy is improved by 0.6%, and the model has the best generalization ability.
After only the YOLOv5s neck replaces GhostConv and C3GhostV2 modules, the parameter count and FLOPs of the improved model are reduced by 17.81% and 13.92% respectively, but the detection accuracy drops by 0.4% and the generalization ability is poor. In order to obtain the best detection accuracy and generalization ability while still reducing a substantial number of parameters, this paper adopts the scheme of replacing GhostConv and C3GhostV2 modules only in the YOLOv5s backbone network, and the resulting model is named YOLOv5s-G.
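As a reference, the Ghost convolution idea behind these replacements can be sketched in PyTorch. This is a minimal sketch, not the exact YOLOv5s-G configuration: the 5×5 cheap-branch kernel and the SiLU activation are assumptions.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Minimal sketch of a Ghost convolution: an ordinary convolution produces
    half of the output channels, and a cheap depthwise convolution derives the
    remaining 'ghost' feature maps from them, roughly halving the FLOPs of a
    full convolution of the same output width."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_half = c_out // 2
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.cheap = nn.Sequential(  # depthwise 5x5: groups == channels
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

out = GhostConv(64, 128)(torch.randn(1, 64, 32, 32))  # -> (1, 128, 32, 32)
```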

Experimental Analysis of the Improvement of the Upsampling Operator
In order to verify the effectiveness of the upsampling operator CARAFE, the algorithm that applies CARAFE on top of YOLOv5s-G is called YOLOv5s-GC. The CARAFE operator replaces the nearest-neighbor interpolation of the original neck network to obtain a higher-quality upsampled feature map. In [23], Wang et al. report that k_encoder = k_up − 2 achieves the best performance for a similar amount of computation, and that performance can only be improved further by increasing both parameters together, which also increases the computation. Therefore, we set k_encoder = 3 and k_up = 5 for the CARAFE operator, seeking to improve model accuracy as much as possible at a given computational cost. The comparison results are shown in Table 2.
As can be seen from Table 2, after using the CARAFE operator, both mAP@0.5 and mAP@0.5:0.95 are improved compared with the baseline using nearest-neighbor interpolation: mAP@0.5 by 0.1% and mAP@0.5:0.95 by 1.1%, indicating that the average mAP at different IoU thresholds is improved. Precision is improved by 3%, indicating that the accuracy of the model improves to some extent. The experimental results show that the CARAFE operator better captures the spatial relationships between features, making the model more accurate.
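The CARAFE operator with the paper's settings (scale 2, k_encoder = 3, k_up = 5) can be sketched as below. This is a naive reference implementation, not the optimized CUDA kernel; the compression width c_mid is an assumed value. A small convolution predicts a normalized k_up × k_up reassembly kernel for every output pixel, and each upsampled value is a content-aware weighted sum over the corresponding input neighbourhood.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Naive sketch of CARAFE upsampling (content-aware reassembly)."""
    def __init__(self, c, c_mid=64, scale=2, k_enc=3, k_up=5):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(c, c_mid, 1)                  # channel compressor
        self.encode = nn.Conv2d(c_mid, scale ** 2 * k_up ** 2,  # content encoder
                                k_enc, padding=k_enc // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # 1) kernel prediction: one softmax-normalized k_up^2 kernel per output pixel
        k = F.pixel_shuffle(self.encode(self.compress(x)), self.scale)
        k = F.softmax(k, dim=1)                                 # (b, k_up^2, 2h, 2w)
        # 2) reassembly: weighted sum over each pixel's k_up x k_up neighbourhood
        p = F.unfold(x, self.k_up, padding=self.k_up // 2).view(b, c * self.k_up ** 2, h, w)
        p = F.interpolate(p, scale_factor=self.scale, mode="nearest")
        p = p.view(b, c, self.k_up ** 2, h * self.scale, w * self.scale)
        return (p * k.unsqueeze(1)).sum(dim=2)

y = CARAFE(c=64)(torch.randn(1, 64, 16, 16))  # -> (1, 64, 32, 32)
```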

Analysis of C3CBAM Replacement Positions
In order to verify the effectiveness of the C3CBAM module with the improved attention mechanism, the algorithm that applies C3CBAM on top of YOLOv5s-GC is called YOLOv5s-GCC. In addition, to explore the optimal embedding position of C3CBAM, this paper fuses CBAM at the four C3 modules of the neck network, labeled CBAM_A, CBAM_B, CBAM_C and CBAM_D from shallow to deep. The other parts remain unchanged, and a comparison experiment is conducted against the YOLOv5s-GC algorithm. The comparison results are shown in Table 3, where "√" indicates the use of an improved method. As can be seen from Table 3, not every fusion position of the CBAM module improves the detection effect: after fusing at the deepest CBAM_D position, mAP@0.5 increases by 0.5%, which is the best effect, while fusion at the other positions performs worse.
The fusion works best when CBAM is integrated at the top of the PANet, i.e. at CBAM_D. We attribute this result mainly to the fact that the deepest part of the neck receives the most abundant feature information, allowing the network to more effectively encapsulate it into the fusion features used for prediction.
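The CBAM block fused into the deepest C3 can be sketched in PyTorch as follows. This is a minimal sketch of the standard CBAM block only (not the full C3CBAM module); the reduction ratio r = 16 and the 7×7 spatial kernel are the usual defaults and are assumed here.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention re-weights feature channels from pooled descriptors,
    then spatial attention highlights pixel regions decisive for classification."""
    def __init__(self, c, r=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(c, c // r, 1), nn.ReLU(),
                                 nn.Conv2d(c // r, c, 1))   # shared MLP
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        # channel attention: shared MLP over global average- and max-pooled features
        avg = self.mlp(x.mean((2, 3), keepdim=True))
        mx = self.mlp(x.amax(2, keepdim=True).amax(3, keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # spatial attention: 7x7 conv over channel-wise average and max maps
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], 1)
        return x * torch.sigmoid(self.spatial(s))

out = CBAM(256)(torch.randn(1, 256, 20, 20))  # shape preserved
```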

Comparison to Other Attention Mechanisms
In order to evaluate the improvement brought by the CBAM attention mechanism selected in this paper, we fused the SE and ECA attention mechanisms at the same C3 position as C3CBAM in YOLOv5s-GC. The SE mechanism focuses on feature weighting between channels: it learns the importance of each channel and adjusts channel weights accordingly, enhancing important features and suppressing those irrelevant to the current task. However, the two fully connected layers of the SE mechanism reduce the channel dimension, which can lose channel information. The ECA mechanism removes the fully connected layers and uses a one-dimensional convolution to complete the information interaction between channels. Both of these attention mechanisms only focus on channel information, whereas CBAM introduces two analysis dimensions, spatial attention and channel attention, at the same time: it not only handles the allocation of feature-map channels through channel attention, but through spatial attention also pays more attention to the pixel regions that are decisive for classification while ignoring irrelevant regions.
The comparison results are shown in Table 4.
As can be seen from Table 4, integrating the CBAM attention mechanism at the same position yields the maximum mAP, 0.5% higher than YOLOv5s. Moreover, compared with the other attention mechanisms, the model requires less computation and far fewer parameters while achieving a larger accuracy gain, indicating that CBAM improves model performance better through attention in both the spatial and channel dimensions.
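For contrast with CBAM above, the two channel-only baselines can be sketched as follows. These are minimal reference sketches; the reduction ratio r = 16 for SE and the 1-D kernel size k = 3 for ECA are common defaults and are assumptions here.

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    """SE attention: two 1x1-conv 'FC' layers with reduction r; the squeeze
    to c/r channels is where channel information can be lost."""
    def __init__(self, c, r=16):
        super().__init__()
        self.fc = nn.Sequential(nn.Conv2d(c, c // r, 1), nn.ReLU(),
                                nn.Conv2d(c // r, c, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x.mean((2, 3), keepdim=True))

class ECA(nn.Module):
    """ECA attention: the FC layers are replaced by a single 1-D convolution
    across the pooled channel descriptor, so no channel reduction is needed."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        w = x.mean((2, 3))                        # (b, c) global descriptor
        w = self.conv(w.unsqueeze(1)).squeeze(1)  # local cross-channel interaction
        return x * torch.sigmoid(w)[..., None, None]

x = torch.randn(1, 64, 10, 10)
assert SE(64)(x).shape == x.shape and ECA()(x).shape == x.shape
```

Neither module looks at where in the image the informative pixels are, which is exactly the gap the spatial branch of CBAM fills.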

Ablation Experiments
In this paper, improvements are made on the basis of the YOLOv5s model: the lightweight backbone improvement, the upsampling operator CARAFE improvement, and the fusion of the CBAM attention mechanism into the C3 module. In order to fully verify the effectiveness of the proposed improvements, ablation experiments are conducted on the mixed dataset to verify the importance of each one. Each improvement is embedded into the YOLOv5s model in turn, and the same training parameters and environmental conditions are used in every set of experiments. The experimental results are shown in Table 5, where "√" indicates that a given improvement method is used.
It can be seen from Table 5 that the detection performance of the baseline YOLOv5s model is the lowest. After the lightweight backbone improvement, FLOPs and Params are greatly reduced, and mAP@0.5 improves by 0.6%. We attribute this simultaneous lightening and accuracy gain to the DFC attention mechanism introduced in GhostNetV2: DFC attention attends to both global and local information by dynamically adjusting convolution weights to gather information from different locations, enabling the network to capture feature representations rich in context very efficiently without significantly increasing parameters or computation, thereby improving detection accuracy. On this basis, after the CARAFE upsampling improvement, although the reassembly and weighting operations of CARAFE increase the computation and parameters of the model, mAP@0.5 and mAP@0.5:0.95 improve by 0.1% and 1.1% respectively. Finally, fusing the CBAM attention mechanism yields the highest AP for fall, reaching 0.846; although mAP@0.5:0.95 decreases slightly, mAP@0.5 reaches the highest value of 0.935. The experimental results also show that after fusing CBAM, the computation and parameters of the model are lower than before the fusion, with Params reaching the lowest value of 5.10M and FLOPs only 11.2G, achieving high-precision fall detection in complex environments with the best overall effect. These data illustrate that the CBAM module helps reduce the number of parameters by adaptively recalibrating the feature map so that the model focuses on important features and reduces redundancy; by integrating the attention mechanism, the model can also allocate computing resources efficiently by focusing on the relevant features.
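The DFC attention credited above can be sketched as follows. This is a simplified reading of GhostNetV2's decoupled fully connected attention, not its exact implementation: horizontal and vertical strip depthwise convolutions approximate fully connected attention along rows and columns, and computing the gate at half resolution (as here) further cuts the cost. The strip kernel size k = 5 is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFCAttention(nn.Module):
    """Sketch of DFC attention: cheap long-range context via strip convolutions."""
    def __init__(self, c, k=5):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c),
            # horizontal strip depthwise conv: mixes information along each row
            nn.Conv2d(c, c, (1, k), padding=(0, k // 2), groups=c, bias=False),
            nn.BatchNorm2d(c),
            # vertical strip depthwise conv: mixes information along each column
            nn.Conv2d(c, c, (k, 1), padding=(k // 2, 0), groups=c, bias=False),
            nn.BatchNorm2d(c))

    def forward(self, x):
        a = F.avg_pool2d(x, 2)                    # gate computed at half resolution
        a = F.interpolate(self.gate(a), size=x.shape[2:], mode="nearest")
        return x * torch.sigmoid(a)               # re-weight the Ghost features

out = DFCAttention(64)(torch.randn(1, 64, 32, 32))  # shape preserved
```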

Visual Comparison of Detection Effect of Three Types of Tags Before and After Improvement
It can be seen from Table 5 that, comparing the per-label AP of YOLOv5s and YOLOv5s-GCC before and after the improvement, the AP of the stand label decreased by 0.2% but still reached 0.987, indicating that stand recognition is very accurate both before and after the improvement, with little difference. As for this slight drop in accuracy on the stand label, our analysis is that adding an attention mechanism usually improves the overall performance of the model, but in some specific cases it may slightly affect one or several labels: for labels that already have high accuracy, the model may have learned enough features for that particular task without attention.
Introducing attention risks making the model focus too much on certain features and ignore other equally important information. Since our fall detection task is not particularly strict about standing, the impact on this non-critical label is negligible, so the slight drop in accuracy should have no significant negative effect on the final results. In contrast, the AP of the fall label and the sit label increased by 2.3% and 1.4% respectively, indicating that the model's ability to recognize fall and sit is greatly improved. The comparison of the detection results is as follows. As can be seen from Figure 10, in different scene environments, the improved model YOLOv5s-GCC still maintains a high recognition ability for stand. As can be seen from Figure 11, in different scenarios, the improved model YOLOv5s-GCC greatly improves the recognition of fall compared with YOLOv5s.
In the first scenario, the detection confidence of fall by YOLOv5s is 0.79, while that of YOLOv5s-GCC is 0.84. In the second scenario, the detection confidence of fall by YOLOv5s is 0.81, while that of YOLOv5s-GCC is 0.91. The improved model therefore detects fall better than the original model. As can be seen from Figure 12, in different scenarios, the improved model YOLOv5s-GCC also improves the recognition of sit compared with YOLOv5s. In the first scenario, the detection confidence of sit by YOLOv5s is 0.89, while that of YOLOv5s-GCC is 0.91. In the second scenario, the person is in the stand state and about to fall, and should not be detected as sit; yet YOLOv5s detects sit with confidence 0.63 and stand with confidence 0.53, while YOLOv5s-GCC detects only stand, with confidence 0.74. The improved model is better than the original model at detecting sit.

Comparative Experiments
Compared with traditional models, the YOLO series achieves higher detection accuracy and speed. In order to verify the effectiveness of the performance improvements, we trained the YOLOv3 and YOLOv3-tiny algorithms on the same dataset with the same training parameters for comparison with YOLOv5s-GCC. YOLOv3-tiny is a lightweight version of YOLOv3 designed for scenarios requiring higher speed and smaller model size. The comparison results are shown in Table 6.
As can be seen from Table 6, YOLOv5s-GCC has the highest mAP@0.5, 1.3% higher than YOLOv3 and 2.2% higher than YOLOv3-tiny; the smallest weight size, only 10.1M, 91.8% lower than YOLOv3 and 42.3% lower than YOLOv3-tiny; and the lowest computation, only 11.2G, 92.8% lower than YOLOv3 and 13.2% lower than YOLOv3-tiny. This indicates that the improved model YOLOv5s-GCC achieves higher detection accuracy and speed for fall detection, and its smaller weight file is conducive to deployment on other hardware platforms. The comparison of the mAP@0.5 curves during training is as follows: as can be seen from Figure 13, the mAP@0.5 of YOLOv5s-GCC is higher than that of the other two models over most of the last 40 training rounds, reflecting the effectiveness of the improvement.

Conclusion
In order to solve the problem that elderly people who fall indoors cannot be found in time, this paper proposes a fall detection model for the elderly based on improved YOLOv5s. Compared with YOLOv5s, the improved YOLOv5s-GCC achieves a 1.2% higher mAP@0.5, reaching 0.935; FLOPs are 29.1% lower, reduced to 11.2G; and Params are 27.5% lower, reduced to 5.09M. While significantly reducing the computation and parameters of the model, it effectively improves detection accuracy and provides a new idea for computer-vision-assisted indoor fall detection.
The follow-up work can be carried out around 3D indoor fall detection, such as building a spatial coordinate system to achieve more accurate fall detection through the change of human joint spatial coordinates.In the development and evaluation of fall detection models, it is necessary to ensure that the privacy of participants is protected, and any data collection and use must comply with ethical standards.We believe that privacy protection technologies can be explored, such as differential privacy mechanism and data desensitization technology, to reduce the infringement of personal privacy.

Figure 3 Figure 4
Figure 3 Comparison between Ghost module and traditional convolution

Figure 6
Figure 6 Structure of CARAFE

Figure 7
Figure 7 Structure of CBAM attention and C3CBAM module

5. Experiment and Analysis
5.1 Dataset and Evaluation Index
The experimental dataset is a mixture of three public datasets: UR Fall Detection Dataset, Fall Detection Dataset (2017 IAPR MVA Conference), and Multiple Cameras Fall Dataset.


Figure 10
Figure 10 Comparison of stand test results before and after improvement

Figure 12
Figure 12 Comparison of sit test results before and after improvement

Figure 13
Figure 13 mAP@0.5 change curve comparison

UR Fall Detection Dataset: This dataset was developed by the University of Rochester to support research in fall detection and everyday behavior recognition. It contains videos from different angles and in different lighting environments, covering a variety of fall and non-fall scenarios. Individuals in the videos perform various activities such as walking, running, sitting, and falling, providing diverse data for model training and testing. Fall Detection Dataset (2017 IAPR MVA Conference): This dataset was presented at the 15th IAPR International Conference on Machine Vision Applications in 2017 and was designed specifically for fall detection research. The images in the dataset are recorded in 5 different rooms from 8 different view angles. There are 5 different participants: two male participants aged 32 and 50, and three female participants aged 19, 28 and 40. All the activities of the participants represent 5 different categories of poses: standing, sitting, lying, bending and crawling. There is only one participant in each image.

Table 1
Comparison of lightweight improvement effect in different positions

Table 2
Comparison results of upsampling operator

Table 3
Comparison of improvement effects of C3CBAM modules in different positions

Table 4
Comparison of the improvement effect of different attention mechanism modules

Table 5
Comparison of ablation experiments

Class loss (cls_loss) measures the classification of the target, that is, the category to which the target belongs. train/loss denotes the mean loss on the training set, and val/loss the mean loss on the validation set. Ideally, both losses are small and close to each other, which means the model not only fits the training data but also generalizes well.

Table 6
Comparative experiments