Optimization of Human Posture Recognition Based on Multi-view Skeleton Data Fusion

This research introduces a novel method for fusing multi-view skeleton data to address the limitations encountered by a single vision sensor in capturing motion data, such as skeletal jitter, self-pose occlusion, and the reduced accuracy of three-dimensional coordinate data for human skeletal joints due to environmental object occlusion. Our approach employs two Kinect vision sensors concurrently to capture motion data from distinct viewpoints, extracts skeletal data, and subsequently harmonizes the two sets of skeleton data into a unified world coordinate system through coordinate conversion. To optimize the fusion process, we assess the contribution of each joint based on human posture orientation and data smoothness, enabling us to fine-tune the weight ratio during data fusion and ultimately produce a dependable representation of human posture. We validate our methodology using the public FMS dataset for data fusion and model training. Experimental findings demonstrate that this fusion method substantially improves the smoothness and accuracy of the skeleton data, yielding an effective improvement in human posture recognition.


Introduction
Human posture recognition involves the extraction of human skeleton data from video sequences. Computer algorithms use this skeleton data to identify specific human posture classifications. This research direction is pivotal within computer vision and finds applications across diverse fields such as human-computer interaction, medical sports rehabilitation, pedestrian recognition, and virtual reality. This research presents an innovative approach employing two visual sensors to simultaneously capture human body motion data from distinct angles (Figure 1a). It optimizes the existing single-view human posture recognition method by leveraging multi-view data fusion to harness the organic complementarity inherent in data collected from different perspectives. After aligning the data from the different devices into a common coordinate system, fusion weights are determined based on the angle of the human body relative to the cameras and the smoothness of the data, resulting in a more trustworthy representation of human posture (Figure 1b). This enhances the reliability of the fused data and addresses prevalent issues in current algorithms, notably the lack of interactivity in the data fusion process and the overly rigid selection of high-credibility data. The improved data fusion algorithm presented in this article yields more dependable three-dimensional coordinate data for human skeletal joint points, substantially enhancing the accuracy of human posture recognition.
A single vision sensor is prone to issues such as occlusion of joint movements, jitter in joints due to random noise, data loss, and data oscillation, among others. These issues compromise data quality and accuracy, consequently diminishing the precision of human posture recognition. To validate the proposed method, we applied it to the publicly available FMS dataset, fusing the skeleton data collected by two vision sensors and integrating the result into the training and testing phases of the human posture recognition algorithm. The principal contributions of this study are twofold: (1) We introduce a data fusion method that combines skeletal data from multiple devices to produce reliable human posture data. This method enhances the accuracy of human posture recognition and bolsters the robustness of the entire data collection system, effectively addressing potential issues such as data loss and data jitter. (2) We apply this fusion method to the FMS dataset, using the merged data with multiple advanced human pose recognition algorithms to further substantiate its effectiveness.

Skeleton-Based Human Posture Recognition
In light of continuous advances in deep learning techniques for data analysis and processing, the adoption of deep learning for skeleton-based human posture recognition has become a dominant trend. The Convolutional Neural Network (CNN) [11] approach, while effectively representing the skeleton as a pseudo-image and capturing local correlations, is not ideally suited for sequential tasks. Li et al. [13] addressed this limitation by dividing the human skeleton into five segments, transforming them into two-dimensional action images, and subsequently applying image classification techniques. Simonyan and Zisserman [22] first proposed a dual-stream framework to capture spatio-temporal information in video frame sequences. This framework consists of two separately running CNNs, one extracting spatial information from a single RGB image and the other extracting motion information from video optical flow sequences; the two sets of features are fused in the final classification layer. Liu et al. [15] added a spatio-temporal interactive learning block in the middle of the network to complete feature fusion in the early stages. Wang et al. [24] changed the spatial feature extraction and recognition of RGB images from single frames to multiple frames, improving the network's ability to describe spatial features. In contrast, the Recurrent Neural Network (RNN) [29] method constructs the skeleton as a sequence of joint coordinate vectors. Wu and Shao [26] introduced a dynamic framework, pioneering the extraction of high-level bone joint features. Meanwhile, Liu et al. [14] maintained a single-stream approach, treating the human body as a tree structure and feeding the human body joint nodes into the Long Short-Term Memory network (LSTM) [8] in a depth-first traversal order. This allowed them to capture temporal relationships by stacking LSTM modules. Han and Shan [7] increased the number of feature extraction channels by optimizing feature extraction methods and improving feature extraction efficiency, thereby improving the recognition performance of the model. However, representing the human skeleton as a sequence of vectors or a two-dimensional mesh falls short of capturing the intricate dependencies among interconnected joints. The structure of human joint points and the skeleton naturally forms a graph, so the Graph Convolutional Network (GCN) [10] emerges as a more adept tool for capturing the intricate relationships among different joint points during human motion. ST-GCN [28] pioneered the application of graph convolutional networks to human action recognition. This approach leverages both the spatial edges naturally connecting joint points and the temporal edges linking the same joint points across consecutive frames. It constructs a spatio-temporal graph convolution, significantly enhancing the model's ability to grasp temporal and spatial relationships. 2S-AGCN [21] introduces a dual-level graph representation, featuring a global graph capturing common patterns across all data and individual graphs tailored to unique data instances. This approach overcomes the constraints of the predefined and unmodifiable skeleton graph in ST-GCN, introducing a new paradigm known as Adaptive Graph Convolutional Networks. In the same vein, MS-G3D [16] puts forth a multi-scale adjacency matrix and a unified spatio-temporal model. The multi-scale convolution effectively mitigates issues related to biased weighting, while the unified spatio-temporal model introduces cross-spatio-temporal jump connections. Furthermore, it incorporates a time-window mechanism in the temporal dimension to enhance the flow of spatio-temporal information. In a parallel development, Shift-GCN [4] introduces the concept of shift convolution [25] to human motion recognition. By combining the convolution operator with spatial shift operations, this approach simultaneously integrates information from both the spatial and channel domains, strategically offsetting channels in the temporal and spatial dimensions. This more accurately represents human joint constraints and temporal information. Lee et al. [12] sorted all nodes and generated a new adjacency matrix, improving the algorithm's robustness against key-point exchanges, displacements, or losses.

Multi-view Data Fusion
Currently, research on the fusion of multi-view data is relatively limited. Tong et al. [23] proposed a method utilizing multiple Kinects to scan distinct areas of the human body and fuse them into a comprehensive three-dimensional human body model. Geerse et al. [5] positioned four Kinect devices along one side of a corridor to extract gait parameters and analyze human gait. Müller et al. [17] strategically deployed six Kinects on both sides of a corridor, fusing skeletal data from both sides to enhance three-dimensional reconstruction for gait assessment, thereby improving the accuracy of skeletal data. Guo et al. [6] introduced a dual-Kinect data fusion technique based on posture angles. They devised a corresponding data selection mechanism that leveraged the angle relationship between different body posture directions and sensor positions, offering a straightforward and rapid data fusion solution. This approach effectively addressed data loss arising from self-occlusion of the human skeleton. Jiang et al. [9] proposed a data fusion method based on joint angles, performing fusion by calculating the weighted average of data from two Kinect devices placed orthogonally. However, this approach did not compensate for data loss in cases of missing data. Peng et al. [18] introduced a fusion algorithm that used redundant data to compensate for occlusion. Nonetheless, there was a substantial disparity between the redundant data and the original data, leading to poor correlation with preceding and subsequent frames and reduced action recognition accuracy. Chen et al. [2] adopted a data screening approach based on the physiological constraints of human joints. They exclusively used data from the primary device when it successfully tracked the data, and only resorted to fusion with another device when the primary device failed to track the data. However, the effectiveness of this data fusion approach requires further improvement.

Data Fusion
A vision sensor is proficient at capturing human motion data, from which bone extraction techniques can extract skeletal data; however, a single sensor used for motion capture encounters self-occlusion issues that compromise the accuracy of the acquired three-dimensional coordinates of human skeletal joint points. This research introduces a novel data fusion model predicated on human body posture orientation and data smoothness. It fuses and processes two sets of data acquired from different perspectives. Initially, a coordinate transformation is employed to align the joint data captured at varying angles into a shared world coordinate system, establishing a uniform data platform. Subsequently, the choice of data fusion method depends on which devices successfully capture data. If only one camera effectively captures data, that data is used directly; when both cameras successfully record data, data fusion is performed. Lastly, the model integrates joint posture orientation and data smoothness to assess the data contributions of the two devices, optimize the data fusion weight coefficients, and generate a reliable representation of human posture. This model significantly enhances the accuracy of motion-captured human posture data and resolves the issue of data loss due to self-occlusion in single-view capture.

Coordinate Calibration
In practice, a single vision sensor captures human skeleton joint data in its own camera coordinate system. Therefore, when using two or more vision sensors to collect data, accurate data fusion requires harmonizing the data acquired by these devices into a common world coordinate system, a process referred to as coordinate calibration. The primary objective of coordinate calibration is to establish a shared reference coordinate system, guaranteeing uniform coordinate standards across datasets. This standardization enables consistent processing and analysis of skeletal joint data gathered from diverse devices and is a fundamental prerequisite for effective data fusion. Only when the data exist within the same coordinate system can meaningful comparison and fusion occur, ultimately enabling precise motion capture.
During the conversion between coordinate systems, it is crucial to account for variations in scale, rotation, and translation. In our study, we focus on the transformation between the reference coordinate systems of two Kinects. Given that the parameters and characteristics of the two Kinect sensors are identical, including their scale, our coordinate transformation primarily involves rotation and translation. During the transformation procedure, we apply rotations around different coordinate axes at specific angles to derive the coordinate transformation matrix.
The three-dimensional coordinate conversion merely necessitates knowledge of the corresponding joint point coordinates in the two coordinate systems, allowing us to accurately calculate the rotation matrix R and translation matrix T that facilitate the transformation between these coordinate systems.
We establish the original coordinate system of device 2, denoted S2, as the reference world coordinate system, and convert the data obtained from device 1 into S2. The corresponding three-dimensional coordinate transformation is given in Equation (1):

[x', y', z']^T = R [x, y, z]^T + T        (1)

Here, the rotation matrix R characterizes the angular adjustment between the new coordinate system and the original one, while the translation matrix T defines the spatial displacement of the origin of the new coordinate system relative to the old one. (x, y, z) denotes the coordinates within the original coordinate system S1 of device 1, whereas (x', y', z') represents the new coordinates after conversion into S2. Consequently, the converted coordinates (x', y', z'), in conjunction with the coordinates acquired by device 2, jointly pinpoint the three-dimensional coordinate location of the same bone joint point.
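As a concrete sketch of Equation (1), the following NumPy snippet maps a joint coordinate from S1 into S2. The rotation and translation values here are illustrative placeholders, not the calibrated ones:

```python
import numpy as np

# Hypothetical calibration result: rotate 90 degrees about the y-axis
# and shift the origin by 0.5 m along x (illustrative values only).
theta = np.pi / 2
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0,           1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
T = np.array([0.5, 0.0, 0.0])

def to_world(p_s1: np.ndarray) -> np.ndarray:
    """Map a joint coordinate from device 1's frame S1 into S2 (Equation (1))."""
    return R @ p_s1 + T

p = np.array([1.0, 2.0, 3.0])   # (x, y, z) in S1
print(to_world(p))              # (x', y', z') in S2
```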
This study derives the rotation matrix and translation matrix using the geometric principles of Singular Value Decomposition (SVD) [1]. SVD is a widely employed matrix decomposition technique in mathematics and computational science. From a geometric perspective, applying singular value decomposition to a matrix is akin to executing a sequence of transformations encompassing rotation, scaling, and vector space adjustments. Our method relies on knowledge of corresponding points in the two coordinate systems, subsequently employing SVD to calculate the rotation and translation matrices. This process facilitates the transformation of points from one coordinate system to the other, enabling seamless data fusion between the two coordinate systems.
To streamline the calculations, we begin by translating both sets of data so that their centroids lie at the origin. This eliminates the need to factor in the translation operation, allowing us to focus solely on the rotation aligning the two coordinate systems and thus compute the rotation matrix R directly. The centroid (center of mass) is computed as in Equation (2).
centroid_A = (1/N) Σ_{i=1}^{N} A_i        (2)

Here, A is the set of data points obtained from device 1, where A_i denotes the coordinates of the i-th joint. N is set to 32, the total number of joint points on the human body. The center of mass for device 2 is computed analogously, as given in Equation (3):

centroid_B = (1/N) Σ_{i=1}^{N} B_i        (3)

We then accumulate the two sets of decentralized data, (A_i − centroid_A) and (B_i − centroid_B), into the cross-covariance matrix H:

H = Σ_{i=1}^{N} (A_i − centroid_A)(B_i − centroid_B)^T

Singular value decomposition is employed to decompose H (Equation (4)), where U is an orthogonal matrix representing a rotation operation, S is a diagonal matrix whose diagonal elements are the singular values, representing a scaling operation, and V^T is another orthogonal matrix representing a further rotation:

H = U S V^T        (4)

The rotation matrix can then be calculated according to Equation (5):

R = V U^T        (5)

Once the rotation matrix is determined, it can be employed to compute the translation matrix. This computation relies on the centroids of the corresponding point sets, and the translation matrix follows Equation (6):

T = centroid_B − R · centroid_A        (6)

Here, R is the rotation matrix obtained above, while centroid_A and centroid_B denote the centroids of the data collected by device 1 and device 2, respectively. Employing centroids for the translation matrix calculation yields results that are both more stable and accurate, as it mitigates the influence of sampling variance and outliers.
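The procedure of Equations (2)-(6) can be sketched as a minimal NumPy implementation of SVD-based rigid alignment. The determinant check guarding against a reflection is a standard safeguard of this algorithm, not something spelled out in the text:

```python
import numpy as np

def estimate_rigid_transform(A: np.ndarray, B: np.ndarray):
    """Estimate R, T mapping points A (device 1) onto points B (device 2).

    A, B: (N, 3) arrays of corresponding joint coordinates (N = 32 joints).
    Returns R (3x3) and T (3,) such that B_i ≈ R @ A_i + T.
    """
    centroid_A = A.mean(axis=0)                 # Equation (2)
    centroid_B = B.mean(axis=0)                 # Equation (3)
    # Cross-covariance of the decentralized point sets
    H = (A - centroid_A).T @ (B - centroid_B)
    U, S, Vt = np.linalg.svd(H)                 # Equation (4): H = U S V^T
    R = Vt.T @ U.T                              # Equation (5): R = V U^T
    if np.linalg.det(R) < 0:                    # guard against a reflection
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    T = centroid_B - R @ centroid_A             # Equation (6)
    return R, T
```

Given one frame of 32 corresponding joints from the two devices, the returned R and T can then be applied to every subsequent frame via Equation (1).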


Data Fusion
Following the coordinate calibration procedure, each joint is characterized by two sets of coordinates, denoted g_1i and g_2i. Here, g_1i represents the coordinates of the i-th joint after conversion from device 1, while g_2i corresponds to the original coordinate data for the same joint collected by device 2. To merge these two sets of data, we employ the fusion rule in Equation (7), where w_1i is the fusion weight assigned to the i-th joint from device 1 and w_2i is the fusion weight for the identical joint from device 2:

g_i = w_1i · g_1i + w_2i · g_2i        (7)
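A minimal sketch of the per-joint weighted fusion in Equation (7). The weights are assumed to be normalized so that w_1i + w_2i = 1 (consistent with the 1/0 assignment described later), and the weight values here are illustrative:

```python
import numpy as np

def fuse_joint(g1: np.ndarray, g2: np.ndarray, w1: float, w2: float) -> np.ndarray:
    """Fuse one joint's coordinates from the two devices (Equation (7))."""
    assert abs(w1 + w2 - 1.0) < 1e-9, "fusion weights should sum to 1"
    return w1 * g1 + w2 * g2

g1 = np.array([0.40, 1.20, 2.00])   # joint i, converted from device 1
g2 = np.array([0.44, 1.18, 2.04])   # joint i, as seen by device 2
print(fuse_joint(g1, g2, 0.7, 0.3))
```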
In instances where the device fails to successfully capture data for a specific joint point within a data In instances where the device fails to successfully capture data for a specific joint point within a data frame, it becomes imperative to compensate for this missing data.A commonly employed method involves utilizing the corresponding joint point data from the preceding and subsequent frames and computing their average value to serve as the predicted value for the absent data.This approach offers the advantage of requiring minimal computational effort, ensuring speedy predictions, and achieving an acceptable level of prediction accuracy.
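The neighbor-averaging compensation described above can be sketched as follows; the NaN-marking convention for lost frames and the hold-last-value fallback at sequence boundaries are our assumptions for illustration.

```python
import numpy as np

def fill_missing_frames(track):
    """Fill lost joint positions in a (T, 3) trajectory.

    `track` holds one joint's 3-D position per frame, with np.nan rows
    marking frames where tracking failed. Each missing frame is replaced
    by the average of the nearest valid preceding and following frames,
    matching the neighbor-averaging scheme described in the text.
    """
    track = np.asarray(track, dtype=float).copy()
    missing = np.isnan(track).any(axis=1)
    valid = np.flatnonzero(~missing)
    for i in np.flatnonzero(missing):
        prev_i = valid[valid < i]
        next_i = valid[valid > i]
        if prev_i.size and next_i.size:
            track[i] = (track[prev_i[-1]] + track[next_i[0]]) / 2.0
        elif prev_i.size:              # trailing gap: hold the last value
            track[i] = track[prev_i[-1]]
        elif next_i.size:              # leading gap: hold the next value
            track[i] = track[next_i[0]]
    return track
```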
The Kinect SDK provides information regarding the tracking status of skeletal joint points, classifying them into tracked, inferred, and lost statuses. Among these, the data in the tracked status is the most reliable, followed by the inferred status, while data in the lost status cannot be deemed trustworthy. In cases where both Kinect devices are in the tracked or inferred states, their respective fusion weights for skeletal joint point data depend on the posture orientation and data stability of the human body in relation to the two Kinect devices. When only one Kinect is in either the tracked or inferred state, its data should be prioritized, and its fusion weight set to 1. In this case, the data from the other Kinect device, which is experiencing data loss, should be disregarded, with its fusion weight set to 0.
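The state-gating rule above can be expressed as a small decision function. The numeric encoding of the SDK states is a hypothetical stand-in for illustration:

```python
# Hypothetical encoding of the SDK tracking states for this sketch.
TRACKED, INFERRED, LOST = 2, 1, 0

def state_gate(state1, state2):
    """Apply the tracking-state rule from the text.

    If only one device tracks (or infers) the joint, it gets weight 1 and
    the lost device gets 0. If both are usable, return None so that the
    orientation/smoothness weighting (Equations 11-12) takes over. If both
    are lost, return (0, 0): the frame must be filled from its neighbors.
    """
    ok1, ok2 = state1 != LOST, state2 != LOST
    if ok1 and ok2:
        return None                # defer to orientation/smoothness weights
    if ok1:
        return (1.0, 0.0)
    if ok2:
        return (0.0, 1.0)
    return (0.0, 0.0)              # both lost: interpolate from neighboring frames
```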
In summary, the fusion process is shown in Figure 2.
It is worth noting that the accuracy of vision sensor data detection is influenced by the orientation of the human body posture. Different orientations yield varying levels of data accuracy. When the human body is directly facing the device, the quality of the data it captures is notably higher, as the sensor can obtain more direct visual information. To quantify the human posture orientation, we generate a posture vector for the human body using the position data of the left and right shoulder joints. The difference between these two positions constitutes the vector N. We then calculate the angles α and β between the human posture vector N and the XOY planes of the two devices using Equations (8) and (9).

Figure 2
Fusion Process

Among them, N_x1 and N_y1 are the components of the vector N along the X and Y axes, respectively, in the S1 coordinate system, while N_x2 and N_y2 represent the components of N along the X and Y axes, respectively, in the S2 coordinate system. The angle α characterizes the orientation between the human body's shoulder vector and the front of device 1, serving as a measure of the extent to which the human body faces device 1. A smaller α signifies a more direct alignment of the human body with device 1, thereby enhancing the credibility of the data it collects. Similarly, β signifies the angle between the shoulder vector and the front of device 2. A smaller β implies a more direct alignment of the human body with device 2, resulting in increased reliability of the data collected by device 2.
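Since Equations (8) and (9) are not reproduced here, the following is one plausible reading that uses only the X and Y components the text mentions: the angle is zero when the shoulder line is parallel to the sensor's X axis (body squarely facing the device) and grows as the body turns away. Both the formula and the function name are our illustration, not the paper's exact equations.

```python
import numpy as np

def facing_angle(left_shoulder, right_shoulder):
    """Angle between the shoulder vector N and the device's X axis,
    computed in that device's coordinate system.

    With N = right_shoulder - left_shoulder, alpha = arctan(|N_y| / |N_x|)
    is 0 when the shoulder line lies along the sensor's X axis (body
    directly facing the device) and approaches pi/2 as the body turns
    side-on. Applying this in S1 gives alpha; in S2 it gives beta.
    """
    N = np.asarray(right_shoulder, dtype=float) - np.asarray(left_shoulder, dtype=float)
    return float(np.arctan2(abs(N[1]), abs(N[0])))
```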
Inherent in the data obtained by the vision sensor are noise and outliers that necessitate identification and processing. Given the inherent coherence of human motion sequences, discrepancies between frames corresponding to different actions should generally be modest. This implies that the data should exhibit a certain degree of smoothness. We can assign weights to the fusion operation based on the smoothness of the data, thereby diminishing the influence of outliers. Consequently, the fused skeletal data becomes smoother and more natural, elevating its credibility. By computing the positional error of the same joint point between two consecutive frames, as elucidated in Equation (10), we determine the data weight. A larger error indicates a higher likelihood of the data being an outlier with low credibility, warranting a lower weight. Conversely, a smaller error suggests less data irregularity and higher credibility, justifying a higher weight.

Considering the factors outlined above, it becomes evident that a smaller angle between the human posture and the vision sensor device, together with a reduced smoothness error between frames, results in a higher fusion weight. This signifies that when the human posture is more closely aligned with the sensor device, or when the data captured by the device exhibits lower inter-frame smoothness error, the device will carry greater weight in the fusion of skeletal joints. Adhering to this principle, we establish a weight for device 1, as expressed in Equation (11), while a comparable weight for device 2 is defined in Equation (12).

Experimental Verification and Analysis

Dataset
To evaluate the reasonableness and effectiveness of the data fusion method proposed in this paper, and to validate the effectiveness of the fusion algorithm for pose recognition, we use the FMS (Functional Movement Screen) dataset [27] for data fusion and model training.
The FMS dataset contains 3624 motion sequence samples, covering 7 major categories and 15 subcategories. The dataset was captured from two perspectives using two Azure Kinect cameras simultaneously, and can therefore be used to validate the data fusion method proposed in this study. Each perspective contributed 1812 samples, for a total of 3624 samples, performed by 45 volunteers. In addition, two auxiliary Azure Kinect cameras collected color images to supplement the data, so the dataset contains 3624 sets of color images and 3624 sets of depth images, each corresponding to an action sequence. To train and validate the model, we divided the dataset into training and validation sets: actions performed by 30 participants form the training set, while actions performed by the remaining 15 participants form the validation set.

Coordinate Transformation Verification
To verify the feasibility of coordinate conversion, we perform coordinate calibration on skeleton data collected from the two devices with different perspectives. To this end, we execute coordinate transformations on the data sourced from the FMS public dataset, following the procedure outlined in Section 2.1. This process consolidates the coordinates into a unified world coordinate system.
To evaluate the effectiveness of the coordinate transformation algorithm, we randomly selected eight sets of sample data. Using the SpineBase joint, we compute two key metrics, RMSE (Root Mean Square Error) and MAE (Mean Absolute Error), to gauge the performance of the coordinate transformation algorithm. RMSE and MAE serve as measures of the coordinate transformation error, with smaller values signifying higher accuracy in the coordinate transformation process. Experimental results are shown in Tables 1 and 2, respectively. Among them, y_i is the actual value, while h(x_i) is the value following coordinate conversion. According to the results shown in the tables, the maximum root mean square errors for the X, Y, and Z axes are 30.5, 30.8, and 30.5, respectively, while the average maximum root mean square error for the three axes is 27.6. In addition, the maximum mean absolute errors of the X, Y, and Z axes are 24.2, 26.4, and 23.2, respectively, while the average maximum absolute error for the three axes is 23.2.
These indicators collectively demonstrate that the coordinate transformation method yields favorable fitting results on this dataset. Its root mean square error and mean absolute error are both low, indicating high accuracy. It can therefore be concluded that the coordinate transformation method shows significant superiority and effectiveness on this dataset.
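For reference, the two error metrics over the actual values y_i and the converted values h(x_i) are the standard definitions:

```python
import numpy as np

def rmse(y, h):
    """Root mean square error between actual values y and converted values h."""
    y, h = np.asarray(y, float), np.asarray(h, float)
    return float(np.sqrt(np.mean((y - h) ** 2)))

def mae(y, h):
    """Mean absolute error between actual values y and converted values h."""
    y, h = np.asarray(y, float), np.asarray(h, float)
    return float(np.mean(np.abs(y - h)))
```

Applied per axis to the SpineBase joint trajectories, these yield the entries of Tables 1 and 2.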

Data Fusion Experimental Verification
To verify the effectiveness of the data fusion method in compensating for data loss and handling noise, we use the standard deviation of all joints for overall validation, and the square root of the SpineBase joint coordinates to verify the fusion method's handling of individual joint outliers. By calculating the standard deviation of all joints, the overall effectiveness of the data fusion method in handling outliers can be evaluated. Standard deviation is an indicator of the degree of dispersion of points in a dataset; a larger standard deviation implies greater data dispersion and variability. However, due to the inherent dispersion of human joints, using all 32 joints to calculate the overall standard deviation may yield large values. Therefore, when evaluating the effectiveness of the data fusion algorithm, the continuity and smoothness of the standard deviation should mainly be observed, to verify whether the algorithm can effectively solve problems such as missing and jumping data and thus produce coherent, consistent motion data.
In addition, we chose the square root of the SpineBase joint coordinates as an indicator to verify the effectiveness of handling individual joint outliers. The SpineBase joint is located at the base of the human spine, and its movement characteristics have a significant impact on body posture and movement patterns. By observing how outliers at this joint are handled, we can evaluate the robustness and accuracy of the data fusion method at the individual joint level. In summary, by jointly considering the standard deviation of all joints and the outlier handling for the SpineBase joint, we can comprehensively evaluate and verify the outlier processing of the data fusion method. The standard deviation of all joints is shown in Figure 3, and the square root result of the SpineBase joint is shown in Figure 4.
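The per-frame dispersion measure plotted in Figure 3 can be computed as follows (the (T, J, 3) array layout is our assumption about how the sequences are stored):

```python
import numpy as np

def per_frame_joint_std(frames):
    """Per-frame standard deviation of all joint positions.

    frames: (T, J, 3) array of T frames with J joints. Returns a (T, 3)
    array: for each frame, the per-axis standard deviation across the J
    joints. Discontinuities or jumps in this curve over consecutive
    frames indicate missing or outlying data in the sequence.
    """
    return np.asarray(frames, dtype=float).std(axis=1)
```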
As shown in the figures, the fused data exhibits a relatively complete and continuous profile, effectively addressing issues such as data gaps and discontinuities. These observations affirm the effectiveness of the fusion algorithm proposed in this article. Additionally, data fusion enhances the smoothness of the acquired data, consequently improving data reliability and the accuracy of posture recognition. To compare the data fusion method proposed in this article with previous fusion methods on human posture recognition, we input the fused three-dimensional joint coordinate data into ST-GCN for training. Considering that the direction and length of human bones are crucial for posture recognition, we also input bone data and joint-bone data into ST-GCN for training to compare the results. The experimental results are shown in Table 3.
The experimental results demonstrate that the data fusion method proposed in this study outperforms previous methods in human pose recognition tasks. Specifically, this fusion method achieves superior performance in both the joint and joint-bone scenarios. These findings provide substantial evidence for the advantages of the data fusion method.

Human Posture Recognition Experiment
The objective of this experiment is to investigate the influence of data fusion on human posture recognition and to validate the efficacy of the data fusion algorithm in enhancing posture recognition accuracy. We conducted posture recognition on the original data and on the fused skeleton data using various state-of-the-art algorithms, then performed a comparative analysis of the recognition accuracy; the results are presented in Table 4.
The experimental results clearly demonstrate that the data fusion method proposed in this study leads to a substantial enhancement in the accuracy of human posture recognition. Among the six human posture recognition models examined, the fused data consistently yielded the most favorable outcomes. This underscores the efficacy of the data fusion approach presented in this study, particularly in addressing self-occlusion of skeletal joints, and demonstrates the practical applicability of the work.

Conclusion
This article introduces a data fusion method designed to amalgamate skeletal data collected from two vision sensor devices. The approach leverages human body posture orientation and data smoothness to determine the fusion weights for skeletal data, resulting in more dependable human posture data and enhanced accuracy of skeletal data acquired through vision sensors. The study draws upon the FMS public dataset for data fusion and the training of advanced human posture recognition algorithms. A comparative analysis of recognition accuracy between the original and fused data validates the efficacy of the proposed data fusion method, contributing to the advancement of human posture recognition.
In future work, we plan to integrate additional sensor devices for skeletal data fusion, expand the scope of experiments, and investigate optimal device placement strategies to attain the highest data accuracy and reliability. The inclusion of data from other sensors, such as inertial measurement units (IMUs) or cameras, in skeletal data fusion is also worth considering. Such multi-modal integration has the potential to further improve the accuracy and resilience of posture recognition, offering broader possibilities for developments in related domains.

Figure 1
Figure 1 (a) Multi-view data collection (b) Coordinate calibration and data fusion

Figure 2
Figure 2 Fusion Process


Figure 3
Figure 3 (a) Standard deviation of X-axis in consecutive frames. (b) Standard deviation of Y-axis in consecutive frames. (c) Standard deviation of Z-axis in consecutive frames.

Figure 4
Figure 4 Square root of SpineBase joint in consecutive frames


Human Posture Recognition Verification

Comparisons with Other Fusion Methods
ST-GCN (Spatial-Temporal Graph Convolutional Network [28]) stands out as the pioneering network to introduce graph convolutional networks into the field of human posture recognition. To capture the intricate relationships among human skeletal components, ST-GCN incorporates a graph structure representing the connections between joints in the human body. It introduces temporal and spatial convolution techniques tailored to this graph structure, enabling the handling of temporal sequencing relationships. Figure 5 visually delineates the human body graph structure. Within ST-GCN, each joint is treated as a node within the graph, with the interconnections between them represented by graph edges. These edges delineate the adjacency relationships among human body joints, encompassing aspects such as bone connections and limb movement directions. Additionally, in the temporal dimension, the same joint is linked across frames by supplementary edges, accounting for temporal relationships. Through the construction of a graph structure representing the human skeleton, ST-GCN harnesses the power of graph convolution operations for posture recognition. These operations excel at aggregating information from proximate regions and disseminating it globally, enabling a more comprehensive understanding of spatial and temporal interdependencies within human poses.
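The skeleton-graph idea can be illustrated with a toy adjacency construction. The 5-joint skeleton below is hypothetical (not the Kinect 32-joint map), and the row normalization D^{-1}(A + I) is a simplification: ST-GCN itself uses a partitioned, symmetrically normalized adjacency.

```python
import numpy as np

# Toy 5-joint skeleton (hypothetical indices):
# 0 spine-base, 1 spine-top, 2 head, 3 left hand, 4 right hand.
BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]

def skeleton_adjacency(num_joints, bones):
    """Build a normalized spatial adjacency for ST-GCN-style layers.

    Joints are graph nodes and bones are edges; self-loops are added so a
    graph convolution mixes each joint with its neighbors:
    A_hat = D^{-1} (A + I).
    """
    A = np.eye(num_joints)
    for i, j in bones:
        A[i, j] = A[j, i] = 1.0
    return A / A.sum(axis=1, keepdims=True)   # row-normalize by node degree

def spatial_gcn_step(X, A_hat):
    """One (weightless) spatial aggregation: each joint's feature becomes
    the mean of its own and its neighbors' features."""
    return A_hat @ X
```

In a full model this aggregation is interleaved with learned feature transforms and temporal convolutions over the frame axis.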

Figure 5
Figure 5 Spatial temporal graph of a skeleton sequence proposed by ST-GCN


Table 3
Comparison of accuracy (%) of different fusion methods on ST-GCN model

Table 4
Comparison of accuracy (%) of data before and after fusion on state-of-the-art methods