Least Squares Support Vector Machine Regression Based on Sparse Samples and Mixture Kernel Learning

Least squares support vector machine (LSSVM) is a machine learning algorithm based on statistical learning theory. Its advantages include robustness and computational simplicity, and it performs well on small-sample data. However, the LSSVM model lacks sparsity and cannot handle large-scale data problems, so this article proposes an LSSVM method based on mixture kernel learning and sparse samples. The algorithm reduces the initial training set to a sub-dataset using a sparse selection strategy, converts the single kernel function in the LSSVM model into a mixture kernel function, and optimizes its parameters. The reduced sub-dataset is then used to train the LSSVM. Finally, a group of datasets from the UCI Machine Learning Repository is used to verify the effectiveness of the proposed algorithm, which is also applied to real-world power load data, where it achieves a better fit and improves prediction accuracy.


Introduction
Support Vector Machine (SVM) [25] is one of the most important algorithms in the field of machine learning. SVM has been widely employed on account of its small-sample learning ability, good generalization, and high accuracy. In the era of big data, however, its very large computational cost on large samples has reduced its prominence, although it remains a commonly used machine learning algorithm [9,18,26]. Applications of SVM have increased significantly in recent years across multiple sectors as a successful machine learning approach to modeling the relationship between inputs and outputs in regression problems [8,30,31].
The main advantages of the SVM algorithm are: (1) It is very effective for classification and regression problems with high-dimensional features, and it still performs well when the feature dimension is greater than the number of samples. (2) Only a subset of the samples, the support vectors, is used to make hyperplane decisions, without relying on all the data. (3) The large variety of available kernel functions makes it very flexible for solving various nonlinear classification and regression problems. (4) When the sample size is not massive, classification accuracy is high and generalization ability is strong.
The main disadvantages of the SVM algorithm are: (1) SVM is not suitable for use when the sample size is very large and the kernel function mapping dimension is very high. (2) There is no universal standard for the choice of kernel function for nonlinear problems, and it is difficult to choose a suitable kernel function.
Least squares support vector machine (LSSVM) [24] is an improved form of SVM. The difference is that SVM solves a quadratic programming problem with linear inequality constraints, whose calculation process is complex and requires large computational space, whereas LSSVM uses the sum of squared errors over the training set as the loss function, which converts the quadratic programming problem into the solution of a linear system of equations and makes the problem much easier to solve. Although LSSVM inherits the advantages of SVM, it is precisely because of this conversion step that the final decision function depends on all the samples, so LSSVM loses the sparsity of the solution. When processing large-scale data, as sample size and structural diversity increase, computer memory can easily overflow, which affects the prediction accuracy and generalisation ability of the algorithm. As a result, LSSVM appears unable to cope with large-sample problems.
To solve the problem of the sparsity of solutions, many researchers have put forward new and improved algorithms, which mainly address the sparseness problem from the standpoint of the training sample set. Suykens et al. proposed a pruning algorithm based on the size of the support values after LSSVM model training [23]. This algorithm deletes the sample points corresponding to smaller support values and retains the sample points with larger support values to decide the model. The disadvantage of this method is that model training is performed twice, so the solution process is complicated and time-consuming. Subsequently, an LSSVM algorithm with a fixed-size sample set and corresponding improved algorithms [3,4], as well as methods combining LSSVM with other machine learning algorithms [10,13], have appeared. The core concept of these algorithms is to compress large datasets into smaller sub-datasets and then train the LSSVM model on them [15,16]. Since the reduced sub-sample set carries almost all the important information of the original sample, it can be used as the training sample for the LSSVM model [11,12,14,20]. [11] and [12] solved fault diagnosis problems, while [20] and [14] proposed deep-structured LSSVMs to solve classification problems. To increase accuracy, a variety of deep network models based on SVM have been proposed in [6,7,19,21,29] and successfully applied to various classification and regression prediction scenarios. A support vector machine classification algorithm based on deep kernel theory that can be applied to large-scale datasets was proposed in [21,29], and a deep learning model based on support vector machines and a probability output network has also been proposed [7,19]. To obtain a good representation of the data distribution, one can use an algorithm with subset selection, or a random subset as a simpler scheme [22].
However, these algorithms cannot guarantee a sufficient reduction of large datasets, their running time is long, and their prediction accuracy is not high. The prediction accuracy of LSSVM is also affected by the kernel function and its parameters, which makes the selection of the kernel function a key consideration. So far, there is no definite theory or method for determining the kernel function and its parameters, and improper parameter selection can lead to overfitting or underfitting of the regression model.
To solve the above problems, further research on LSSVM is necessary. An LSSVM regression algorithm based on sparse samples and mixture kernel learning is proposed in this paper. For large-scale datasets, an effective sparse selection strategy is adopted to reduce the large-scale dataset to a smaller subset, and an optimization algorithm is used to optimize the mixture kernel function, thereby addressing the LSSVM sparsity problem.
The remainder of this article is structured as follows. Sections 2 and 3 provide brief reviews of LSSVM and the sparse subset selection strategy, respectively. The proposed method, Improved Artificial Bee Colony Mixture-Kernel LSSVM (IABC-MixKLSSVM) with sparsity (SIABC-MixKLSSVM), is presented in Section 4. In Section 5, the experimental results of the related algorithms are given, analysed, and summarized.


Description of LSSVM

Given a training set $\{(x_k, y_k)\}_{k=1}^N$, where $x_k \in \mathbb{R}^m$ and $y_k \in \mathbb{R}$ denote the input and output of LSSVM, respectively, and $m$ is the dimension of the input, the LSSVM model can be described as the constrained optimization problem

$$\min_{w,b,e} J(w,e) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{k=1}^N e_k^2, \qquad \text{s.t.}\ \ y_k = w^T \varphi(x_k) + b + e_k,\ \ k = 1, \dots, N,$$

where $\varphi(\cdot)$ maps the input into a high-dimensional feature space, $\gamma > 0$ is the regularization parameter, and $e_k$ is the error of the $k$-th sample. The Lagrangian of this problem is

$$L(w, b, e, \alpha) = J(w, e) - \sum_{k=1}^N \alpha_k \big( w^T \varphi(x_k) + b + e_k - y_k \big),$$

where $\alpha_k \in \mathbb{R}$ is the Lagrange multiplier corresponding to the $k$-th sample, and the corresponding sample points are called Support Vectors (SVs). According to the KKT conditions, eliminating $w$ and $e$ yields the equivalent linear system

$$\begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix},$$

where $\Omega_{kl} = K(x_k, x_l) = \varphi(x_k)^T \varphi(x_l)$ is the kernel matrix. Kernel types and parameters affect the prediction accuracy of the trained LSSVM model, and the selection of the kernel function plays an important role in the learning task. In the LSSVM model, a kernel function is used to map the input data to a high-dimensional space, and each kernel function has its own characteristics and a different effect on the performance of LSSVM. Here, two types of kernel functions, Gaussian and polynomial, are combined to create a Mixture Kernel (MixK) in the LSSVM model; MixK does not need to change the original mapping space, which ensures the effectiveness of its functions [17]:

$$K_{mix}(x, x_k) = \lambda \exp\!\left( -\frac{\| x - x_k \|^2}{2\sigma^2} \right) + (1 - \lambda)\, \big( x^T x_k + c \big)^d, \qquad 0 \le \lambda \le 1.$$

According to the Mercer condition, a convex combination of Mercer kernels is still a kernel, so $K_{mix}(x, x_k)$ satisfies the kernel function property of Mercer's condition. Therefore, the prediction output of the LSSVM model is

$$f(x) = \sum_{k=1}^N \alpha_k K_{mix}(x, x_k) + b.$$

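As a concrete illustration of the formulation above, the following Python sketch solves the LSSVM linear system with a Gaussian-polynomial mixture kernel. Function names and default hyperparameter values (`lam`, `sigma`, `c`, `d`, `gamma`) are illustrative, not taken from the paper, where they are tuned by the optimizer.

```python
import numpy as np

def mix_kernel(X1, X2, lam=0.5, sigma=1.0, c=1.0, d=2):
    """Mixture kernel: convex combination of Gaussian (RBF) and polynomial kernels."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    rbf = np.exp(-sq / (2 * sigma**2))
    poly = (X1 @ X2.T + c) ** d
    return lam * rbf + (1 - lam) * poly

def lssvm_fit(X, y, gamma=10.0, **kp):
    """Solve the LSSVM linear system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = mix_kernel(X, X, **kp) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[1:], sol[0]          # alpha, b

def lssvm_predict(Xtr, alpha, b, Xte, **kp):
    """Prediction output: f(x) = sum_k alpha_k K_mix(x, x_k) + b."""
    return mix_kernel(Xte, Xtr, **kp) @ alpha + b
```

With a large `gamma` the regularization term is weak and the model nearly interpolates the training data, which is a quick sanity check on the solver.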
Sparse Subset Selection
Clustering is a machine learning technique that groups data points. As one of the most well-known clustering algorithms, K-Means has the advantages of fast running speed and wide applicability, but its biggest shortcoming is that the number of clusters and the initial points must be preset. When the training sample set is large, the K-Means algorithm needs to sort in each iteration when calculating the median vector, and the clustering effect is affected. In 1975, Fukunaga et al. proposed the mean-shift algorithm, a non-parametric method based on density gradient ascent [5]. It is widely used in target tracking, data clustering, classification [27], and other scenarios. The basic idea is to randomly select an initial centre point, compute the mean of the distance vectors from the centre point to all points within a certain range of it to obtain an offset mean, and then move the centre point to the offset-mean position; through this repeated movement, the centre point gradually approaches the best position. This idea is similar to gradient descent, which reaches a local or global optimum by constantly moving in the gradient direction [1,2]. The geometric explanation is as follows: if the sample points $x_i$ obey the distribution of a probability density function $f(x)$, then, because the non-zero gradient of the probability density function points in the direction in which the probability density increases the most, the sample points in the region $S_h$ fall more along the direction of the probability density gradient, and the mean-shift vector $M_h(x)$ points in the direction of the probability density gradient [28]. In other words, the mean-shift algorithm is essentially a gradient-based optimisation algorithm.
Given a point $x$ in $d$-dimensional space, the basic form of the mean-shift vector is defined as

$$M_h(x) = \frac{1}{k} \sum_{x_i \in S_h} (x_i - x),$$

where $k$ indicates the number of points falling into the region $S_h$, and $S_h$ is a high-dimensional sphere with radius $h$, i.e. the set of points $y$ satisfying

$$S_h(x) = \{\, y : (y - x)^T (y - x) \le h^2 \,\}.$$

In the basic form every point in $S_h$ contributes equally, although the effect of a point becomes smaller as its distance from the centre increases. This leads to the following kernel-weighted improvement:

$$M_h(x) = \frac{\sum_{i=1}^n G\!\left( \frac{x_i - x}{h} \right) (x_i - x)}{\sum_{i=1}^n G\!\left( \frac{x_i - x}{h} \right)},$$

where $G(\cdot)$ is a kernel function (e.g. Gaussian) assigning larger weights to nearer points.
The Mean-shift clustering movement process is shown in Figure 1.

Figure 1
The mean-shift clustering movement process

Because LSSVM lacks sparsity and is not suitable for large-scale datasets, this article uses the mean-shift clustering method to obtain sparse subsets, which is beneficial to LSSVM model training and prediction. The process is shown in Algorithm 1:

Algorithm 1. Mean-shift algorithm process
Step 1 Randomly select an initial center c in the given dataset; h is the radius of S_h, and σ is the shift threshold.
Step 2 Compute the vectors from c to each element in the set M (the points inside S_h), and add these vectors to obtain the vector shift.
Step 3 Update the center point, c = c + shift; the moving distance is ‖shift‖.
Step 4 Repeat Steps 2-3 until ‖shift‖ falls below σ, and obtain the center c at this time.
Step 5 Calculate the distance d between c and the center of the previous iteration; if d is below a merge threshold, merge the two centers, otherwise c generates a new cluster point.
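Algorithm 1 can be sketched as follows. This is a minimal flat-kernel implementation; the merge threshold `merge_dist`, the iteration cap, and the seeding of start points are assumptions, since the paper only specifies the radius h and the shift threshold σ.

```python
import numpy as np

def mean_shift(X, h=1.0, sigma=1e-3, merge_dist=None, max_iter=300, seed=0):
    """Flat-kernel mean-shift sketch of Algorithm 1. Returns the cluster centers.

    Each start point is shifted to the mean of its h-neighbourhood until the
    shift length falls below sigma (Steps 2-4); converged centers closer than
    merge_dist to an existing center are merged (Step 5)."""
    if merge_dist is None:
        merge_dist = h / 2                              # assumed merge threshold
    rng = np.random.default_rng(seed)
    centers = []
    for start in X[rng.permutation(len(X))]:
        c = start.astype(float)
        for _ in range(max_iter):
            M = X[np.linalg.norm(X - c, axis=1) <= h]   # points inside sphere S_h
            shift = M.mean(axis=0) - c                  # mean of the shift vectors
            c = c + shift
            if np.linalg.norm(shift) < sigma:           # converged
                break
        # merge with an existing center if close enough, else record a new cluster
        if all(np.linalg.norm(c - p) >= merge_dist for p in centers):
            centers.append(c)
    return np.array(centers)
```

On two well-separated blobs, all start points converge to one of the two density modes and the merge step collapses them to two cluster centers.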

SIABC-MixKLSSVM
To solve the problem that the LSSVM model is not suitable for large-scale datasets, this article uses the sparse strategy to reduce the dataset, which decreases the computational cost and complexity. The parameters of the LSSVM model with a mixture kernel are optimized by IABC to improve the accuracy of regression prediction. Figure 2 shows the schematic diagram of SIABC-MixKLSSVM, and Algorithm 2 provides the detailed process of SIABC-MixKLSSVM.
Algorithm 2. SIABC-MixKLSSVM process
Step 1 Input the original dataset.
Step 2 Initialize all parameters of the proposed method.
Step 3 Select a subset using Algorithm 1 to obtain the final training dataset.
Step 4 Use IABC to select appropriate kernel functions and optimize the corresponding parameters on the final dataset.
Step 5 Obtain the model with the best parameters.
Step 6 Test the model.
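Steps 4 and 5 above amount to searching the mixture-kernel and regularization parameters on the reduced data. The sketch below illustrates this loop; IABC itself is a population-based optimizer that is not reproduced here, so plain random search is used as a stand-in, and all function names and parameter ranges are illustrative assumptions.

```python
import numpy as np

def kmix(A, B, lam, sigma):
    """Simplified mixture kernel: RBF plus a fixed quadratic polynomial term."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return lam * np.exp(-sq / (2 * sigma**2)) + (1 - lam) * (A @ B.T + 1) ** 2

def fit_predict(Xtr, ytr, Xte, gamma, lam, sigma):
    """Train LSSVM via its linear system and predict on Xte."""
    n = len(ytr)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = A[1:, 0] = 1.0
    A[1:, 1:] = kmix(Xtr, Xtr, lam, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], ytr)))
    return kmix(Xte, Xtr, lam, sigma) @ sol[1:] + sol[0]

def tune(Xtr, ytr, Xval, yval, iters=50, seed=0):
    """Random-search stand-in for IABC: sample (gamma, lam, sigma) and keep the
    setting with the lowest validation RMSE (Steps 4-5 of Algorithm 2)."""
    rng = np.random.default_rng(seed)
    best = (None, np.inf)
    for _ in range(iters):
        p = (10 ** rng.uniform(-1, 4), rng.uniform(0, 1), 10 ** rng.uniform(-1, 1))
        rmse = np.sqrt(np.mean((fit_predict(Xtr, ytr, Xval, *p) - yval) ** 2))
        if rmse < best[1]:
            best = (p, rmse)
    return best
```

Swapping `tune` for a real IABC implementation only changes how candidate parameter triples are proposed; the fit-and-score inner loop stays the same.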

Experiments
To test the performance of the proposed algorithm, we used datasets from the UCI Machine Learning Repository to test the various algorithms and analyse the results. All experiments are conducted on an Intel Core i7-3770 CPU @ 3.20 GHz with 4 GB RAM in a MATLAB 2018a environment. To avoid randomness in the experimental results, each model is run 10 times on each dataset. All input datasets are normalized to zero mean and unit variance. The kernel functions, optimization parameter settings, and value ranges used in the algorithms are shown in Table 2.
In addition, when the mean-shift method is used to sparsify samples, the spherical radius h of S_h needs to be set manually, and its value varies with the size of the dataset.


Test Data and Evaluation Indicators
The data for this experiment comes from 14 datasets of different sizes in the UCI Machine Learning Repository. Five evaluation indicators are used: MAE, RMSE, MAPE, STDE, and running time (TIME).

Experimental Results and Analysis
To verify the effectiveness of the mean-shift clustering algorithm used in this article to sparsify the dataset, we randomly generate 450 points in a two-dimensional space to form a visual dataset, as shown in Figure 3. The blue circles represent all the points in the dataset, and the pink asterisks represent the updated points after each iteration; the pink asterisks form the trajectory of the mean-shift method in finding the extreme points. Following the description of Algorithm 1, Figure 3 shows the process of finding cluster points by the mean-shift method.
Randomly select a center point in the dataset, take the spherical area with radius h as the initial set M, and move along the direction of increasing point density. This process is repeated until the moving distance satisfies ‖shift‖ < σ; the center at this time is then recorded. The distance d between this center and the center of the previous iteration is calculated; if d is below the merge threshold, the two centers are merged, otherwise the center at this time becomes a new cluster point. Repeating the above process finally yields the five clusters shown in Figure 3. Figure 4 shows the final clustering result obtained by the mean-shift method: five colors represent the five clusters, and as the mean-shift method updates and iterates, five cluster centers are finally obtained, marked with black asterisks.
The experimental results show that the SIABC-MixKLSSVM method can achieve efficient and accurate prediction for large-scale datasets. Table 3 shows the comparison results of the algorithms based on the LSSVM model and three popular algorithms on the training set and the testing set. On the training set, the MAE and STDE of SIABC-MixKLSSVM are smaller, while the RMSE and MAPE of S-LSSVM are smaller. On the test set, except for MAPE, the evaluation indices of SIABC-MixKLSSVM are smaller and its performance is better. In summary, whether on the training set or the test set, the prediction effect of SIABC-MixKLSSVM is the best of all the algorithms. The algorithms based on the LSSVM model outperform the other three algorithms; in particular, the SIABC-MixKLSSVM algorithm gives better results than the BP and ELM algorithms, even compared with the deep-structured algorithm, which shows the effectiveness and feasibility of the algorithm proposed in this paper.
Table 4 shows the test results of the five evaluation indicators for the five algorithms on datasets of different sizes from the UCI database. In Table 4, bold indicates the best (smallest) value, i.e. the best prediction effect. Compared with the other given algorithms, the values of the five evaluation indices for S-LSSVM are higher, indicating the worst performance on all datasets. Comparing S-LSSVM and SIABC-MixKLSSVM, both of which perform sample sparsification, it is not difficult to find that SIABC-MixKLSSVM achieves better predictions, although its TIME is somewhat longer. This shows that optimizing the kernel function of the sparse LSSVM model helps to improve the prediction results. The RVM algorithm performs better on the Energy Efficiency (Cooling) and Bias Maximum Temperature datasets; on the other datasets, the evaluation indicators of SIABC-MixKLSSVM are smaller than those of RVM and its performance is better. This shows that when the amount of data is very small, RVM has the advantage of being sparser than SVM, but as the amount of data increases, the accuracy of RVM drops significantly, making it unsuitable. Compared with the BP and ELM algorithms, except that ELM performs better on the White Wine Quality and Bias Minimum Temperature datasets, SIABC-MixKLSSVM has the best evaluation indicators on the other datasets. Especially when the amount of data reaches the tens of thousands, the sparse strategy used to reduce the dataset greatly shortens the running time of the algorithm, while optimizing the mixture kernel parameters of the LSSVM model improves the prediction accuracy. The above analysis shows that the method proposed in this paper is effective and feasible.
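For reproducibility, the error indicators reported above can be computed as follows. This is a sketch: STDE is assumed to be the standard deviation of the residuals, and MAPE assumes the targets are nonzero; the exact definitions used in the tables are not restated in the text.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MAE, RMSE, MAPE (%) and STDE of the residuals for a regression model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    return {
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAPE": np.mean(np.abs(err / y_true)) * 100.0,  # requires y_true != 0
        "STDE": np.std(err),
    }
```

All four indicators are zero for a perfect prediction, and smaller values indicate better performance, matching the bold-is-best convention of Table 4.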
This shows that the algorithm in this paper can not only sparsify samples but also improve prediction ability by optimizing the parameters of the mixture kernel function, and it is competitive. However, if the feature dimension is much larger than the number of samples, the prediction accuracy of this algorithm is not high.

Conclusions
LSSVM is an improved version of the SVM algorithm, but it lacks the sparsity of SVM, and its single kernel function leads to low generalisation ability and accuracy on large datasets. In response to this situation, the authors used a sparsity strategy to reduce the initial samples to a subset, solving the poor-sparsity problem of LSSVM on large datasets. The single kernel function in the LSSVM model was changed to a mixture kernel function, and the IABC algorithm was used to optimize its parameters, which improves the prediction accuracy. On the standard UCI datasets, the experimental results and analysis show that the sparse selection strategy effectively solves the problem that the LSSVM model is not suitable for large-scale datasets, and that the SIABC-MixKLSSVM proposed in this article is effective. The algorithm was also applied to real-world power load data, where SIABC-MixKLSSVM again achieves a better fitting effect, showing that the algorithm has higher forecasting accuracy. Future work will draw on the idea of deep learning, combining the traditional SVM algorithm with deep structures to form a multi-layer LSSVM model for specific problems such as time series forecasting and bearing fault diagnosis.