Dual-Layer Deep Ensemble Techniques for Classifying Heart Disease

The prevalence of heart disease is increasing at a rapid rate due to changes in food habits and lifestyle all over the world. Early prediction and diagnosis of this fatal disease is a highly challenging task. Nowadays, ensemble learning approaches are preferred owing to their effectiveness when compared to the performance of a single classification algorithm. In this work, a Dual-Layer Stacking Ensemble (DLSE) technique and a Deep Heterogeneous Ensemble (DHE) technique for classifying heart disease are proposed. The DLSE uses several heterogeneous classifiers to form an ensemble that is both efficient and diverse. The proposed framework consists of two layers, with the first layer consisting of three different base learning algorithms: Naïve Bayes (NB), Decision Tree (DT) and Support Vector Machine (SVM). The second layer comprises three different classifiers: Extremely Randomized Trees (ERT), AdaBoost Classifier (ABC) and Random Forest (RF). The second layer utilizes the results from the first layer to provide a diverse input for the three classifiers. Finally, the outcomes are fed to the meta-classifier Gradient Boosted Trees (GBT) to generate the final prediction. The DHE uses three deep learning models, a Convolutional Neural Network with Bidirectional Long Short-Term Memory (CNN BiLSTM), an Artificial Neural Network (ANN) and a Recurrent Neural Network (RNN), with RF, ERT and GBT as the meta-learners. The performance of the proposed methods is compared with traditional state-of-the-art classifiers as well as existing ensemble learning and deep learning methods. The experimental outcomes show that the proposed DLSE and DHE methods perform exceptionally well in terms of accuracy, precision and recall.


Introduction
The World Health Organization (WHO) has stated that nearly 31% of annual deaths occur because of heart disease [58]. The WHO has also estimated that more than 75% of those deaths occur in middle- and low-income countries [57]. This increase in heart disease is mainly driven by factors such as years of alcohol abuse, smoking, unhealthy food habits, stress and lack of physical activity. Changes in the environment, such as the increase in the level of air pollution and variations in temperature, also play a role in the prevalence of heart disease. It has been estimated that over 54 million people in India suffer from heart-related ailments. The recent Coronavirus Disease 2019 (COVID-19) outbreak has raised concern over a substantial increase in heart-related ailments, as the pandemic has increased the risk of severe infection in people with underlying heart disease or heart-related problems. Therefore, there is a need for a proper classification methodology not only for detecting heart disease but also for predicting the possibility of heart disease in the future.
Machine learning [25] has been used extensively by researchers to classify and predict heart disease. Recent advancements in parallel processing [12] and Graphics Processing Unit (GPU) technology [60] have urged many researchers to utilize this power to process data more effectively. Ensemble methods [46] are known to be highly effective in solving classification problems and are among the most preferred techniques in recent years. Ensemble techniques [17] rely on a collection of classifiers rather than focusing on the performance of a single classifier. These approaches build a meta-model based on the results of several diverse classifiers, which is then used to provide the final prediction for the problem. A wide variety of machine learning algorithms have been developed over recent years for solving real-world classification and regression problems, most of which aim to increase the accuracy of classification and prediction.
Much research has been carried out in search of an algorithm that provides high accuracy, and ensemble approaches fall into this category. Ensemble techniques include model fusion, dynamic selection of base learners, combinations of the same or different base learners, bagging, voting schemes and stacked generalization, among others. In the modern era, deep learning models have also been successfully applied to classification and prediction tasks, as they automate feature extraction through hierarchical feature learning.

Related Work
Ensemble approaches have proven to be more effective than a single classifier. Some of the recent works on ensemble approaches are discussed in this section. Bashir et al. [7] discussed an ensemble approach using bagging for diagnosing heart disease. The approach used a multi-objective voting scheme for the final prediction result. Al-Barazanchi et al. [1] developed a bagging model for diagnosing neuromuscular disorders. The technique used a Decision Tree as the base learner and a voting mechanism to obtain the final prediction. Nilashi et al. [35] proposed an adaptive neuro-fuzzy ensemble model for predicting hepatitis disease. This model used a Self-Organizing Map (SOM) for clustering the data. The major drawback of this method is the computational time needed for diagnosing the disease.
Atallah and Al-Mousa [5] developed an ensemble method using the majority voting scheme. Four classifiers were used and their predictions were combined using hard voting. This approach is simply a combination of four basic classifiers under a voting scheme, and its performance was limited. Ani et al. [3] proposed a rotation forest-based ensemble technique for disease diagnosis that used RF as the base learner. A two-tier classification ensemble for detecting coronary heart disease was explored by Tama et al. [53]. This technique used RF, Gradient Boosting Machine (GBM) and Extreme Gradient Boosting (XGBoost) as separate homogeneous ensembles. Yekkala and Dixit [63] designed a Genetic Algorithm (GA) based ensemble for classifying heart disease that used GA for selecting the attributes for classification, but this model was validated on only a single dataset. Brunese et al. [11] provided an ensemble learning method for detecting brain cancer. A hybrid ensemble for detecting heart disease was designed by Zhenya and Zhang [67]. This ensemble used five heterogeneous classifiers and the Relief algorithm for dimensionality reduction, and was tested using the Statlog dataset from the UCI data repository. A swarm-based RF algorithm was contributed by Asadi et al. [4]. This technique combined multi-objective particle swarm optimization (MOPSO) with the RF algorithm for diagnosing heart disease, and suggested the generation of diverse feature sets rather than the traditional bootstrapping of samples.
An intelligent ensemble method for detecting coronary artery disease was contributed by Sapra et al. [48]. This approach focused on cost-effectiveness and rapid prediction of heart disease. Marak et al. [31] proposed a semi-supervised ensemble for cancer diagnosis from gene expression data. This method combined the merits of semi-supervised learning and ensemble learning, and was validated on eight gene expression datasets.
Baccouche et al. [6] proposed a deep learning ensemble model using Bidirectional Long Short-Term Memory (BiLSTM) and Bidirectional Gated Recurrent Unit (BiGRU) models with CNN for the prediction of heart disease, but this technique did not use benchmark datasets to validate the proposed model. Ali et al. [2] proposed a deep learning-based ensemble model with feature fusion for predicting heart disease. This approach used conditional probability and information gain for feature weighting and feature elimination, respectively.
Rath et al. [42] developed a deep learning method for predicting heart disease from imbalanced ECG samples. This method used a Generative Adversarial Network (GAN) to deal with the imbalanced samples and an ensemble of LSTM and GAN for classification. Chen et al. [13] designed a Local Feature based LSTM (LF-LSTM) and a deep learning ensemble for detecting heart rate variability and acceleration. Plawiak et al. [38] proposed a deep ensemble method using a genetic algorithm for cardiac arrhythmia detection from ECG signals. This method fused normalization, a Hamming window and cross-validation for constructing the layers of the deep ensemble.
It can be seen from the related works that ensemble learning can be either homogeneous or heterogeneous: the former uses a single base learning algorithm while the latter uses different base learning algorithms. The effectiveness of the ensemble depends directly on the choice of base learners, which paves the way for extensive research in the area of ensemble classification. Moreover, deep ensemble models can provide higher performance by utilizing the merits of both ensemble and deep learning models. In this research, a dual-layer stacking ensemble that uses three different base learning algorithms in each layer and a deep heterogeneous ensemble are proposed and applied to diagnose heart disease.

Dual Layer Stacking Ensemble (DLSE)
The proposed DLSE approach involves two layers of base learners and a final meta-learner that provides the final prediction. The Enhanced Evolutionary Feature Selection (EEFS) [40] algorithm is used to select the best feature set from the input training set. The resulting training set is then subjected to k-fold Cross Validation (CV): it is split into K disjoint subsets of equal size and one of the K subsets is selected as the validation set. Once the K training sets are constructed, the base learners in layer-1 are trained and validated. We have used three classifiers, NB, DT and SVM, as the base learners in layer-1. The predictions of all three classifiers are recorded and combined with the original training set, so that the new training set given as input to layer-2 is the original training set augmented with the prediction matrix generated in layer-1; layer-1 can thus be considered a feature generator for layer-2. This new training set is again subjected to k-fold CV, resulting in K disjoint subsets of the same size, and once again one subset is chosen at random as the validation set. The base learners in layer-2 are then trained and validated. In layer-2 we have chosen the ensemble classifiers ERT, ABC and RF as base learning algorithms; the second layer uses ensemble classifiers instead of traditional classifiers because ensemble-based classifiers consistently provide better performance than traditional classifiers [20,22,45,47]. All layer-2 predictions are then used to train the meta-classifier GBT, which provides the final prediction. The flow diagram of the proposed DLSE technique is shown in Figure 1.
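The layer-1 stage described above can be sketched in a few lines. This is a minimal illustration of how the layer-1 prediction matrix is appended to the original features to form the layer-2 training set; the three "learners" here are hypothetical stand-in threshold rules, not the actual trained NB, DT and SVM models.

```python
# Sketch of the DLSE data flow: layer-1 predictions are appended to the
# original features to build the layer-2 training set.

def layer1_predict(row):
    """Return the predictions of three hypothetical layer-1 learners."""
    nb_like = 1 if row[0] > 0.5 else 0             # stand-in for NB
    dt_like = 1 if row[1] > 0.3 else 0             # stand-in for DT
    svm_like = 1 if row[0] + row[1] > 0.9 else 0   # stand-in for SVM
    return [nb_like, dt_like, svm_like]

def build_layer2_set(X):
    """Augment each row with the layer-1 prediction matrix."""
    return [row + layer1_predict(row) for row in X]

X = [[0.2, 0.7], [0.9, 0.1], [0.6, 0.6]]
X2 = build_layer2_set(X)
# each layer-2 row = 2 original features + 3 layer-1 predictions
```

In the actual DLSE pipeline the layer-1 predictions come from out-of-fold k-fold CV models, which prevents the layer-2 learners from seeing leaked training labels.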

Feature Selection Using EEFS
Features are selected from the dataset using the EEFS algorithm. EEFS is an evolutionary feature selection algorithm that utilizes the advantages of both GA and LDA. This algorithm treats each individual in a population as a binary string that encodes a feature subset. Therefore, for a dataset S of F features, an individual is represented as an F-bit binary string. The '1' bits in the string correspond to the features that are selected and the '0' bits correspond to the features that are not selected. Table 1 shows the hyperparameter settings for the EEFS algorithm.
The population size is set as 50 with the maximum number of generations assigned a value of 100. The crossover probability and mutation probability are set as 0.8 and 0.1, respectively. The selection scheme used is tournament selection, with all other parameters remaining at their default values. The solver for LDA is set as Singular Value Decomposition (SVD) and the remaining parameters are set to their default values.
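The bitstring encoding and the GA operators described above can be sketched as follows. The crossover and mutation rates follow Table 1 (0.8 and 0.1); the fitness evaluation (LDA accuracy in EEFS) is omitted, and the feature count F is a made-up value for illustration.

```python
import random

# Sketch of the EEFS encoding: each individual is an F-bit string where
# bit i = 1 means feature i is selected.

F = 8  # number of features in a hypothetical dataset

def random_individual(rng):
    return [rng.randint(0, 1) for _ in range(F)]

def crossover(a, b, rng, p=0.8):
    """Single-point crossover applied with probability p."""
    if rng.random() < p:
        cut = rng.randrange(1, F)
        return a[:cut] + b[cut:], b[:cut] + a[cut:]
    return a[:], b[:]

def mutate(ind, rng, p=0.1):
    """Flip each bit independently with probability p."""
    return [1 - bit if rng.random() < p else bit for bit in ind]

def selected_features(ind):
    """Decode a bitstring into the indices of the selected features."""
    return [i for i, bit in enumerate(ind) if bit == 1]

rng = random.Random(42)
a, b = random_individual(rng), random_individual(rng)
c1, c2 = crossover(a, b, rng)
c1 = mutate(c1, rng)
```

In EEFS each offspring bitstring would then be scored by training LDA on the decoded feature subset, with tournament selection choosing the parents of the next generation.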

Layer-1 Base Learners
The first layer of the proposed DLSE method consists of three simple classifiers. Three state-of-the-art classifiers, NB [62], DT [27] and SVM [19,64], have been used as base learning algorithms in layer-1. The Naïve Bayes classification algorithm is well known [29]. It estimates the conditional probability of each class given the observation and chooses the class with the highest posterior probability as the correct answer [50,59]. It is employed in layer-1 because it requires the least amount of storage space to hold the probabilities in both the training and classification stages, making it a good fit for the high-dimensional datasets utilized in our research. SVM is based on statistical learning theory and has since been improved by a number of researchers. In SVM, kernel functions are used to map training samples into a high-dimensional space in a nonlinear way [56]. Several kernel functions, such as polynomial, Gaussian and sigmoid, are used for mapping and for optimizing the separation between data points. SVM's advantages, such as its success in high-dimensional spaces and flexibility in kernel function selection, have made it appealing for a variety of applications, including disease prediction, speech recognition and text categorization. The DT classifier uses a tree-like graph and does not require any domain expertise. It creates conditional probabilities for analysis and selects the optimal path from root to leaf, indicating distinct class separation [49]. It can be used in the medical field to classify and forecast diseases. Moreover, the combination of NB, SVM and DT has proven to be very effective in classification [8,9,26]. Hence, we have chosen these three classifiers for layer-1 of DLSE. The parameter setting for each algorithm is described in Table 2.
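The Naïve Bayes decision rule described above — pick the class with the highest posterior, proportional to the prior times the product of per-feature likelihoods — can be illustrated with a toy computation. All probabilities below are made up for illustration and are not taken from the datasets used in this work.

```python
# Toy illustration of the Naive Bayes rule used in layer-1: the class
# maximizing P(c) * prod_i P(x_i | c) wins.

priors = {"disease": 0.4, "healthy": 0.6}
# hypothetical P(feature present | class) for two binary features
likelihood = {
    "disease": {"high_bp": 0.8, "chest_pain": 0.7},
    "healthy": {"high_bp": 0.2, "chest_pain": 0.1},
}

def posterior_scores(observed):
    """Unnormalized posterior score for each class given observed features."""
    scores = {}
    for c in priors:
        p = priors[c]
        for feat in observed:
            p *= likelihood[c][feat]
        scores[c] = p
    return scores

scores = posterior_scores(["high_bp", "chest_pain"])
prediction = max(scores, key=scores.get)
# disease: 0.4 * 0.8 * 0.7 = 0.224;  healthy: 0.6 * 0.2 * 0.1 = 0.012
```

Since only the argmax matters, the scores need not be normalized into true probabilities, which is part of why NB is so cheap in both storage and computation.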
The DT algorithm uses the gain ratio as the criterion for selecting the attributes to split the tree; the maximum depth is set as 10 and the minimum split size as 4. The gain ratio measure G_Ratio for a split on attribute y over the records S is given by Equation (1),

G_Ratio(y, S) = [H(ŷ, S) − Σ_i (|S_i|/|S|) H(ŷ, S_i)] / SI(y, S),  (1)

where ŷ is the target attribute and SI(y, S) is the split information given by Equation (2),

SI(y, S) = −Σ_i (|S_i|/|S|) log₂(|S_i|/|S|),  (2)

where |S_i|/|S| is the proportion of the number of elements in category i over the total number of records S, and H(ŷ, S) is the entropy measure given by Equation (3),

H(ŷ, S) = −Σ_c p_c log₂ p_c,  (3)

where p_c is the proportion of records in S with target class c. The SVM uses the dot kernel with a maximum of 100000 iterations and a convergence epsilon of 0.001. The rest of the parameters of all the base learners remain at their default values.
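Equations (1)-(3) can be checked on a small worked example. The toy labels and split below are illustrative: a perfectly separating binary split of four records yields an information gain of 1 bit, a split information of 1 bit, and hence a gain ratio of 1.

```python
from math import log2

# Worked example of the gain ratio in Equations (1)-(3) on a toy split.

def entropy(labels):
    """Entropy H over the class labels, Equation (3)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

labels = [1, 1, 0, 0]        # H(labels) = 1.0 bit
groups = [[1, 1], [0, 0]]    # a perfectly separating split

# numerator of Equation (1): information gain
info_gain = entropy(labels) - sum(
    len(g) / len(labels) * entropy(g) for g in groups)
# denominator of Equation (1): split information, Equation (2)
split_info = -sum(
    len(g) / len(labels) * log2(len(g) / len(labels)) for g in groups)
gain_ratio = info_gain / split_info
```

Dividing the gain by the split information penalizes attributes that shatter the data into many small groups, which plain information gain would otherwise favor.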

Layer-2 Base Learners
In layer-2, three ensemble classifiers, ERT [37], ABC [21] and RF [41], are used as base learning algorithms. RF and ERT are two of the most popular averaging methods. Before searching for the best features and split points, RF applies two independent randomized procedures. First, it randomly selects a fixed number of samples from the training set, similar to bagging [24]. Each decision tree is then grown using a randomly selected subset of the input features. By combining these two randomized techniques, RF lowers variance and avoids overfitting.
ERT is similar to RF; however, the bagging approach is not employed when assigning training samples to each base learner. Instead, each base learner is given the same training set. Furthermore, the input feature and its splitting value are picked at random when building the base learners, whereas RF searches for the most discriminative thresholds. ABC allows predictors to be learned in a sequential manner: iterative training is used to adjust the weights of each observation and each base learner, lowering both variance and bias [15]. Moreover, the combination of these classifiers has proven to be effective [65], and hence we have chosen these three classifiers for layer-2 of DLSE. The parameter setting for the layer-2 base learners is shown in Table 3. The ERT classifier uses a random subset just like RF, but random thresholds are drawn for each candidate feature and the best among them is selected as the splitting criterion. ERT uses averaging to minimize over-fitting and maximize accuracy. ABC uses a decision tree as the base estimator with a learning rate of 1. All the learners are configured with 200 estimators and the other parameters remain at their default values. RF uses the Gini index as the splitting criterion with a maximum depth of 10.
The Gini index G_Index is given by Equation (4),

G_Index = 1 − Σ_c D_c²,  (4)

where D_c is the proportion of samples that belong to class c at a particular tree node.
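Equation (4) can be evaluated directly on a toy node. As a sanity check, a pure node (all samples in one class) has a Gini index of 0, while a 50/50 binary node attains the two-class maximum of 0.5.

```python
# Worked example of the Gini index in Equation (4): one minus the sum
# of squared class proportions of the labels at a tree node.

def gini_index(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

pure_node = gini_index([1, 1, 1, 1])   # all one class: impurity 0.0
mixed_node = gini_index([1, 1, 0, 0])  # 50/50 split: impurity 0.5
```

RF picks, among the candidate splits, the one that most reduces this impurity in the resulting child nodes.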

Meta-Learner
The GBT [34,39] classifier is used as the meta-learner. The meta-learner is a regressor that optimizes the least squares regression loss function L_c. At each stage of the regressor, a regression tree is fit to the negative gradient of the loss function L_c, which is given by Equation (5),

L_c = Σ_i l(p_i, E_{c−1}(e_i)),  (5)

where L_c is the loss for the c-th ensemble, p_i is the prediction for input e_i, E_{c−1} corresponds to the previous ensemble, l is the squared error loss and T corresponds to an estimator in the ensemble. A newly constructed tree T_c is fit to minimize the loss L_c given the previous ensemble E_{c−1}, as shown in Equations (6)-(7),

T_c = arg min_T L_c.  (6)

By using Equation (5), we can rewrite Equation (6) as,

T_c = arg min_T Σ_i l(p_i, E_{c−1}(e_i) + T(e_i)).  (7)

The parameter setting for the meta-learner is shown in Table 4. The number of estimators is set as 200, the maximum depth is set to 10 and the learning rate is 0.01. The criterion for measuring the quality of a split is the Friedman mean squared error R_fmse, given by Equation (8),

R_fmse = (w_1 w_2 / (w_1 + w_2)) (x̄_1 − x̄_2)²,  (8)

where w_n is the number of samples in the n-th sub node and x̄_n corresponds to the mean output of the n-th sub node. The final prediction p_i for a given input e_i is given by Equation (9),

p_i = Σ_{c=1}^{C} T_c(e_i),  (9)

where C corresponds to the number of estimators (the n_estimators parameter) and the T_c are the estimators, also called weak learners. The meta-learner uses a fixed number of weak learners.
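The stage-wise fitting of Equations (5)-(7) can be sketched with the weak learner reduced to its simplest possible form: a constant predictor (a depth-0 stump) fit to the residuals, i.e. the negative gradient of the squared loss. This is a deliberately stripped-down sketch, not the full tree-based GBT; the targets are made-up numbers.

```python
# Minimal sketch of gradient boosting with squared loss: each stage fits
# a constant "tree" to the residuals and adds it with shrinkage, so the
# ensemble's loss decreases stage by stage.

def squared_loss(targets, preds):
    return sum((t - p) ** 2 for t, p in zip(targets, preds))

def fit_gbt_constants(targets, n_estimators=200, learning_rate=0.01):
    preds = [0.0] * len(targets)
    for _ in range(n_estimators):
        # the weak learner fits the negative gradient (the residuals)
        residuals = [t - p for t, p in zip(targets, preds)]
        step = sum(residuals) / len(residuals)  # best constant fit
        preds = [p + learning_rate * step for p in preds]
    return preds

targets = [1.0, 0.0, 1.0, 1.0]
before = squared_loss(targets, [0.0] * len(targets))
preds = fit_gbt_constants(targets)
after = squared_loss(targets, preds)
```

With constant weak learners the ensemble converges toward the mean of the targets; real GBT replaces the constant with a regression tree scored by the Friedman MSE of Equation (8), so each stage can also exploit the input features.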



Deep Heterogeneous Ensemble (DHE)
The pseudocode of the proposed DHE algorithm is shown in Algorithm 2. The proposed DHE technique involves one layer of base learners and two layers of meta-learners to provide the final prediction. The first layer consists of three deep learning models: CNN BiLSTM, ANN and RNN. Deep learning models are selected as base learners because they perform extremely well when the dataset and feature sets are large, and they remove the need for manual feature extraction. The dataset is split into a training set U_train and a testing set U_test. The training set U_train is subjected to 10-fold CV to generate the K training sets U_train_k, and one subset is chosen at random as the validation set.

Algorithm 2
The base learners CNN BiLSTM, ANN and RNN are trained and validated using the training subsets U_train_k and the validation set, respectively. The predictions of each base learner are recorded to form the base learner prediction matrix B_p. This prediction matrix is then used to train the level-1 meta-learners of DHE. The RF and ERT algorithms are chosen as the level-1 meta-learners and are trained using the base learner prediction matrix B_p. The second-level predictions are stored to form the level-1 meta-learner prediction matrix M_p, which is then fed as the input to the level-2 meta-learner GBT. The level-2 meta-learner is trained on the predictions from the level-1 meta-learners and the final prediction r is returned as the output. The process flow of DHE is shown in Figure 2.
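The B_p → M_p → final-prediction flow above can be sketched with stand-in models. All the functions below are hypothetical placeholders (simple arithmetic rules), not the trained CNN BiLSTM/ANN/RNN, RF/ERT or GBT models; the sketch only shows how the prediction matrices are chained.

```python
# Sketch of the DHE prediction flow: three base-model prediction columns
# form B_p, two level-1 meta-learner columns form M_p, and the level-2
# learner produces the final label r.

def base_predictions(row):
    """Stand-ins for CNN BiLSTM, ANN and RNN probability outputs."""
    return [min(1.0, row[0] + 0.1), row[1], (row[0] + row[1]) / 2]

def level1_meta(bp_row):
    """Stand-ins for the RF and ERT meta-learners (here: mean and max)."""
    return [sum(bp_row) / len(bp_row), max(bp_row)]

def level2_meta(mp_row):
    """Stand-in for the GBT level-2 meta-learner: threshold the mean."""
    return 1 if sum(mp_row) / len(mp_row) >= 0.5 else 0

X_test = [[0.8, 0.9], [0.1, 0.2]]
B_p = [base_predictions(row) for row in X_test]     # one row per sample
M_p = [level1_meta(bp) for bp in B_p]               # level-1 predictions
r = [level2_meta(mp) for mp in M_p]                 # final predictions
```

Each stage consumes only the previous stage's prediction matrix, which is what distinguishes this stacked design from a single flat ensemble vote.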

Base Learners
The data is trained using three base learners: CNN BiLSTM [43], ANN [55] and RNN [68]. The CNN BiLSTM is a hybrid bidirectional LSTM and CNN architecture comprising 8 convolutional layers, 4 dropout layers, 4 dense layers, 3 max pooling layers and 1 normalisation layer. The ANN consists of 4 dense layers, 3 dropout layers and 1 normalisation layer. Finally, the RNN comprises 3 dense layers, 2 dropout layers and 1 normalisation layer.
The proposed DHE method uses deep learning models as base learners, and these base learners have a number of hyperparameters such as the optimizer, learning rate, number of epochs and so on. Five hyperparameters are selected based on their effect on the performance of the deep learning models. The hyperparameter settings for all the base learners are shown in Table 5. In all three models the activation function is ReLU, the Rectified Linear Unit, one of the most widely used activation functions, which allows deep learning models to be trained easily. The next important parameter is the number of epochs used to train the model; it determines the number of times a training sample is selected in order to update the weights. Too many epochs lead to over-fitting on the training data set, so this parameter needs to be optimised. The CNN BiLSTM model tends to be stable after 50 epochs, while the ANN and RNN models were stable after 60 epochs.
Another parameter that helps to avoid the over-fitting problem is the dropout rate, which ensures the generalisation of the model. The dropout layer allows a fraction of the input units to be dropped during training; the rate ranges between 0 and 1. The CNN BiLSTM and ANN models showed the highest performance for a dropout rate of 0.2, while the RNN model performed better with a dropout rate of 0.3. An optimizer is used to reduce the loss function of the deep learning models; all three models performed extremely well with the 'Nadam' optimizer, which is an Adam optimizer with Nesterov momentum. Finally, the learning rate determines how strongly the optimization algorithm updates the weights. The learning rate of the 'Nadam' optimizer was varied, and all three deep learning models showed stable performance for a learning rate of 0.7.
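The ReLU activation and dropout behaviour discussed above can be stated directly in a few lines of NumPy (an illustrative sketch, not the layers of the actual models):

```python
import numpy as np

def relu(x):
    # ReLU passes positive values through and zeroes out negatives.
    return np.maximum(0.0, x)

def dropout(x, rate, rng, training=True):
    # Inverted dropout: during training a fraction `rate` of units is dropped
    # at random and the survivors are scaled by 1/(1-rate); at inference the
    # input passes through unchanged.
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
a = relu(x)                                   # negatives become 0
d = dropout(a, rate=0.2, rng=np.random.default_rng(0))
```

Because a dropped unit contributes nothing to the forward pass, the network cannot rely on any single unit, which is what gives dropout its regularising effect.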

Level-1 and Level-2 Meta Learners
The level-1 meta-learners used in DHE are RF and ERT, both tree-based ensemble classifiers. RF fits a number of decision trees on different sub-samples of the data and uses averaging to avoid over-fitting. ERT works similarly to RF but introduces additional randomness in the way splits are chosen. The hyperparameter settings for both meta-learners are shown in Table 6.
The level-2 meta-learner is a single meta-estimator, GBT. The GBT uses regression trees fitted to the loss function shown in Equation (5). The parameter settings for GBT are shown in Table 7.

Performance Evaluation
The experiment is performed using a computer with an Intel Core i7 processor with a clock speed of 2.71 GHz, 16 gigabytes of Random-Access Memory (RAM) and an NVIDIA GEFORCE RTX 2070 GPU. Five datasets are used to evaluate the proposed DLSE method: three are from the University of California, Irvine data repository, the fourth is from the ricco data repository, and the last is taken from the National Health and Nutrition Examination Survey (NHANES) repository. The datasets used to evaluate the proposed DLSE method are described in Table 8. Since the proposed DHE uses deep learning models, it is evaluated using three larger datasets with more features and data samples: the MIT-BIH Arrhythmia Dataset, the PTB Diagnostic ECG Dataset and the Longitudinal EHR dataset. The datasets used to evaluate the proposed DHE method are described in Table 9.
The performance of the models is evaluated using the traditional performance metrics accuracy, precision and recall. The efficiency of the proposed DLSE and DHE methods is measured using a confusion matrix, from which the metrics are given by

Accuracy = (TP + TN) / (TP + TN + FP + FN), (10)
Precision = TP / (TP + FP), (11)
Recall = TP / (TP + FN), (12)

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively. The proposed DLSE and DHE models are validated using k-fold cross validation. For this research the value of k is chosen as 10, making it 10-fold cross validation, to estimate the performance of DLSE and DHE. The cross validation is applied on both layers of DLSE.
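These metrics can be read straight off the confusion-matrix counts; a minimal sketch with purely illustrative counts, not results from the paper:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Illustrative counts for a binary heart-disease classifier.
acc, prec, rec = metrics(tp=90, tn=80, fp=10, fn=20)
print(round(acc, 3), round(prec, 3), round(rec, 3))  # → 0.85 0.9 0.818
```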

ANOVA Statistics
The statistical significance of the model is analysed using ANalysis Of Variance (ANOVA) statistics. ANOVA is a statistical test used to determine the difference between group means and their variances, that is, the differences within and across groups. On the same data sets, the F-test is employed to measure the overall deviation pattern, and its results indicate which model best matches the given data set. The ANOVA F-test is also used to determine whether the expected values of the given data sets differ from the values predicted by other classifiers. The value of F is roughly 1 if the null hypothesis is correct, while a large value of F causes the null hypothesis to be rejected. ANOVA condenses all of the data into a single number, F, and assigns a single p-value to the null hypothesis. The F-test statistic is calculated as the ratio of the between-group and within-group mean squares:

F = MS(BG) / MS(WG). (13)

The spread of a group of values is determined by its variability, of which there are two sorts: between-group and within-group. The interaction between the samples defines between-group variability, which is denoted by SS(BG), the sum of squares between groups. If the samples have small distances between them, the value of SS(BG) is small, and hence the deviation from the grand mean is small. The differences within individual samples define within-group variability, which is denoted by SS(WG), the sum of squares within groups. Because each sample is considered independently, there is no interaction between them.

In the context of healthcare data, a within group indicates a single group of persons from many groupings: it can be a group of healthy people (class = 0) or of patients with cardiac disease (class = 1). Thus, within-group variability describes the variability of attribute values within a group of heart disease patients or within a group of healthy people. Between groups, on the other hand, involves multiple kinds of people from the same medical data collection; for example, patients from both classes, those with and without heart disease, are represented between groups. In ANOVA statistics, the SS, df, MS, F, Fcritical and p-value are determined. The sum of squares (SS) is determined between groups using SS(BG) variability and within groups using SS(WG) variability using the formulas:

SS(BG) = Σ n(X̄ − X̄_G)², (14)
SS(WG) = Σ (n − 1)SD², (15)

where X̄ is the mean of the values in a group, X̄_G is the grand mean over all values, SD is the standard deviation of a group, n is the corresponding sample size, and the sums run over the groups.

The variable df stands for "degree of freedom", which refers to the number of values in a data collection that are free to vary; it is widely employed with chi-square and hypothesis-testing statistics. The degrees of freedom of the provided data set are used to determine the validity of the null hypothesis: based on the number of variables and samples in the data set, the degrees of freedom determine whether the null hypothesis can be rejected. The df is calculated separately for between-group and within-group comparisons. The "between-group" degree of freedom equals the number of groups minus one:

df(BG) = m − 1, (16)

where m denotes the number of groups. The "within-group" degree of freedom equals the number of groups multiplied by the number of instances within each group minus one:

df(WG) = m(N − 1), (17)

where N signifies the number of samples inside each group. MS stands for mean square, and it is determined both between groups, MS(BG), and within groups, MS(WG): MS(BG) is obtained by dividing SS(BG) by its degrees of freedom, and MS(WG) by dividing SS(WG) by its degrees of freedom. The Fcritical value is a function of the numerator degrees of freedom, the denominator degrees of freedom, and the significance level α = 0.05. The null hypothesis for ANOVA asserts that all groups have the same mean of the dependent variable. It is preferable to have an F value bigger than Fcritical, since if this value is significant enough we can reject the null hypothesis in favor of the conclusion that the classifiers being compared truly differ. ANOVA has long been a popular method for reviewing and interpreting data in the medical field.

The importance of experimental data can also be determined using the p-value, defined as the likelihood of finding a mean difference between groups given that the null hypothesis is true. A lower p-value, for example p < 0.05, denotes a strong presumption against the null hypothesis and more significant results. For hypothesis tests, the p-value is particularly useful for weighing the strength of the evidence: a large p-value suggests that there is insufficient evidence to reject the null hypothesis (the null hypothesis itself can never be accepted, only rejected or not rejected). The sample findings are usually judged at a significance level (threshold value) of 0.05. However, the Bayesian inference approach [16] suggests that this threshold may be optimistic, and thus establishes a new range in which p < 0.001 denotes an extreme significance level. Assuming that the null hypothesis is true, the p-value represents the chance of drawing a sample from a particular test data set that is as extreme as or more extreme than the observed one. A p-value of 0.05 means that, given the null hypothesis is true, there is only a 5% chance of drawing the sample being tested. The lower the p-value, the more likely the null hypothesis will be rejected.
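The ANOVA quantities described above (SS, df, MS and F) can be verified with a direct implementation; the three groups of scores below are purely illustrative:

```python
def anova_f(groups):
    """One-way ANOVA F statistic from equal-sized groups of values."""
    m = len(groups)            # number of groups
    N = len(groups[0])         # samples per group
    grand = sum(sum(g) for g in groups) / (m * N)
    means = [sum(g) / N for g in groups]
    ss_bg = sum(N * (mu - grand) ** 2 for mu in means)   # sum of squares between groups, SS(BG)
    ss_wg = sum(sum((x - mu) ** 2 for x in g)
                for g, mu in zip(groups, means))         # sum of squares within groups, SS(WG)
    df_bg = m - 1              # between-group degrees of freedom
    df_wg = m * (N - 1)        # within-group degrees of freedom
    ms_bg = ss_bg / df_bg      # mean square between groups
    ms_wg = ss_wg / df_wg      # mean square within groups
    return ms_bg / ms_wg       # F is the ratio of the mean squares

# Three hypothetical groups of classifier accuracies (illustrative only).
f = anova_f([[85, 86, 88, 75], [91, 92, 93, 85], [79, 78, 88, 94]])
print(round(f, 3))  # → 1.476
```

An F this close to 1 would not reject the null hypothesis; in the paper's experiments F exceeds Fcritical, so the null hypothesis is rejected.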

Evaluation of the Proposed DLSE Method
The proposed DLSE method is evaluated against traditional single classifiers and against existing ensemble techniques, and the results are tabulated. We have also compared the DLSE method with a single-layer ensemble method comprising all the classifiers used in both layer-1 and layer-2 (NB, DT, SVM, LR, ERT, ABC and RF) of the proposed DLSE approach.
In the proposed DLSE method, feature selection is applied to the dataset before the training set is passed to layer-1. As mentioned before, the evolutionary feature selection algorithm EEFS is used for feature selection. The set of features selected using EEFS is shown in Table 10. The training set with the selected features is then passed as input to layer-1 of DLSE.
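The EEFS algorithm itself is specified elsewhere in the paper; purely as an illustration of how an evolutionary feature selector can operate, the following sketch evolves binary feature masks with a simple genetic algorithm over a hypothetical filter-style fitness (all data, names and settings here are assumptions, not EEFS itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 3 informative features out of 8.
n, d = 400, 8
X = rng.normal(size=(n, d))
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

def fitness(mask):
    # Score a feature subset by a cheap filter criterion: the summed absolute
    # correlation of the selected features with the label, lightly penalised
    # by subset size. A wrapper method would use classifier CV accuracy here.
    if not mask.any():
        return -1.0
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in np.where(mask)[0]])
    return corr.sum() - 0.05 * mask.sum()

# Standard GA loop: selection, uniform crossover, bit-flip mutation.
pop = rng.random((20, d)) < 0.5
for _ in range(30):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]           # keep the fittest half
    kids = []
    for _ in range(10):
        a, b = parents[rng.integers(10, size=2)]
        child = np.where(rng.random(d) < 0.5, a, b)   # uniform crossover
        child ^= rng.random(d) < 0.1                  # bit-flip mutation
        kids.append(child)
    pop = np.vstack([parents, kids])

best = pop[np.argmax([fitness(m) for m in pop])]
print(np.where(best)[0])
```

The selected feature indices are then used to restrict the training set, exactly as the selected EEFS features restrict the input to layer-1 above.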

Evaluation with Single Classifiers
The performance of the DLSE approach compared with single classifiers is shown in Table 11.

Evaluation with Other Ensemble Techniques
The results of evaluating the proposed DLSE method against state-of-the-art ensemble techniques are shown in Table 12.

Evaluation of Single-Layer and Dual-Layer Classification
We have also compared the proposed DLSE method with a single-layered stacking ensemble using all the base learners (NB, DT, SVM, LR, ERT, ABC and RF) with GBT as the meta-learner. The results are tabulated in Table 13. It can be seen that the DLSE method performs better than a single-layered ensemble of base learners in terms of all the performance metrics, namely accuracy, precision and recall, for all the datasets. The main advantage of a dual-layer ensemble is that it provides more flexibility than a single-layer ensemble: since there is more than one layer, different classifiers can be used in each layer, resulting in a more refined classification. There is also the possibility of splitting an imbalanced classification problem into two relatively balanced problems. The dual-layer ensemble is also scalable for training and classifying hierarchically and can be applied to large medical datasets. Hierarchical classification results in better performance and classification quality than a simple flat structure. Moreover, the empirical evaluation shows that the dual-layered arrangement of classifiers outperforms the single-layered arrangement.

Evaluation of the Proposed DHE Method
The proposed DHE method is evaluated against other popular ensemble techniques such as Boosting, Bagging and Stacking, and the results are tabulated in Table 14. It can be seen from Table 14 that the proposed DHE method outperforms the other ensemble techniques on all three datasets.

Analysis of Statistical Significance
The statistical significance of the proposed models is discussed in this section. For a 95% confidence interval, we determined the p-value. The results show that the p-value is significantly lower than the selected threshold of 0.05. This rejects the null hypothesis, implying that the proposed ensemble classifiers outperform competing classifiers across all datasets. The SS, df and MS are determined, and F, Fcritical and the p-value are calculated and tabulated. Table 15 provides the ANOVA statistics of DHE for the MIT-BIH Arrhythmia, PTB Diagnostic ECG and EHR datasets. Table 16 provides the ANOVA statistics of the DLSE method for the Statlog, SPECTF, SPECT, Eric and NHANES datasets. According to the ANOVA statistics, the suggested framework's results are statistically significant. Tables 15 and 16 present the ANOVA statistics of the proposed ensemble classifiers versus each individual classifier: each individual classifier is compared to the proposed ensemble classifier, and the "between-groups" and "within-groups" variables are calculated. The results show that the F value is greater than Fcritical for all classifiers, indicating that the proposed ensemble classifiers perform well. Furthermore, each classifier's p-value is less than 0.001, indicating that the results for heart disease prediction are strongly significant.

Evaluation of the Proposed Methods with Existing Approaches
The proposed DLSE method was evaluated against the existing ensemble approaches in the literature and the results are tabulated. Table 17 shows the comparison of the accuracy of the proposed DLSE method with existing approaches; the proposed DLSE method obtained the highest accuracy for all the datasets used in this research. The proposed DHE method was evaluated against existing deep ensemble techniques and the results are presented in Table 18; the proposed DHE method obtained the highest accuracy for all three datasets considered in this research. Overall, the results clearly portray the effectiveness of both the DLSE and DHE methods in diagnosing heart disease.

Conclusion and Future Work
Ensemble techniques have existed for over a decade and have been used in the domain of machine learning for classification and prediction; they play a significant part in medical diagnosis for the prediction and classification of diseases. In this work, dual-layer deep ensemble techniques, namely DLSE and DHE, for heart disease classification and prediction were proposed. The proposed DLSE model was applied to five heart disease datasets and the results were analyzed. The proposed method was compared both with the traditional single classifiers NB, DT, SVM and LR and with the state-of-the-art ensemble methods Bagging, AdaBoost, RF and GBT. The empirical analysis shows that the proposed DLSE method excels in terms of accuracy, precision and recall. The proposed DLSE was also compared with a single-layer stacking ensemble comprising all the machine learning approaches used in layer-1 and layer-2 of DLSE, and the results further prove that the proposed dual-layered ensemble approach has higher accuracy than the traditional machine learning methods. The proposed DLSE method achieved an accuracy of 94.21% for the Statlog dataset, 92.34% for the SPECTF dataset, 89.80% for the SPECT dataset and 85.04% for the Eric heart dataset; the highest overall accuracy achieved using the DLSE method is 95.17%, for the NHANES dataset. This strengthens the observation that hierarchical classification results in better performance and classification quality than a simple flat structure. The proposed DHE method was compared with the other ensemble techniques Bagging, AdaBoost and Stacking. The performance evaluation shows that the proposed DHE method outperforms all the other ensemble methods, achieving accuracy rates of 99.50% for the MIT-BIH Arrhythmia dataset, 99.87% for the PTB Diagnostic ECG dataset and 98.03% for the EHR dataset. The proposed DHE method is thus well-suited for larger datasets with a greater number of features.
This also demonstrates that the proposed DHE utilizes the merits of both deep learning and ensemble techniques. Moreover, at a 95% confidence interval, the F value and p-value derived from the ANOVA statistics suggest that the results are statistically significant for all data sets. A major limitation of the proposed approaches is the time taken for training, which was not taken into account in the experiment.
Ensemble classifiers require more training time than individual classifiers. Overall, when compared to individual classifiers and earlier research, the proposed ensembles achieved substantially better results, suggesting that they may be employed as a viable alternative tool in medical decision-making for heart disease detection.
In the future, the proposed DLSE and DHE methods can be applied to the classification and prediction of other diseases such as cancer and diabetes. Measures to reduce the training time of DLSE and DHE by applying parallel processing can be investigated. Furthermore, increasing the number of layers in the proposed method and analyzing the resulting performance can also be explored.