Development of Proposed Ensemble Model for Spam e-mail Classification

Classifying spam e-mail documents is a challenging task for e-mail users, especially non-IT users. Billions of people use the internet and face the problem of spam e-mails. Automatic identification and classification of spam e-mails helps users manage large volumes of e-mail. This work aims to make a significant contribution by building a robust model for the classification of spam e-mail documents using data mining techniques. In this paper, we use the Enron1 data set, which consists of spam and ham documents collected from the Kaggle repository. We propose Ensemble Model-1, an ensemble of Multilayer Perceptron (MLP), Naïve Bayes (NB) and Random Forest (RF), to obtain better accuracy for the classification of spam and ham e-mail documents. Experimental results reveal that the proposed Ensemble Model-1 outperforms existing classifiers as well as the other proposed ensemble models in terms of classification accuracy. The proposed Ensemble Model-1 achieves an accuracy of 97.25% for the classification of spam e-mail documents.


Introduction
Much of the previous research on data mining has focused on structured data. In practice, however, a valuable share of the available information is stored in text databases. A text database is a large collection of documents gathered from various sources such as news stories, books, digital libraries, research papers, e-mails, web pages and social media sites.
These days, the vast majority of data in government, industry, business and other organizations is stored electronically as text databases [23]. E-mail is one of the most significant and fastest communication media for sharing information from one user to another. The main reason why spam e-mails keep increasing in mailboxes is a lack of awareness among Internet users. For this reason, the classification of spam e-mail text (documents) is a significant research problem.
Various reputed labs publish quarterly spam reports to raise awareness among Internet users. According to the Kaspersky Lab report for the first quarter of 2018 (Q1 2018) [44], the largest source of spam was Vietnam with 9% of spam e-mails, while India was in 4th position with 7.1%, and the average share of spam in global e-mail traffic was 51.82%. In the second quarter (Q2 2018) [44], the largest source of spam was China with 14.36%, while India was in 11th position with 2.11%, and the share of spam in global e-mail traffic was 49.66%. In the third quarter (Q3 2018) [44], the largest source of spam was again China with 13.47%, while India was in 9th position with 2.84%, and the share of spam in global e-mail traffic was 52.54%. According to the Kaspersky Lab report for the first quarter of 2019 (Q1 2019) [44], the largest source of spam was China with 15%, while India was in 9th position with 2%, and the average share of spam in global e-mail traffic was 55.97%.
Spam e-mail is unwanted e-mail sent by spammers for their own purposes. The immense quantity of spam e-mail is a major problem in terms of network bandwidth consumed, storage space used in mailboxes and the time users spend deleting or managing these messages.
In a nutshell, this research work contributes the following: (1) pre-processing of the Enron1 data set; (2) analysis of different individual and well-known ensemble data mining based classification techniques using the Enron1 data set; (3) development of the proposed ensemble model based on data mining based classification techniques; (4) comparative analysis with other existing models.
The remaining part of this paper is organized as follows: Section 2 explores the review of literature related to spam e-mail classification, Section 3 explores the framework of spam e-mail classification using the proposed method and also explores different methods and materials used in this research work, Section 4 elaborates the experimental results, Section 5 analyses the results and finally Section 6 concludes the research work and also gives the future direction.

Related Works
Many researchers have worked in the area of spam e-mail classification using different machine learning techniques, and their findings are an important reference for exploring new directions of research.
Dedeturk and Akay [14] proposed a new spam detection technique that combines the artificial bee colony algorithm with logistic regression, and evaluated it on three different datasets to handle high-dimensional data with high accuracy. Saidani et al. [32] used text semantic analysis to improve the performance of spam detection models. They also suggested a technique for selecting automatically extracted semantic features for spam detection in the respective domain. Harisinghaney et al. [17] discussed the detection and classification of text- as well as image-based spam and ham e-mail data. They used three classification algorithms, namely K-Nearest Neighbors, Naive Bayes and reverse DBSCAN, for the classification of spam e-mails. The performance of these classifiers was evaluated before and after preprocessing of the data and produced satisfactory results in terms of accuracy, precision, sensitivity and specificity. Méndez et al. [25] suggested feature selection based on a semantic ontology to form groups of words for filtering spam e-mails. They used Latent Dirichlet Allocation, information gain, a generative statistical model and semantics based feature selection techniques to design a spam e-mail filter. Kaur et al. [21] focused on two interlinked problems of spam detection and classification. They also identified various research gaps for future work. Dalkilic and Sipahi [13] developed a spam detection model that analyses the IP addresses of A and MX records using the Sender Policy Framework (SPF) protocol. Barushka and Hajek [5] proposed a novel spam filtering approach known as DBB-RDNN. They compared the performance of the proposed spam filter with different machine learning approaches and achieved better accuracy.
Palival et al. [30] studied the limitations of spam blacklisting systems and signature based systems and proposed the ID3 algorithm, which is based on the decision tree technique, for spam filtering. The algorithm produced better accuracy than the others. Varghese et al. [42] suggested Naïve Bayes classification algorithms using the Mahout framework to analyse execution time and accuracy. Dada and Joseph [12] used the RF machine learning algorithm in the WEKA environment. They developed a robust spam e-mail filter with a smaller number of features. Borde et al. [7] used various classification techniques such as Naïve Bayes, Perceptron and C4.5 and compared their performance for the classification of spam and ham documents. They suggested the Naïve Bayes classifier, which provided better accuracy than the other algorithms. Chouhan [9] used the SVM lite tool with four kernel functions for the classification of spam e-mails. The dataset was also processed to calculate different utility functions such as term frequency (TF), inverse document frequency (IDF) and TF-IDF. The study suggested that the SVM classifier is better for the classification of spam and ham e-mails. Dada and Bassi [11] suggested the Logistic Model Tree Induction algorithm in the WEKA environment for spam e-mail filtering and achieved better accuracy than other conventional techniques. Saleh et al. [33] proposed the Negative Selection Algorithm for the identification and classification of spam e-mails. The proposed method gave the highest accuracy of 93.14% with the Enron1 spam e-mail data set. Diale et al. [15] proposed novel feature extraction and feature dimension reduction techniques to reduce space complexity and improve the computational performance of classifiers such as SVM, RF and the C4.5 decision tree for the classification of spam e-mails.
Bahgat et al. [4] suggested the WordNet ontology, semantic based methods and similarity measures for reducing the extracted textual features, and thereby the space and time complexities. Principal Component Analysis (PCA) and Correlation Feature Selection (CFS) were used to reduce space complexity, and a semantic filtering approach combined with the feature selection techniques achieved high computational performance. Ordás et al. [29] developed the Concept Drift Analyzer tool for recognizing ham and spam e-mails with high accuracy using the K-fold cross-validation technique. Naveiro et al. [28] analysed adversarial risk classification using the Naïve Bayes algorithm and the ACRA framework. Basto-Fernandes et al. [6] formulated rule based anti-spam filtering as a multi-objective optimization problem. Yu et al. (2020) [47] proposed a new technique for generating new phishing e-mail data that can be used to train a classifier with high quality data. Venkatraman et al. [43] proposed the integration of Naïve Bayes (NB) with a conceptual and semantic similarity technique for the classification of spam e-mails. Dada et al. [10] presented a systematic survey of machine learning techniques for spam e-mail classification and examined their application to different spam e-mail datasets. Mohammad [27] proposed a novel model called ELCADP for lifelong spam e-mail classification. The model was compared with other techniques for the classification of spam e-mail documents and gave better results. Yu et al. [48] proposed a novel spam filtering analyser that generates new spam samples and thereby improves the generalization of the classifier. Hota et al. [18] proposed a novel Remove Replacement Feature Selection Technique (RRFST) along with two decision tree techniques for the classification of phishing e-mails.
The above literature review reveals that the identification and classification of spam e-mails is a very challenging task. It also shows that the strengths of existing classification techniques can be combined to develop new models. Most researchers have concentrated on classification combined with feature selection techniques. This literature motivates the development of a new ensemble model that helps e-mail users protect their information from unauthorized persons.

A Framework of Spam e-mail Documents Classification
This research work proposes an ensemble model for the classification of spam and ham e-mail documents. The proposed ensemble model combines different data mining based classification techniques to achieve better classification accuracy. In this architecture, we first pre-process the spam and ham e-mail documents and merge the separate folders of spam and ham e-mail documents into a single folder. The e-mail dataset is then divided into training and testing partitions using 10-fold cross validation. Finally, the training and testing data are fed to the individual classifiers as well as to the ensemble classifiers.

Enron1 Data Set
The Enron1 dataset is a collection of spam and ham documents collected from the Kaggle repository [45]. The dataset consists of 5172 spam and ham e-mail documents, of which 1500 are spam e-mails and 3672 are ham e-mails.

Cross Validation
K-fold [23] cross validation is commonly used to evaluate the performance of machine learning techniques. It randomly partitions the data into k folds of approximately equal size; each fold is used once as the testing set while the remaining k-1 folds form the training set. In this research work, we performed K-fold cross validation with k=10, so the dataset was split 10 times, each time into a training set (90% of the total dataset) and a testing set (10% of the total dataset).
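
The partitioning step described above can be sketched in a few lines of standard-library Python. This is only an illustration of the general k-fold procedure, not the code used in the experiments; the function name `k_fold_indices`, the fixed seed and the toy size of 100 samples are assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition of the data

    # Spread the samples as evenly as possible: the first n % k folds get one extra.
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size

    # Each fold serves once as the testing set; the other k-1 folds form the training set.
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

# Toy usage with 100 hypothetical documents: 10 splits of 90 training / 10 testing samples.
splits = list(k_fold_indices(100, k=10))
```

Every document appears in exactly one testing set, so the accuracy averaged over the 10 splits uses each document for testing exactly once.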

Figure 1
Flow of proposed work for classification of Spam e-mail

Machine Learning Techniques
Machine learning (ML) [46] is a subset of artificial intelligence concerned with learning from data, analysing data and extracting relevant knowledge from large datasets. The main aim of ML techniques is to design and develop robust models that perform well on the data. ML can be categorized into supervised, semi-supervised, unsupervised and reinforcement learning. This research work uses supervised learning algorithms for the classification of spam and ham e-mail documents. The supervised machine learning techniques used in this research work are discussed below:

Decision Tree (DT)
Decision tree is one of the most popular and well-known data mining based techniques for classification and prediction tasks. Each node in a decision tree is either a decision node or a leaf node, where a leaf node represents the value of the target attribute for the instances that reach it. A decision tree recursively splits the dataset into subsets so that each subset is as homogeneous as possible with respect to the target variable.
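
A single splitting step can be illustrated as follows. The text above does not say which homogeneity measure is used, so Gini impurity is assumed here purely for illustration; the helper names `gini` and `best_threshold` and the toy feature (how often a hypothetical word occurs in each e-mail) are also assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure subset, 0.5 for a balanced two-class subset."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split `value <= threshold`."""
    best = (None, gini(labels))  # baseline: no split at all
    for t in sorted(set(values))[:-1]:  # the largest value cannot separate anything
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical feature: occurrences of one word in six e-mails.
word_counts = [0, 1, 1, 4, 5, 6]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]
split = best_threshold(word_counts, labels)  # (1, 0.0): both subsets are pure
```

A full decision tree repeats this search over all features and then recurses into each resulting subset until the subsets are sufficiently homogeneous.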

K-Nearest Neighbor (K-NN)
K-NN is a data mining technique that is widely used in classification, prediction and pattern recognition. It is a supervised machine learning technique in which the model is built from training samples and then tested with testing samples. Each training sample is described by n attributes and therefore represents a point in n-dimensional space; a new sample is assigned the class that is most common among its k nearest training samples.
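
The idea can be sketched with a few lines of Python. Euclidean distance and k=3 are assumptions chosen for the example, since the text does not fix a distance measure or a value of k; the function name `knn_predict` and the two-dimensional toy points are likewise hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort all training samples by Euclidean distance to the query point.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy training set: each e-mail is a point in 2-dimensional feature space.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

near_ham = knn_predict(points, labels, (0.5, 0.5))   # "ham"
near_spam = knn_predict(points, labels, (5.5, 5.5))  # "spam"
```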

Support Vector Machine (SVM)
SVM [49] is a supervised learning technique for solving classification problems in which each input tuple is associated with one class label. SVM is used for both linear and nonlinear data classification. It is based on the concept of a hyperplane that divides the n-dimensional data space into two regions. The hyperplane is chosen so that it maximizes the margin between the two regions. The margin is the distance between the hyperplane and the closest instances of each region, which are called support vectors.

Multilayer Perceptron (MLP)
Multilayer Perceptron [36] is an extension of the simple perceptron in which additional hidden layers are included. Because it contains one or more hidden layers between the input and output layers, it is called a multilayer perceptron. The MLP structure runs from the input layer to the first hidden layer, from the first hidden layer to the second and so on, and finally from the last hidden layer to the output layer. MLP can handle non-linear data. It is a supervised machine learning technique that can be used for classification and prediction.

Ensemble Technique
Ensemble technique [24] is a strategy that combines two or more models to improve accuracy compared to the individual models. The main purpose of the ensemble technique is to increase accuracy and avoid the drawbacks of individual models. In this research paper we have used the Random Forest (RF), Bagging and Boosting (AdaBoosting and Gradient Boosting) ensemble methods. We have also used a voting scheme for combining data mining based classification techniques.

Random Forest (RF)
RF [31] is an ensemble classifier that combines many decision trees. The main motive of this ensemble classifier is to achieve better accuracy than the individual trees. RF is well suited to very large training datasets with a very large number of input features. An RF classifier is typically a combination of tens or hundreds of decision trees.

Bagging and Boosting
Bagging and boosting [24] are two well-known ensemble methods that can be used to combine models. The main aim of these methods is to improve the performance of the model. Both bagging and boosting can be used for classification as well as prediction. In this research work we have used Bagging, AdaBoosting and Gradient Boosting for the classification of spam e-mail documents.

Voting Scheme
Voting scheme [24] is a meta classifier and an important ensemble technique that combines any set of classifiers through majority voting. The final class label is the one predicted by the majority of the classifiers. The final class label Fj is defined as Fj = mode{C1, C2, C3, ..., Cn}, where C1, C2, C3, ..., Cn are the labels predicted by the individual classifiers that participate in the voting. This research work has used the voting scheme to develop the proposed classifier.
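
The majority-vote rule Fj = mode{C1, ..., Cn} can be sketched as follows. The prediction lists are made-up values for three hypothetical base classifiers (labelled MLP, NB and RF only to mirror the composition of Ensemble Model-1); they are not outputs of the actual experiments, and the function names are illustrative.

```python
from collections import Counter

def majority_vote(votes):
    """Return the mode of the labels predicted by the individual classifiers."""
    # On a tie, Counter.most_common keeps the label that was seen first.
    return Counter(votes).most_common(1)[0][0]

def ensemble_predict(per_classifier_predictions):
    """Combine one list of predictions per classifier, sample by sample."""
    return [majority_vote(votes) for votes in zip(*per_classifier_predictions)]

# Hypothetical predictions of three base classifiers for four e-mails.
mlp_pred = ["spam", "ham", "ham", "spam"]
nb_pred = ["spam", "spam", "ham", "ham"]
rf_pred = ["ham", "spam", "ham", "spam"]

final = ensemble_predict([mlp_pred, nb_pred, rf_pred])
# final == ["spam", "spam", "ham", "spam"]
```

With an odd number of classifiers and two classes, as in this sketch, a tie cannot occur, so the tie-breaking behaviour never comes into play.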
Accuracy [38] is one of the important measures to check the performance of any model. It is the ratio of the correctly classified positive and negative samples to the total number of samples as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.
Sensitivity [38] is also called True positive rate (TPR), hit rate, or recall. It is the ratio of the correctly classified positive samples to the total number of positive samples as follows:

Sensitivity = TP / (TP+FN)
Specificity [38] is also called True negative rate (TNR), or inverse recall and is expressed as the ratio of the correctly classified negative samples to the total number of negative samples as follows:

Specificity = TN / (TN+FP)
Precision [19] is the proportion of samples classified as positive that are actually positive.

Precision = TP / (TP + FP)
F-score [19] is the harmonic mean of precision and recall.

F-score = 2 × Precision × Recall / (Precision + Recall)
Receiver operating characteristic (ROC) [22] is another important measure to check the performance of a model. The ROC curve represents the trade-off between True Positive Rate (sensitivity) and False Positive Rate (1 - specificity) of a predictive model, where TPR is plotted on the y-axis and FPR on the x-axis. The ROC curve thus shows the balance between true positives and false positives.
Area under the ROC curve (AUC) [22] summarizes the ROC curve as a single performance measure by calculating the area under it. The AUC score is always bounded between zero and one.
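
The measures above can be computed directly from the four confusion-matrix counts, as in the following sketch. The counts and scores are invented toy values, not results from this study; treating "spam" as the positive class and the function names are assumptions. The AUC is computed through its rank interpretation (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one), which equals the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true positives, true negatives, false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Compute the measures defined above (no guard against empty classes)."""
    sensitivity = tp / (tp + fn)  # also called recall or TPR
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f_score": 2 * precision * sensitivity / (precision + sensitivity),
    }

def auc_score(y_true, scores, positive="spam"):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy confusion matrix: 40 TP, 50 TN, 5 FP, 5 FN.
metrics = classification_metrics(tp=40, tn=50, fp=5, fn=5)

# Toy scores (higher means "more likely spam") for two spam and two ham e-mails.
auc = auc_score(["spam", "spam", "ham", "ham"], [0.9, 0.4, 0.5, 0.1])
```

In the toy AUC example, three of the four spam/ham pairs are ranked correctly, so the score is 0.75.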

Experimental Results
The experiments were carried out using Python (Jupyter Notebook) in an Anaconda environment on the Windows 7 operating system. Python is widely used for web development, scientific computing, image processing, data analysis, machine learning and deep learning. In this research work, we propose an ensemble model and check its robustness and efficiency using different performance measures, namely accuracy, sensitivity, specificity, precision, F-score, ROC curve and AUC score, on the Enron1 dataset, which is a collection of spam and ham documents. We propose four ensemble models, namely Ensemble Model-1, Ensemble Model-2, Ensemble Model-3 and Ensemble Model-4, for the classification of spam and ham documents. This research work uses individual classifiers, well-known ensemble classifiers and the proposed ensemble models for classifying spam and ham e-mail documents, as shown in Table 1. Table 1 shows the accuracy of individual classifiers, existing ensemble classifiers and proposed ensemble classifiers, where Naïve Bayes (NB) gives the highest accuracy among the individual classifiers.
Table 4 shows the comparative analysis of various individual classifiers and the proposed ensemble models in terms of AUC score, in which our proposed Ensemble Model-1 gives the highest AUC score compared to the others. Finally, we conclude that the proposed Ensemble Model-1 is recommended for the classification of spam and ham e-mails.

Results Analysis
The proposed Ensemble Model-1 gives better classification accuracy than models previously developed by other researchers on the Enron1 dataset, as shown in Table 5. Table 5 indicates that our proposed Ensemble Model-1 is an effective and robust model for the classification of spam and ham e-mails.

Conclusions and Future Work
This study has shown that the proposed Ensemble Model-1 outperforms other existing spam e-mail filtering methods in terms of classification accuracy. More importantly, it classifies both spam and ham documents at satisfactory levels. In the comparative analysis, the proposed Ensemble Model-1 outperformed previous approaches and existing classifiers on the Enron1 dataset. The novelty of the proposed model lies in combining classifiers to obtain accurate results for the classification of spam and ham e-mail documents. However, further experiments on other datasets are needed to show that the proposed model is not limited to the classification of spam and ham e-mail documents. In future, the proposed model can be applied to high-dimensional, imbalanced text classification problems such as news and social network data, as well as sentiment analysis.