Spam has become a serious problem as the volume of spam email grows and recipients regularly receive piles of unwanted messages. Spam not only wastes users' time and bandwidth; it also consumes mailbox storage and disk space. Spam detection is therefore a challenge for individuals and organizations alike. To advance spam email detection, this work proposes a new approach that uses the grasshopper optimization algorithm (GOA) to train a multilayer perceptron (MLP) classifier that categorizes emails as ham or spam. Together, MLP and GOA produce an artificial neural network (ANN) model, referred to as GOAMLP. The approach is evaluated on two corpora, Spam Base and UK-2011 Web spam. The findings provide evidence that the proposed approach achieves better spam detection than the state of the art.
Despite the popularity of social networking services to spread messages over the Internet, email remains at the forefront of social, academic, and business communications [
Systematic literature reviews have shown that ML methods for filtering mail achieve effective categorization. They include ANNs [
Several stochastic global optimization (SGO) approaches demonstrate higher accuracy and computational efficiency than trajectory-driven approaches such as backpropagation (BP). When SGO techniques are applied to train NNs, problems related to BP are resolved [
Some studies utilized several techniques concurrently. For example, research by [
The implementation of the SD approach regarding ANNs trained through GOA, as presented in
MLPs are prevalent in spam detection because of their efficiency in classifying mail as wanted or spam. This sort of tool considers the system's common components, the electronic mailing environment, and statistically significant departures from expected user behavior. These tools have open and extendable architectures that aid in the creation of intelligent character models in the environment. Training determines the structure and variables of the ANN so that it learns the relationship between the incoming patterns and the target outputs. By decreasing the value of the MSE, the ANN structure and weights (
The GOA is a recently proposed swarm-based meta-heuristic [
The parameter
The second part
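The following is a minimal sketch of one common formulation of the GOA position update, included only to make the roles of the shrinking coefficient and the social-interaction term concrete. It is not the authors' MATLAB implementation. The force constants `f=0.5` and `l=1.5`, the upper bound `c_max=1.0`, the population size, the iteration count, and all function names are assumptions; only the lower bound `c_min=0.00004` appears in this paper's parameter table. The distance rescaling used in some reference implementations is omitted for brevity.

```python
import math
import random


def s_force(r, f=0.5, l=1.5):
    """Social force at distance r: attraction term minus repulsion term."""
    return f * math.exp(-r / l) - math.exp(-r)


def shrink_coefficient(t, max_iter, c_max=1.0, c_min=0.00004):
    """Linearly decrease c from c_max (t = 0) to c_min (t = max_iter)."""
    return c_max - t * (c_max - c_min) / max_iter


def goa_step(positions, target, c, lb, ub):
    """Move every grasshopper once; target is the best solution found so far."""
    dim = len(target)
    new_positions = []
    for i, xi in enumerate(positions):
        social = [0.0] * dim
        for j, xj in enumerate(positions):
            if i == j:
                continue
            dist = math.dist(xi, xj) + 1e-12  # avoid division by zero
            for d in range(dim):
                gap = xj[d] - xi[d]
                social[d] += c * (ub[d] - lb[d]) / 2 * s_force(abs(gap)) * gap / dist
        moved = [c * social[d] + target[d] for d in range(dim)]
        # Keep each coordinate inside the search bounds.
        new_positions.append([min(max(v, lb[d]), ub[d]) for d, v in enumerate(moved)])
    return new_positions


def goa_minimize(fitness, dim, n_agents=20, max_iter=50, lb=-1.0, ub=1.0, seed=0):
    """Minimize fitness over [lb, ub]^dim and return (best_position, best_value)."""
    rng = random.Random(seed)
    lbs, ubs = [lb] * dim, [ub] * dim
    positions = [[rng.uniform(lb, ub) for _ in range(dim)] for _ in range(n_agents)]
    target = min(positions, key=fitness)[:]
    best = fitness(target)
    for t in range(1, max_iter + 1):
        c = shrink_coefficient(t, max_iter)
        positions = goa_step(positions, target, c, lbs, ubs)
        for p in positions:
            value = fitness(p)
            if value < best:
                best, target = value, p[:]
    return target, best


def sphere(x):
    """Toy objective used only for demonstration."""
    return sum(v * v for v in x)


best_position, best_value = goa_minimize(sphere, dim=3)
```

When training an MLP, `fitness` would be the training error of the network encoded by the position vector, and `dim` the number of weights and biases.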
The meta-heuristic algorithms have already shown great potential in solving the problem of classification or prediction of ANNs by tuning the
In the final stage, the algorithm is applied to modify the variables of the gradient descent learning algorithms. In our Reference [
Lines 13 to 19 make a small change to the GOA operator: the probability parameter P is tested first, and one of two alternative approaches is chosen to balance the application of the GOA operator to either the ANNs structure or its
The probability of selecting a solution as a parent is proportional to the amount by which its fitness is lower than that of the other solutions. The GOA optimization process is applied to the ANN structure with a 50% probability, and with a 50% probability the optimization process is applied to
Biases are associated with every neuron in the hidden and output layers. The GOAMLP solution is represented by two one-dimensional vectors: 1) the ANNs structure SV, which indicates the number of inputs, the number of hidden layers, and the number of neurons in every hidden layer of the ANNs. 2)
This FF is used to assess the quality of the solutions; the goal is to minimize its value. In essence, this training method is similar to the previous studies [
The experiments for the compared models were performed on a laptop with a Core i5 2.4 GHz CPU and 8 GB RAM, using MATLAB R2014a. The GOAMLP classifier is evaluated on two datasets. The first, Spam Base, consists of 4601 instances with 57 features: 1813 spam and 2788 legitimate emails. The dataset was obtained from the UCI [
Several algorithms were studied to analyse the reliability of the new model. For all algorithms, the shared control variables, the SV solution, and the dimensionality (the number of features in the SD dataset) were set to the same values. Shown in
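As an illustration of how a flat weight-and-bias vector can be scored by an MSE-based fitness function, the sketch below decodes such a vector into a small MLP and computes the mean squared error over a dataset. It assumes a single hidden layer, one sigmoid output, and a particular ordering of the parameters inside the vector; none of these details are taken from the paper, and the names `decode`, `predict`, and `mse_fitness` are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def decode(vector, n_in, n_hidden):
    """Split a flat vector into (hidden weights, hidden biases, output weights, output bias).

    Assumed layout: n_in * n_hidden hidden weights, n_hidden hidden biases,
    n_hidden output weights, then 1 output bias.
    """
    idx = n_in * n_hidden
    w1 = [vector[h * n_in:(h + 1) * n_in] for h in range(n_hidden)]
    b1 = vector[idx:idx + n_hidden]
    w2 = vector[idx + n_hidden:idx + 2 * n_hidden]
    b2 = vector[idx + 2 * n_hidden]
    return w1, b1, w2, b2


def predict(vector, x, n_hidden):
    """Forward pass of the decoded MLP for one input sample x."""
    w1, b1, w2, b2 = decode(vector, len(x), n_hidden)
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)]
    return sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)


def mse_fitness(vector, X, y, n_hidden):
    """Mean squared error between network outputs and targets (lower is better)."""
    errors = [(predict(vector, x, n_hidden) - t) ** 2 for x, t in zip(X, y)]
    return sum(errors) / len(errors)


# Toy usage: 2 inputs, 3 hidden neurons -> 2*3 + 3 + 3 + 1 = 13 parameters.
zero_vector = [0.0] * 13
toy_X = [[0.2, 0.7], [0.9, 0.1]]
toy_y = [0, 1]
toy_mse = mse_fitness(zero_vector, toy_X, toy_y, n_hidden=3)
```

With all parameters at zero, every output is 0.5, so the toy MSE is 0.25; an optimizer such as GOA would search for a vector that lowers this value.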
| Alg. | Parameter | Value | Alg. | Parameter | Value |
|---|---|---|---|---|---|
| MBO | Butterfly adjusting rate | 0.4167 | HS | Harmony memory size | 50 |
| | Max step | 1.0 | | Harmony memory consideration rate | 0.95 |
| | Migration period | 1.2 | | Pitch adjustment rate | 0.1 |
| | Migration ratio | 0.4 | | | |
| ALO | Linear decreased | 2 | DE | Factor of weight | 0.5 |
| ABC | Limit | 100 | CS | Alien eggs/solutions rate | 0.25 |
| GOA | C-min | 0.00004 | PSO | Inertial constant | 0.3 |
| GSA | G0 | 100 | SCA | Random number | [0, 1] |
| WOA | Linearly decreased | 2 to 0 | PBIL | Habitat modification probability | 1 |
| | Random vector | [0, 1] | | Immigration probability bounds per gene | [0, 1] |
| | Coefficient vectors | [−1, 1] | | Step size for numerical integration of probabilities | 1 |
| | Coefficient vectors | [1, 1] | | | |
| | Random number | [−1, 1] | | Maximum immigration and migration rate | 1 |
| | Random number | [0, 1] | | Mutation probability | 0.005 |
Ten measures commonly used to evaluate classifier performance were applied to the GOAMLP SD approach and the compared models. The confusion matrix consists of four values: false negatives (FN), true positives (TP), true negatives (TN), and false positives (FP).
Measure | Equation | Measure | Equation |
---|---|---|---|
Accuracy (ACC) | (19) | Positive predictive value (PPV) | (24) |
False alarm rate (FAR) | (20) | Negative predictive value (NPV) | (25) |
Detection rate (DR) | (21) | F-measure (F1) | (26) |
Sensitivity (SN) | (22) | Matthews correlation coefficient (MCC) | (27) |
Specificity (SP) | (23) | G-mean (G-M) | (28) |
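For reference, the ten measures can be computed from the four confusion-matrix counts as sketched below. The formulas are the standard textbook definitions and are an assumption here, since the paper's own equations did not survive in this text; in particular, DR is taken to equal sensitivity, which is consistent with the DR and SN columns agreeing in the result tables. The function name is hypothetical, and zero denominators are not handled.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Return the ten measures as fractions (multiply by 100 for percentages)."""
    sn = tp / (tp + fn)    # sensitivity (recall)
    sp = tn / (tn + fp)    # specificity
    ppv = tp / (tp + fp)   # positive predictive value (precision)
    npv = tn / (tn + fn)   # negative predictive value
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "DR": sn,
        "FAR": fp / (fp + tn),
        "SN": sn,
        "SP": sp,
        "PPV": ppv,
        "NPV": npv,
        "F1": 2 * ppv * sn / (ppv + sn),
        "MCC": (tp * tn - fp * fn) / mcc_denominator,
        "G-M": math.sqrt(sn * sp),
    }


# Hypothetical counts, used only to exercise the function.
example = confusion_metrics(tp=90, tn=80, fp=20, fn=10)
```

Under these definitions FAR equals 1 − SP, which matches the FAR and SP columns of the result tables up to rounding.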
As mentioned, this study uses two standard datasets to measure performance on data from different domains. Because feature values must be normalized for effective MLP training, min-max normalization was applied. The results for the datasets are described as follows:
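Min-max normalization rescales each feature to the range [0, 1] using that feature's minimum and maximum. A minimal sketch follows; the function name is hypothetical, and mapping constant columns to 0.0 is an assumption made here to avoid division by zero.

```python
def min_max_normalize(rows):
    """Scale every column of a list-of-rows dataset to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return normalized


# Toy data: three samples, three features (the last one is constant).
toy_rows = [[1.0, 10.0, 5.0], [3.0, 20.0, 5.0], [5.0, 40.0, 5.0]]
scaled = min_max_normalize(toy_rows)
```

In practice the minima and maxima are usually taken from the training split and reused for the test split so that no test information leaks into training.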
The results of GOAMLP SD approach and related models are computed using the
No. | Models | ACC | DR | FAR | MCC | PPV | NPV | SN | SP | F1 | G-M | R-ACC | R-DR | R-FAR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | ABCMLP | 73.4 | 81.6 | 0.392 | 0.43 | 0.76 | 0.68 | 0.82 | 0.61 | 0.79 | 70.5 | 10 | 7 | 12 |
2 | ALOMLP | 90.1 | 90.0 | 0.097 | 0.80 | 0.93 | 0.85 | 0.90 | 0.90 | 0.92 | 90.1 | 2 | 2 | 2 |
3 | CSMLP | 88.0 | 88.6 | 0.131 | 0.75 | 0.91 | 0.83 | 0.89 | 0.87 | 0.90 | 87.8 | 3 | 3 | 4 |
4 | DEMLP | 82.5 | 80.6 | 0.147 | 0.65 | 0.89 | 0.74 | 0.81 | 0.85 | 0.85 | 82.9 | 6 | 9 | 5 |
5 | GOAMLP | 94.1 | 94.0 | 0.057 | 0.88 | 0.96 | 0.91 | 0.94 | 0.94 | 0.95 | 94.2 | 1 | 1 | 1 |
6 | GSAMLP | 82.0 | 82.4 | 0.186 | 0.63 | 0.87 | 0.75 | 0.82 | 0.81 | 0.85 | 81.9 | 7 | 6 | 9 |
7 | HSMLP | 71.9 | 71.3 | 0.272 | 0.43 | 0.80 | 0.62 | 0.71 | 0.73 | 0.75 | 72.0 | 11 | 11 | 10 |
8 | MBOMLP | 81.4 | 81.1 | 0.180 | 0.62 | 0.87 | 0.74 | 0.81 | 0.82 | 0.84 | 81.5 | 8 | 8 | 8 |
9 | PBILMLP | 65.8 | 62.3 | 0.289 | 0.33 | 0.77 | 0.55 | 0.62 | 0.71 | 0.69 | 66.6 | 12 | 12 | 11 |
10 | PSOMLP | 81.2 | 79.2 | 0.156 | 0.62 | 0.89 | 0.73 | 0.79 | 0.84 | 0.84 | 81.7 | 9 | 10 | 6 |
11 | SCAMLP | 87.4 | 86.6 | 0.114 | 0.74 | 0.92 | 0.81 | 0.87 | 0.89 | 0.89 | 87.6 | 4 | 5 | 3 |
12 | WOAMLP | 86.2 | 87.4 | 0.156 | 0.71 | 0.90 | 0.81 | 0.87 | 0.84 | 0.88 | 85.9 | 5 | 4 | 6 |
The ALOMLP algorithm came closest to GOAMLP, with an ACC of 90.1%, a DR of 90.0%, and a FAR of 0.097. The CSMLP algorithm ranked third in ACC (88.0%) and DR (88.6%) and fourth in FAR (0.131). The SCAMLP ranked third in FAR (0.114) but fourth in ACC (87.4%). The WOAMLP ranked fifth in ACC (86.2%), fourth in DR (87.4%), and sixth in FAR (0.156). On the other hand, the PBILMLP algorithm had the lowest ACC (65.8%) and DR (62.3%).
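The rank columns (R-ACC, R-DR, R-FAR) can be reproduced from the measured values with standard competition ranking, in which tied models share a rank and the next rank is skipped. That the paper used this scheme is an inference from the tie between PSOMLP and WOAMLP on FAR; the function name is hypothetical. The FAR values below are copied from the Spam Base result table.

```python
def competition_rank(scores, higher_is_better=True):
    """Rank models so that rank = 1 + number of strictly better models."""
    ranks = {}
    for name, value in scores.items():
        if higher_is_better:
            better = sum(1 for other in scores.values() if other > value)
        else:
            better = sum(1 for other in scores.values() if other < value)
        ranks[name] = better + 1
    return ranks


# FAR on Spam Base (lower is better), taken from the table above.
far_spam_base = {
    "ABCMLP": 0.392, "ALOMLP": 0.097, "CSMLP": 0.131, "DEMLP": 0.147,
    "GOAMLP": 0.057, "GSAMLP": 0.186, "HSMLP": 0.272, "MBOMLP": 0.180,
    "PBILMLP": 0.289, "PSOMLP": 0.156, "SCAMLP": 0.114, "WOAMLP": 0.156,
}
r_far = competition_rank(far_spam_base, higher_is_better=False)
```

PSOMLP and WOAMLP both receive rank 6 and MBOMLP receives rank 8, as in the R-FAR column.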
Models | ACC | DR | FAR | MCC | PPV | NPV | SN | SP | F1 | G-M | R-ACC | R-DR | R-FAR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ABCMLP | 84.3 | 85.8 | 0.170 | 0.69 | 0.82 | 0.87 | 0.86 | 0.83 | 0.84 | 84.4 | 8 | 7 | 10 |
ALOMLP | 87.5 | 89.2 | 0.140 | 0.75 | 0.85 | 0.90 | 0.89 | 0.86 | 0.87 | 87.6 | 4 | 3 | 7 |
CSMLP | 86.1 | 86.8 | 0.145 | 0.72 | 0.84 | 0.88 | 0.87 | 0.85 | 0.85 | 86.1 | 6 | 5 | 8 |
DEMLP | 81.2 | 80.6 | 0.182 | 0.62 | 0.80 | 0.83 | 0.81 | 0.82 | 0.80 | 81.2 | 11 | 11 | 11 |
GOAMLP | 92.7 | 93.4 | 0.078 | 0.85 | 0.91 | 0.94 | 0.93 | 0.92 | 0.92 | 92.8 | 1 | 1 | 1 |
GSAMLP | 80.9 | 82.3 | 0.204 | 0.62 | 0.78 | 0.84 | 0.82 | 0.80 | 0.80 | 80.9 | 12 | 9 | 12 |
HSMLP | 86.8 | 83.6 | 0.104 | 0.74 | 0.88 | 0.86 | 0.84 | 0.90 | 0.86 | 86.6 | 5 | 8 | 4 |
MBOMLP | 88.9 | 86.4 | 0.088 | 0.78 | 0.90 | 0.88 | 0.86 | 0.91 | 0.88 | 88.8 | 3 | 6 | 2 |
PBILMLP | 83.8 | 79.6 | 0.125 | 0.67 | 0.85 | 0.83 | 0.80 | 0.87 | 0.82 | 83.5 | 10 | 12 | 5 |
PSOMLP | 84.0 | 81.5 | 0.139 | 0.68 | 0.84 | 0.84 | 0.82 | 0.86 | 0.83 | 83.8 | 9 | 10 | 6 |
SCAMLP | 85.5 | 87.0 | 0.159 | 0.71 | 0.83 | 0.88 | 0.87 | 0.84 | 0.85 | 85.5 | 7 | 4 | 9 |
WOAMLP | 91.6 | 93.0 | 0.097 | 0.83 | 0.89 | 0.94 | 0.93 | 0.90 | 0.91 | 91.7 | 2 | 2 | 3 |
In the area of speed of convergence,
Ref. | Year | DS | Method | EC | Results | Ref. | Year | DS | Method | EC | Results |
---|---|---|---|---|---|---|---|---|---|---|---|
[ | 2012 | UK | SEO | ACC | 89.01 | [ | 2020 | SB | WOAFPA | ACC | 94 |
[ | 2016 | UK | MLP-GD | DR | 82.39 | [ | 2020 | SB | SVM/RF | ACC | 89.2/91.4 |
[ | 2012 | UK | D.F. | ACC | 95.05 | [ | 2020 | SB | SCAC | ACC | 94 |
Our model |  | UK | GOAMLP | ACC | 92.7 | Our model |  | SB | GOAMLP | ACC | 94.1 |
Note: References → Ref; DS → Dataset; EC → Evaluation Criteria; UK → UK-2011Web spam; SB → Spam Base
The difference between the models was tested for statistical significance using a t-test. The analysis shown in
Model | SB t Stat | SB Sig. | UK t Stat | UK Sig. | Model | SB t Stat | SB Sig. | UK t Stat | UK Sig. |
---|---|---|---|---|---|---|---|---|---|
ABC | 4.6E+01 | 5.8E−69 | 2.6E+01 | 1.4E−45 | MBO | 5.9E+01 | 6.3E−79 | 3.0E+00 | 3.7E−03 |
ALO | 4.8E−02 | 9.6E−01 | 1.4E+01 | 3.7E−25 | PBIL | 4.5E+01 | 2.2E−67 | 5.2E+01 | 7.4E−74 |
CS | 5.2E+00 | 9.3E−07 | 1.6E+01 | 1.5E−29 | PSO | 5.7E+01 | 3.6E−77 | 5.2E+01 | 1.2E−73 |
DE | −4.7E+00 | 7.1E−06 | 6.4E+01 | 4.8E−82 | SCA | 1.9E+01 | 7.0E−34 | 6.5E+01 | 1.1E−82 |
GSA | 5.5E+01 | 2.4E−76 | 6.1E+01 | 1.4E−80 | WOA | 6.4E+01 | 5.2E−82 | 4.8E+00 | 5.0E−06 |
HS | 5.1E+01 | 4.6E−73 | 4.3E+01 | 2.2E−65 |  |  |  |  |  |
Note: Sig.
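The text does not state which t-test variant was used or how many runs were compared. As one possible reading, the sketch below computes Welch's two-sample t statistic and its approximate degrees of freedom from two lists of per-run scores. The function name and the sample values are hypothetical and are not results from this study.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = variance(sample_a) / n_a  # squared standard error of mean A
    var_b = variance(sample_b) / n_b  # squared standard error of mean B
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t_stat, dof


# Made-up per-run scores, used only to exercise the function.
runs_a = [1.0, 2.0, 3.0, 4.0, 5.0]
runs_b = [2.0, 3.0, 4.0, 5.0, 6.0]
t_stat, dof = welch_t(runs_a, runs_b)
```

The significance value would then be read from the t distribution with `dof` degrees of freedom, for example with a statistics library; that step is left out to keep the sketch dependency-free.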
This work introduced a novel approach for SD, namely GOAMLP. The focus was on the applicability of the GOA to training an MLP. The performance of the proposed GOAMLP was compared with that of recent SD approaches, using 12 algorithms to train the MLP. Evaluated on the benchmark datasets Spam Base and UK-2011 Web spam, GOAMLP achieved classification accuracies of 94.1% and 92.7%, detection rates of 94.0% and 93.4%, and false alarm rates of 0.057 and 0.078, respectively. These results are better than those of the other models tested on the same datasets in this study. The outcomes show the adequacy of the proposed approach for spam detection. All approaches were measured with regard to the features of the SD datasets.