With the growth of digital communication, cybersecurity has attracted increasing attention. An Intrusion Detection (ID) system monitors network traffic to detect malicious activities. This paper introduces a novel Feature Selection (FS) approach for ID. The Reptile Search Algorithm (RSA) is a new optimization algorithm in which each agent searches a new region according to the position of the host, which makes the algorithm prone to getting stuck in local optima and to a slow convergence rate. To overcome these problems, this study introduces an improved RSA approach by integrating Cauchy Mutation (CM) into the RSA’s structure. The CM effectively expands the search space and enhances the performance of the RSA. The developed RSA-CM is assessed on five publicly available ID datasets (KDD-CUP99, NSL-KDD, UNSW-NB15, CIC-IDS2017, and CIC-IDS2018) and two engineering problems. The RSA-CM is compared with the original RSA and three other state-of-the-art FS methods, namely particle swarm optimization, grey wolf optimization, and multi-verse optimizer, and is evaluated quantitatively using fitness value, the number of selected optimum features, accuracy, precision, recall, and F1-score. The results reveal that the developed RSA-CM achieved better results than the competing methods applied for FS on the ID datasets and the examined engineering problems. Moreover, the Friedman test results confirm that RSA-CM is significantly superior to the other methods as an FS method for ID.
Due to the increased internet usage rate caused by the widespread computer networks, security has become one of the most critical areas for research because of the threats and attacks on these networks, which are now more aggressive than before [
With the increase in the number of attacks, cybersecurity companies focus on developing sensitive systems besides traditional security methods [
In order to achieve optimal security requirements of a network, researchers have focused on the use of ML approaches to develop an ID system that can detect such types of attacks more accurately [
Selecting OFS in a given dataset helps ML techniques achieve better prediction and classification results for ID. Most nature-inspired algorithms are Meta-Heuristic (MH) optimization methods. They have gained special attention from scholars in different applications due to their great potential to specify OFS. These methods are effective and reliable gradient-free stochastic optimization techniques that have been successful in various numerical and combinatorial optimization problems with diverse frameworks [
MH algorithms can be combined to achieve better results for FS in different applications. The authors in [
These methods rely on two principles that are characteristic of all optimization techniques: exploration and exploitation. In exploration, the algorithm tries to find different regions in the search area, while in exploitation, the method searches around the solutions obtained in the first phase to find the best solutions. In this paper, an improved version of RSA for ID, named RSA-CM, is introduced. The RSA-CM combines the original RSA with CM to enhance the exploration capability and maintain a balance between the exploration and exploitation of the RSA. The main contributions of this work can be summarized as follows:
An improved version of RSA using CM, named RSA-CM, is introduced for ID. The CM strategy is used to boost the search mechanism of the RSA during the search process. The RSA-CM is examined using five open-access datasets for ID and two popular engineering optimization problems. The results confirm the efficacy of the RSA-CM compared to other MH methods on both the ID datasets and the engineering problems.
This paper is organized as follows: Section 2 provides a brief overview of RSA and CM, followed by a description of the developed method in Section 3. The experimental results and a statistical comparison with other FS methods are presented in Section 4, and Section 5 concludes this paper.
In 2022, Abualigah
The crocodiles’ food search is implemented in RSA using two separate strategies, namely exploration and exploitation. For sequential implementation of these two strategies, the maximum number of iterations is split into four stages. In the first two stages, the crocodiles’ encircling behavior is implemented using the high walking and belly walking movements to effectively explore the region. This stage can be written mathematically as:
Crocodiles’ hunting coordination and cooperation are implemented to exploit the search space. The exploitation stage can be mathematically represented as:
The algorithm terminates after T iterations, and the performance of each candidate OFS is evaluated using a predefined Fitness Function (FF). The OFS is the candidate feature set with the smallest FF value.
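How a candidate feature set might be scored and the best one picked can be sketched as follows. The FF itself is not reproduced in this section, so the weighted form below (classification error plus the ratio of selected features) is an assumption borrowed from common wrapper-based FS practice. The names `binarize`, `fitness`, and `toy_error`, the 0.5 threshold, and `alpha = 0.99` are all hypothetical.

```python
from typing import Callable, List, Sequence


def binarize(position: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Map a continuous agent position to a 0/1 feature mask (assumed rule)."""
    return [1 if x > threshold else 0 for x in position]


def fitness(mask: Sequence[int],
            error_rate: Callable[[Sequence[int]], float],
            alpha: float = 0.99) -> float:
    """Assumed FF: alpha * error + (1 - alpha) * selected / total (lower is better)."""
    selected = sum(mask)
    if selected == 0:
        # An empty feature set cannot be classified, so it is never preferred.
        return float("inf")
    return alpha * error_rate(mask) + (1 - alpha) * selected / len(mask)


def toy_error(mask: Sequence[int]) -> float:
    """Hypothetical stand-in for a classifier's validation error."""
    return 0.02 if mask[0] == 1 else 0.30


# Three hypothetical agent positions over four features.
candidates = [[0.9, 0.1, 0.2, 0.7], [0.2, 0.8, 0.9, 0.6], [0.8, 0.9, 0.9, 0.9]]
scored = [(fitness(binarize(p), toy_error), binarize(p)) for p in candidates]

# The candidate with the smallest FF value is kept as the OFS.
best_fitness, best_mask = min(scored)
```

With a weight close to 1, the error term dominates and the feature-count term mainly breaks ties between subsets of similar accuracy, which is why the first candidate (two features) beats the third (four features) here.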
Several mutation operators have been introduced in the literature to avoid premature convergence and to improve performance. Among them, CM shows a powerful capability due to the long tail of its probability distribution function, which can improve performance and prevent an optimization method from getting stuck in local optima.
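The effect of the long tail can be illustrated with a minimal sketch of a Cauchy-based mutation step. This is not the exact RSA-CM update rule, which is not reproduced in this section. It only shows the general idea of perturbing a position with Cauchy-distributed noise; the function names, the `scale` parameter, and the clipping to `[lower, upper]` are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence


def cauchy_sample(rng: random.Random, loc: float = 0.0, scale: float = 1.0) -> float:
    """Draw one value from Cauchy(loc, scale) using the inverse CDF."""
    u = rng.random()  # uniform in [0, 1)
    return loc + scale * math.tan(math.pi * (u - 0.5))


def cauchy_mutate(position: Sequence[float], lower: float, upper: float,
                  rng: random.Random, scale: float = 1.0) -> List[float]:
    """Add Cauchy noise to every dimension and clip to the search bounds."""
    mutated = []
    for x in position:
        step = scale * cauchy_sample(rng)
        mutated.append(min(upper, max(lower, x + step)))
    return mutated


rng = random.Random(0)
agent = [0.4, 0.6, 0.5]  # hypothetical agent position
mutant = cauchy_mutate(agent, 0.0, 1.0, rng, scale=0.1)

# Most draws are small, but occasional very large ones produce long jumps.
draws = [cauchy_sample(rng) for _ in range(10001)]
```

Most steps stay close to the current position, while the rare large draws let an agent leave the neighbourhood of a local optimum.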
CM is a continuous probability distribution having two parameters, where
In RSA, the exploration phase is performed by encircling the prey, and exploitation is done in the subsequent stages. However, this may cause the method to suffer from premature convergence. Accordingly, CM is integrated into the RSA structure to escape from being trapped in local solutions by allowing RSA to jump and visit new locations in the search space. This helps the RSA control and balance the exploration and exploitation abilities during the search process. The flowchart of the RSA-CM is provided in
For
The performance of the updated solution is calculated using FF, as shown in
The capability of the introduced RSA-CM method to determine the OFS is assessed using five datasets for ID and by comparing it with other FS methods: PSO [
Python is used to implement all the methods used in this work and they are executed on a 3.13 GHz PC with 16 GB RAM and Windows 11 operating system. The parameter settings for all the methods are provided in
Method | Parameters |
---|---|
PSO | |
GWO | |
MVO | |
RSA | |
Common settings | Population size = 32, number of runs = 20, and number of iterations = 100 |
Five real datasets from ID applications are selected to assess RSA-CM efficiency. These datasets are widely used for ID [
Dataset | Source | No. of features | No. of samples |
---|---|---|---|
KDD-CUP99 | [ | 43 | 494,020 |
NSL-KDD | [ | 43 | 125,973 |
UNSW-NB15 | [ | 49 | 540,044 |
CIC-IDS2017 | [ | 78 | 2,827,876 |
CIC-IDS2018 | [ | 80 | 1,048,575 |
The datasets contain a huge number of records for normal activities and network attacks, so applying an iterative FS approach such as an MH method to the full data would be computationally expensive. Hence, only 10% of each dataset is used for FS evaluation, while maintaining the ratio of normal activities to network attacks.
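One way to draw such a class-preserving subset is stratified sampling, sketched below under stated assumptions. The paper does not specify its sampling procedure, so the function name `stratified_sample`, the fixed seed, and the toy label list are hypothetical.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence


def stratified_sample(labels: Sequence[Hashable], fraction: float,
                      seed: int = 0) -> List[int]:
    """Return row indices that keep roughly `fraction` of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen: List[int] = []
    for indices in by_class.values():
        # Keep at least one record per class so rare attacks are not lost.
        k = max(1, round(fraction * len(indices)))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy labels: 80% normal traffic, 20% attacks (illustrative only).
labels = ["normal"] * 800 + ["attack"] * 200
subset = stratified_sample(labels, 0.10)
subset_labels = [labels[i] for i in subset]
```

Sampling inside each class separately keeps the normal-to-attack ratio of the subset equal to that of the full data, up to rounding.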
The quantitative evaluation measures employed to compare the proposed RSA-CM and the other MH methods are as follows:
Fitness value: It quantifies the quality of a solution and guides the search process of the RSA-CM method.
Number of OFS: It illustrates RSA-CM’s ability to reduce the number of features in a given dataset.
Accuracy (AC): It is the classification accuracy averaged over the total number of runs, which is 20 in this work:
Precision (P): It measures the proportion of predicted positives that are actually positive:
Recall (R): It measures the proportion of actual positives that are correctly identified:
F-measure (F): It is the harmonic mean of recall and precision, defined as:
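The classification measures above can be computed from confusion-matrix counts, as in the following minimal sketch. It uses the standard textbook definitions. The function name `classification_metrics` and the example counts are hypothetical, and the averaging over the 20 runs is left out.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard accuracy, precision, recall, and F-measure from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall.
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Hypothetical counts for one run: 90 attacks caught, 20 missed, 10 false alarms.
m = classification_metrics(tp=90, fp=10, tn=880, fn=20)
```

In this example the accuracy (0.97) looks much better than the recall (about 0.82), which shows why accuracy alone can be misleading when normal traffic dominates the data.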
To examine the efficacy of the RSA-CM as an FS method, the real-world datasets provided in
Dataset | Measure | Method | ||||
---|---|---|---|---|---|---|
PSO | GWO | MVO | RSA | RSA-CM | ||
KDD-CUP99 | Mean | 0.0335 | 0.0220 | 0.0199 | 0.0094 | |
Std | 0.0096 | 0.0093 | 0.0073 | 0.0078 | ||
NSL-KDD | Mean | 0.0602 | 0.0746 | 0.0687 | 0.0593 | |
Std | 0.0081 | 0.0102 | 0.0092 | 0.0093 | ||
UNSW-NB15 | Mean | 0.0372 | 0.0318 | 0.0354 | 0.0308 | |
Std | 0.0075 | 0.0057 | 0.0052 | 0.0071 | ||
CIC-IDS2017 | Mean | 0.0261 | 0.0250 | 0.0151 | 0.0208 | |
Std | 0.0084 | 0.0066 | 0.0090 | 0.0082 | ||
CIC-IDS2018 | Mean | 0.0340 | 0.0300 | 0.0402 | 0.0303 | |
Std | 0.0072 | 0.0094 | 0.0093 | 0.0091 |
The results of the proposed RSA-CM and the other MH algorithms based on the mean and standard deviation (Std) of the number of optimum features selected by the corresponding MH algorithm are provided in
Dataset | Measure | Method | ||||
---|---|---|---|---|---|---|
PSO | GWO | MVO | RSA | RSA-CM | ||
KDD-CUP99 | Mean | 40 | 35 | 41 | ||
Std | 5 | 9 | 6 | 7 | ||
NSL-KDD | Mean | 38 | 34 | 39 | 37 | |
Std | 4 | 6 | 5 | 5 | ||
UNSW-NB15 | Mean | 33 | 29 | 37 | 23 | |
Std | 10 | 9 | 6 | |||
CIC-IDS2017 | Mean | 23 | 63 | 49 | 25 | 61 |
Std | 3 | 6 | 7 | 7 | 5 | |
CIC-IDS2018 | Mean | 45 | 49 | 71 | 55 | |
Std | 10 | 10 | 9 |
Dataset | Measure | Method | ||||
---|---|---|---|---|---|---|
PSO | GWO | MVO | RSA | RSA-CM | ||
KDD-CUP99 | Mean | 0.9756 | 0.9860 | 0.9895 | 0.9957 | |
Std | 0.0314 | 0.0385 | 0.0342 | 0.0294 | ||
NSL-KDD | Mean | 0.9481 | 0.9326 | 0.9398 | 0.9488 | |
Std | 0.0231 | 0.0327 | 0.0353 | 0.0726 | ||
UNSW-NB15 | Mean | 0.9702 | 0.9747 | 0.9729 | 0.9743 | |
Std | 0.0420 | 0.0368 | 0.0391 | 0.0205 | ||
CIC-IDS2017 | Mean | 0.9917 | 0.9884 | 0.9863 | 0.9906 | |
Std | 0.0744 | 0.0697 | 0.0835 | |||
CIC-IDS2018 | Mean | 0.9762 | 0.9812 | 0.9761 | 0.9823 | |
Std | 0.0486 | 0.0529 | 0.0584 | 0.0308 |
Dataset | Measure | Method | ||||
---|---|---|---|---|---|---|
PSO | GWO | MVO | RSA | RSA-CM | ||
KDD-CUP99 | Mean | 0.9846 | 0.9821 | 0.9867 | 0.9916 | |
Std | 0.0331 | 0.0581 | 0.0464 | 0.0366 | ||
NSL-KDD | Mean | 0.9165 | 0.9181 | 0.9138 | 0.9256 | |
Std | 0.0519 | 0.0672 | 0.0616 | 0.0346 | ||
UNSW-NB15 | Mean | 0.9633 | 0.9727 | 0.9736 | 0.9745 | |
Std | 0.0822 | 0.0529 | 0.0869 | 0.0573 | ||
CIC-IDS2017 | Mean | 0.9801 | 0.9789 | 0.9753 | 0.9779 | |
Std | 0.0517 | 0.0226 | 0.0852 | 0.0337 | ||
CIC-IDS2018 | Mean | 0.9738 | 0.9752 | 0.9656 | 0.9734 | |
Std | 0.0907 | 0.0620 | 0.0420 | 0.0450 |
The mean and Std of recall for different MH algorithms are compared in
Dataset | Measure | Method | ||||
---|---|---|---|---|---|---|
PSO | GWO | MVO | RSA | RSA-CM | ||
KDD-CUP99 | Mean | 0.9780 | 0.9793 | 0.9865 | 0.9946 | |
Std | 0.0282 | 0.0303 | 0.0270 | 0.0182 | ||
NSL-KDD | Mean | 0.9507 | 0.9463 | 0.9468 | 0.9566 | |
Std | 0.0541 | 0.0916 | 0.0455 | 0.0517 | ||
UNSW-NB15 | Mean | 0.9685 | 0.9736 | 0.9734 | 0.9743 | |
Std | 0.0566 | 0.0495 | 0.0465 | 0.0734 | ||
CIC-IDS2017 | Mean | 0.9949 | 0.9832 | 0.9895 | 0.9953 | |
Std | 0.0418 | 0.0257 | 0.0239 | 0.0227 | ||
CIC-IDS2018 | Mean | 0.9615 | 0.9571 | 0.9653 | 0.9633 | |
Std | 0.0572 | 0.0694 | 0.0634 | 0.0703 |
Dataset | Measure | Method | ||||
---|---|---|---|---|---|---|
PSO | GWO | MVO | RSA | RSA-CM | ||
KDD-CUP99 | Mean | 0.9813 | 0.9807 | 0.9866 | 0.9931 | |
Std | 0.0863 | 0.0726 | 0.0391 | 0.0345 | ||
NSL-KDD | Mean | 0.9333 | 0.9320 | 0.9323 | 0.9414 | |
Std | 0.0220 | 0.0257 | 0.0213 | 0.0197 | ||
UNSW-NB15 | Mean | 0.9659 | 0.9731 | 0.9735 | 0.9744 | |
Std | 0.0737 | 0.0099 | 0.0918 | 0.0704 | ||
CIC-IDS2017 | Mean | 0.9874 | 0.9810 | 0.9823 | 0.9865 | |
Std | 0.0435 | 0.0611 | 0.0703 | 0.0838 | ||
CIC-IDS2018 | Mean | 0.9676 | 0.9661 | 0.9654 | 0.9683 | |
Std | 0.0790 | 0.0491 | 0.0759 | 0.0329 |
Comparative analysis of convergence of RSA-CM and different MH methods is shown in
A boxplot visualizes the distribution of the accuracy results through three quartiles: lower, middle, and upper. A boxplot of all MH algorithms over the five datasets is shown in
The RSA-CM method is employed to solve two constrained engineering problems, Pressure Vessel Design (PVD) and Three-bar Truss Design (TBD), and the results are provided in this section.
In this problem, the PVD seeks to minimize the welding cost of the pressure vessel using the constraints on material and shipping. It consists of four variables, as illustrated in
Minimize
Subject to
Method | Optimal values | | | | Optimal cost |
---|---|---|---|---|---|
PSO | 1.0000 | 0.0000 | 120.0000 | 10.5012 | 2414.0478 |
GWO | 1.2591 | 0.0000 | 65.2298 | 10.0000 | 2101.8663 |
MVO | 1.2614 | 0.0000 | 65.2280 | 10.1553 | 2110.2778 |
RSA | 1.0000 | 0.0000 | 110.0000 | 9.5346 | 2212.5875 |
RSA-CM | 1.2588 | 0.0000 | 65.2252 | 10.0000 |
The TBD problem seeks to minimize the weight of the structure subject to supporting a total load acting vertically downward. The structural geometry of the problem is given in
Minimize
Subject to
The RSA-CM results for solving the problem of TBD are provided in
Method | Optimal values for variables | | Optimal weight |
---|---|---|---|
PSO | 1.3240 | 0.0000 | 325.4535 |
GWO | 1.2591 | 0.0000 | 317.3767 |
MVO | 1.2614 | 0.0000 | 317.6665 |
RSA | 1.2613 | 0.0000 | 317.6539 |
RSA-CM | 1.2588 | 0.0000 |
The Friedman test, a widely used non-parametric two-way analysis of variance by ranks [47], is performed to assess the statistical significance of the differences in the performance evaluation measures among the five MH algorithms on the five datasets, with 20 independent runs. The test assumes a null hypothesis (
Dataset | Metric | PSO | GWO | MVO | RSA | RSA-CM |
---|---|---|---|---|---|---|
KDD-CUP99 | ACC | 2.4 | 3.3 | 2.65 | 3.25 | |
OFS | 2.7 | 3.15 | 2.95 | 2.89 | ||
Fitness | 3.2 | 3.25 | 3.15 | 3 | ||
NSL-KDD | ACC | 3.05 | 3.25 | 2.65 | 3.45 | |
OFS | 2.8 | 3.05 | 3.05 | 3.4 | ||
Fitness | 2.95 | 3 | 2.9 | 2.9 | ||
UNSW-NB15 | ACC | 2.5 | 2.45 | 2.5 | 2.45 | |
OFS | 2.95 | 3.45 | 2.9 | 3.2 | ||
Fitness | 3.3 | 3.6 | 3.25 | 2.9 | ||
CIC-IDS2017 | ACC | 2.6 | 2.9 | 3.2 | 3.25 | |
OFS | 3.6 | 3.35 | 3.05 | 3.05 | ||
Fitness | 3.35 | 3.45 | 2.85 | 2.85 | ||
CIC-IDS2018 | ACC | 2.9 | 3.25 | 3 | ||
OFS | 3 | 3 | 2.85 | 2.7 | ||
Fitness | 3.2 | 3.5 | 3.05 | 2.65 |
Note: Highlight (bold) denotes the best performance of the corresponding metric.
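The ranking procedure behind the Friedman test can be sketched as follows. This minimal version computes mean ranks and the chi-square statistic without tie correction. The function names and the toy fitness values are hypothetical and are not taken from the experiments reported here.

```python
from typing import List, Sequence, Tuple


def rank_row(values: Sequence[float]) -> List[float]:
    """Rank one run's results (1 = smallest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = average
        i = j + 1
    return ranks


def friedman(table: Sequence[Sequence[float]]) -> Tuple[List[float], float]:
    """Return (mean rank per method, Friedman chi-square) for runs x methods data."""
    n, k = len(table), len(table[0])
    rank_rows = [rank_row(row) for row in table]
    mean_ranks = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(m * m for m in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return mean_ranks, chi2


# Toy fitness values: 4 runs (rows) x 3 methods (columns), illustrative only.
runs = [
    [0.030, 0.020, 0.010],
    [0.035, 0.025, 0.012],
    [0.028, 0.031, 0.015],
    [0.040, 0.022, 0.018],
]
mean_ranks, chi2 = friedman(runs)
```

The statistic is compared with a chi-square distribution with k - 1 degrees of freedom. A library routine such as `scipy.stats.friedmanchisquare` also returns the p-value directly, at the cost of an extra dependency.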
Several security solutions based on ML have been developed in recent years, including ID systems. However, the existence of irrelevant or redundant data degrades the performance of ML methods. Therefore, a novel FS method that uses CM to improve the exploration and exploitation capability of the original RSA is presented. The CM is used to expand the search capability of the RSA, which in turn prevents the RSA from getting stuck in local optima and improves its convergence speed. The efficiency of the developed RSA-CM is validated using five open-access datasets in the ID domain and two engineering problems. Its efficiency is also compared with the PSO, GWO, MVO, and RSA methods. The results show that the RSA-CM performs better than the other methods on almost all the datasets and the tested engineering problems in terms of several evaluation metrics used in this work. Moreover, the Friedman test outcomes show that the proposed RSA-CM has the most significant results compared to the other methods. These results make the introduced RSA-CM superior to the other comparative methods and more suitable for use as an FS approach for ID. In future work, we will attempt to use the developed RSA-CM as an FS method in other applications such as text mining, image segmentation, and IoT.
The author received no specific funding for this study.
The author declares no conflicts of interest to report regarding the present study.