Computers, Materials & Continua DOI:10.32604/cmc.2022.030878
Article
Swarm Optimization and Machine Learning for Android Malware Detection
1 Centurion University of Technology and Management, Paralakhemundi, Odisha, India
2 Maharaj Vijayaram Gajapathi Raj College of Engineering, Vizianagaram, India
3 Centurion University of Technology and Management, Bhubaneswar, Odisha, India
*Corresponding Author: K. Santosh Jhansi. Email: santosh.jhansi@gmail.com
Received: 04 April 2022; Accepted: 08 June 2022
Abstract: Malware security intelligence constitutes the analysis of applications and their associated metadata for possible security threats. Application Programming Interface (API) calls contain valuable information that can help with malware identification, and analysing malware over a reduced feature space makes that identification more efficient. The goal of this research is to find the most informative API call features to improve android malware detection accuracy. Three swarm optimization methods, viz., Ant Lion Optimization (ALO), Cuckoo Search Optimization (CSO), and Firefly Optimization (FO), are applied to API calls using auto-encoders to identify the most influential features. The nature-inspired wrapper-based algorithms are evaluated using well-known Machine Learning (ML) classifiers: Linear Regression (LR), Decision Tree (DT), Random Forest (RF), K-Nearest Neighbor (KNN), and Support Vector Machine (SVM). A hybrid Artificial Neuronal Classifier (ANC) is proposed to improve the classification of android malware. The experiments yielded an accuracy of 98.87% with just seven of the hundred API call features, i.e., a 93% reduction in the feature set.
Keywords: Android malware; API calls; auto-encoders; ant lion optimization; cuckoo search optimization; firefly optimization; artificial neural networks; artificial neuronal classifier
1 Introduction
Android, a Linux-based mobile Operating System (OS), is the most popular mobile OS worldwide. Unlike other operating systems that are subject to numerous laws and copyrights, Android is open-source, allowing developers from all around the world to contribute to it. Android is frequently targeted by malware because of its widespread usage. Malware poses a threat to data privacy, integrity, and availability. Attackers hide malware inside an Application (App) that looks genuine. App downloads, in particular from unauthorized sources outside the Play Store, are the most prevalent route for malware to infiltrate the android OS. When the victim installs an app from an unprotected source, there is a possibility of a malware attack. Given the huge quantity of harmful applications, more sophisticated malware detection techniques [1–3] are required to avoid these attacks.
Predicting Android malware [4,5] is possible with a number of existing technologies. However, these methods rely on signatures, which are digital traces beneath the code. The signatures are taken from the app’s Android Application Package (APK) and compared to malicious signatures in a database. This type of solution, on the other hand, is incapable of detecting malware that isn’t in the database. As malware is becoming more prevalent, it is vital to present a solution that can accurately detect all varieties of malware [6], while using the least amount of time and resources possible.
Many attempts have been made to identify malware on android devices [7,8]. Traditional, signature-based malware detection methods compare the APK file signature to the signatures of malicious programs in a malware database, and so miss malware that isn’t in the database. There is therefore a need to develop advanced detection techniques [9] for the effective identification of malware. This paper contributes a novel android malware detection approach based on API call features, using a hybrid model of swarm optimization algorithms (ALO, CSO, and FO) along with auto-encoders, evaluated on several ML classifiers. The main contributions of this paper are:
1. Detecting suspicious APIs for accurate classification of goodware and malware android apps.
2. Design and implementation of hybrid auto-encoders and swarm optimization wrapper feature minimization methods.
3. Evaluation of the proposed models using various ML classifiers including novel artificial neuronal classifier.
4. Determining the best algorithm for predicting android malware based on a variety of factors.
The remainder of the paper is organized as follows: Section 2 presents the related work, Section 3 explains the methodology of the proposed work, Section 4 details the experimentation setup, Section 5 describes the performance analysis and experimentation results, and Section 6 presents the conclusion & future work.
2 Related Work
Although the android platform is the most comprehensive, the variety and number of malware have increased dramatically. As a result, researchers began looking for ways to detect and block these dangerous applications [10–12]. Tehrany et al. [13] addressed the problem of imbalanced datasets in android malware detection using statistical analysis. Ranking methods, under-sampling, and the Synthetic Minority Oversampling Technique (SMOTE) are used for pre-processing and balancing the dataset. SVM, KNN, and Iterative Dichotomiser 3 (ID3) classifiers are used for the detection model. The SMOTE technique with the KNN classifier outperformed the other methods compared, with an accuracy of 98.69%. Dharmalingam et al. [14] experimented with Term Frequency-Inverse Document Frequency (TF-IDF) for the identification of malware in android. The permissions are ranked and graded using a permission grader. The graded permissions are classified using Artificial Neural Networks (ANNs), which obtained an accuracy of 94.22% and outperformed the competing algorithms. Yildiz et al. [15] demonstrated a feature selection method based on linear regression for improving the performance of android malware classification. The methodology resulted in an improved accuracy of 96.1% with reduced training time.
Sarah et al. [16] introduced a model based on the recursive feature selection method combined with an ensemble classifier for android malware detection. The influential features are selected using the Recursive Feature Selection method and classified using the LightGBM classifier. The results indicate that the process classifies android malware efficiently, with 99.5% accuracy. Elayan et al. [17] addressed the inefficiency of traditional malware detection techniques by moving to deep learning techniques. A Gated Recurrent Unit (GRU) is employed to separate goodware from malware. The resulting model outperformed the classical methods with 98.2% accuracy. Arif et al. [18] demonstrated a risk-based fuzzy approach using the Analytical Hierarchy Process (AHP) for mobile malware detection. To detect the malware, the risk is divided into four categories (very low, low, medium & high). The overall approach achieved an accuracy of 90.54%.
The method suggested in this paper focuses on incorporating artificial neural networks to improve the effectiveness of android malware application detection [19] and classification [20,21]. First, the most influential features for distinguishing goodware apps from malware apps are identified by employing auto-encoders in wrapper-based swarm optimization feature selection methods. Second, a new artificial neuronal classifier for android malware classification, which combines artificial neural networks with an induction classifier, is tested.
3 Methodology
The architecture of the proposed wrapper-based feature selection technique using auto-encoders for android malware detection is highlighted in Fig. 1. The entire data is split into train and test sets in a 7:3 ratio. The malware analysis process comprises two steps: feature selection & classification. In feature selection, the swarm-intelligence-based ALO, CSO & FO algorithms are examined for iterative feature search selection. The selected features are passed on to auto-encoders to obtain a compressed form of the input features. The auto-encoder’s output is transferred to an induction algorithm to find the fitness of the features in separating malware from goodware. The induction algorithm induces a classifier by mapping the feature space onto a set of class values, which is useful in classifying future cases. Finally, in classification, the reduced feature set is evaluated using popular induction algorithms and the proposed artificial neuronal classifier.
The process of extracting the most consistent, relevant, and non-redundant features to employ in model creation is known as feature selection. The purpose of feature selection is to improve the model’s performance while lowering the modelling cost. It reduces the number of input variables in a machine learning model by eliminating redundancies and unnecessary features and then restricting the set of features to those that are most appropriate for the model.
The primary advantages of performing feature selection in advance, rather than relying on the ML model to determine which features are most important, are simpler models, reduced variance, shorter training time, and avoiding the curse of dimensionality.
Definition 1. For a given inducer I and data set D having features
Among feature selection techniques, the wrapper-based approach identifies the combination of features that maximizes the performance of the model. By evaluating different models as features are added and/or removed using a greedy approach, the most influential features are chosen for model building. The wrapper-based feature selection process is represented in Fig. 2.
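The greedy wrapper search described above can be sketched as forward selection. This is a minimal illustration and not the authors' implementation: `greedy_forward_selection`, `score`, and `toy_score` are hypothetical names, and `score` stands in for training the inducer on a candidate subset and returning its validation performance.

```python
from typing import Callable, List, Tuple


def greedy_forward_selection(
    n_features: int,
    score: Callable[[List[int]], float],
) -> Tuple[List[int], float]:
    """Add one feature at a time while the wrapper score keeps improving.

    `score` is an assumed callback: it should train the inducer on the given
    feature indices and return a value where higher is better.
    """
    selected: List[int] = []
    best_score = float("-inf")
    remaining = set(range(n_features))
    while remaining:
        # Evaluate every one-feature extension of the current subset.
        trials = [(score(selected + [f]), f) for f in sorted(remaining)]
        trial_score, feature = max(trials, key=lambda t: t[0])
        if trial_score <= best_score:
            break  # no extension improves the score, so stop
        selected.append(feature)
        remaining.remove(feature)
        best_score = trial_score
    return selected, best_score


def toy_score(subset: List[int]) -> float:
    """Toy stand-in for a classifier: features 1 and 3 are the informative ones."""
    informative = {1, 3}
    # Reward informative features and charge a small cost per feature kept.
    return len(informative & set(subset)) - 0.01 * len(subset)


if __name__ == "__main__":
    print(greedy_forward_selection(5, toy_score))
```

Each round costs one model evaluation per remaining feature, which is the expense that motivates replacing the greedy search with a population-based one.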
The swarm-intelligence-based ALO, CSO & FO feature selection search techniques are employed to replace the greedy approach. The choice of objective function plays a pivotal role in determining a near-optimal solution during fitness convergence. In the iterative wrapper-based feature selection of the proposed method, the number of selected features and the error obtained from the model after every iteration are used to evaluate the fitness of the chosen features, as shown in Eq. (1).
where,
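A fitness that combines the model error with the number of selected features can be sketched as a weighted sum. The linear form, the function name `wrapper_fitness`, and the weight `alpha` are assumptions made for illustration; the exact form of Eq. (1) is not reproduced here and may differ.

```python
def wrapper_fitness(error_rate: float, n_selected: int, n_total: int,
                    alpha: float = 0.99) -> float:
    """Weighted sum of classification error and fraction of features kept.

    Lower is better. `alpha` (the weight on the error term) is an assumed
    value chosen for illustration, not a setting taken from the paper.
    """
    if not 0 < n_selected <= n_total:
        raise ValueError("n_selected must be between 1 and n_total")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    feature_ratio = n_selected / n_total
    return alpha * error_rate + (1.0 - alpha) * feature_ratio


if __name__ == "__main__":
    # Same error, fewer features: the smaller subset gets the lower fitness.
    print(wrapper_fitness(0.02, 7, 100), wrapper_fitness(0.02, 12, 100))
```

With a weight close to 1 on the error, the feature count mainly acts as a tie-breaker between subsets of similar accuracy.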
Auto-encoders are a class of neural networks used to learn a compressed representation of input data. An auto-encoder consists of sub-models called an encoder and a decoder. The encoder model learns from input features and compresses them, while the decoder tries to regenerate the input from the compressed output of the encoder. Once the encoder model is trained, the decoder is ignored. The trained encoder is now used to extract features from raw data for the purpose of training machine learning models.
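The encoder/decoder idea can be illustrated with a deliberately tiny linear auto-encoder written in plain Python. This is a sketch under simplifying assumptions (no bias terms, no non-linear activations, per-sample gradient descent on squared error) and is not the architecture proposed in the paper; all function names are hypothetical.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matvec(weights: Matrix, x: List[float]) -> List[float]:
    """Multiply a weight matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def encode(enc: Matrix, x: List[float]) -> List[float]:
    """Compressed representation; only the encoder is needed after training."""
    return matvec(enc, x)


def train_autoencoder(data: List[List[float]], n_hidden: int, epochs: int = 500,
                      lr: float = 0.05, seed: int = 0) -> Tuple[Matrix, Matrix]:
    """Train a linear encoder/decoder pair to reconstruct its own input."""
    rng = random.Random(seed)
    n_in = len(data[0])
    enc = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    dec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    for _ in range(epochs):
        for x in data:
            z = encode(enc, x)                      # compressed code
            x_hat = matvec(dec, z)                  # reconstruction
            err = [xh - xi for xh, xi in zip(x_hat, x)]
            # Back-propagate the reconstruction error to the code layer.
            grad_z = [sum(dec[i][j] * err[i] for i in range(n_in))
                      for j in range(n_hidden)]
            for i in range(n_in):
                for j in range(n_hidden):
                    dec[i][j] -= lr * err[i] * z[j]
            for j in range(n_hidden):
                for k in range(n_in):
                    enc[j][k] -= lr * grad_z[j] * x[k]
    return enc, dec


def reconstruction_error(data: List[List[float]], enc: Matrix, dec: Matrix) -> float:
    """Mean squared reconstruction error over the data set."""
    total = 0.0
    for x in data:
        x_hat = matvec(dec, encode(enc, x))
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)
```

In this sketch the decoder exists only to provide a training signal; once training finishes, `encode` alone produces the compressed features that a downstream classifier would consume.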
As shown in Fig. 3, the proposed auto-encoder comprises an encoder with an input layer having N nodes, two hidden layers having
where,
here,
3.1.2 Wrapper-Based Ant Lion Optimized Feature Selection
The Ant Lion Optimizer developed by Mirjalili [22] models the way antlions hunt ants in nature. ALO is designed to find a near-optimal solution regardless of the initial values of the parameters. It converges quickly and can handle integer & discrete constraints. The algorithm implements the steps involved in prey hunting: the random walk of ants, building traps, entrapment of ants in traps, catching prey, and rebuilding the traps.
Initially, all the ants and antlions populations are initialized randomly. For each ant, antlions are selected based on the roulette wheel and a random walk is created and normalized using lines 9–10 and Eq. (5).
In Algorithm 1,
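The two building blocks named above, roulette-wheel selection and a normalized random walk, can be sketched as follows. This is an illustration under assumptions: the normalization is written as a plain min-max rescaling into given bounds, which is how the ALO random walk is usually described, and the function names are hypothetical rather than taken from Algorithm 1.

```python
import random
from itertools import accumulate
from typing import List, Sequence


def roulette_wheel(weights: Sequence[float], rng: random.Random) -> int:
    """Return an index with probability proportional to its weight.

    Assumes non-negative weights where larger means fitter; for a
    minimization problem the caller would pass, e.g., inverted costs.
    """
    total = sum(weights)
    if total <= 0:
        return rng.randrange(len(weights))  # degenerate case: pick uniformly
    pick = rng.uniform(0.0, total)
    running = 0.0
    for index, weight in enumerate(weights):
        running += weight
        if pick <= running:
            return index
    return len(weights) - 1  # guard against floating-point round-off


def random_walk(n_steps: int, rng: random.Random) -> List[int]:
    """Cumulative sum of +1/-1 steps, starting at 0."""
    steps = [1 if rng.random() > 0.5 else -1 for _ in range(n_steps)]
    return [0] + list(accumulate(steps))


def normalize_walk(walk: Sequence[float], lower: float, upper: float) -> List[float]:
    """Min-max rescale the walk so that it stays inside [lower, upper]."""
    lo, hi = min(walk), max(walk)
    if lo == hi:
        return [lower for _ in walk]  # flat walk: nothing to rescale
    return [(x - lo) * (upper - lower) / (hi - lo) + lower for x in walk]


if __name__ == "__main__":
    rng = random.Random(0)
    chosen_antlion = roulette_wheel([0.2, 0.5, 0.3], rng)
    print(chosen_antlion, normalize_walk(random_walk(10, rng), 0.0, 1.0))
```

The rescaling keeps each ant's walk inside the search bounds, which in ALO shrink around the selected antlion as the iterations progress.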
3.1.3 Wrapper-Based Cuckoo Search Optimized Feature Selection
The Cuckoo search algorithm developed by Yang et al. [23] models the way cuckoos lay their eggs in the nests of other host birds. Each cuckoo lays an egg and drops it in a randomly chosen nest. The best nests (solutions) with high-quality eggs are passed down to the following generation. The number of available host nests is fixed, and a host discovers an alien egg with a probability in (0, 1). All of the nests are randomly initialized during the initiation phase; once the iterations begin, the cuckoo traverses the solution space by modifying the nests through
A few nests are abandoned and replaced with new nests at the end of an iteration using line 11 of Algorithm 2 where,
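Reference [23] describes cuckoo search as proceeding via Lévy flights, so the two moves discussed here can be sketched with a Lévy-distributed step and a nest-abandonment pass. This is a hedged illustration: the step uses Mantegna's method with an assumed exponent `beta = 1.5`, the abandonment probability is passed in as `pa`, and the function names are hypothetical rather than taken from Algorithm 2.

```python
import math
import random
from typing import List


def levy_step(rng: random.Random, beta: float = 1.5) -> float:
    """Draw one heavy-tailed step length using Mantegna's method."""
    numerator = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    denominator = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma_u = (numerator / denominator) ** (1 / beta)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0)
    while v == 0.0:  # avoid dividing by zero in the (very unlikely) exact-zero draw
        v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)


def abandon_nests(nests: List[List[float]], pa: float, lower: float,
                  upper: float, rng: random.Random) -> List[List[float]]:
    """Replace each nest with a fresh random one with probability `pa`."""
    renewed = []
    for nest in nests:
        if rng.random() < pa:
            renewed.append([rng.uniform(lower, upper) for _ in nest])
        else:
            renewed.append(list(nest))
    return renewed


if __name__ == "__main__":
    rng = random.Random(1)
    nest = [0.4, 0.6, 0.1]
    # Move the nest by a small Levy-distributed perturbation (step scale 0.01 is assumed).
    moved = [x + 0.01 * levy_step(rng) for x in nest]
    print(moved, abandon_nests([nest, moved], 0.25, 0.0, 1.0, rng))
```

Most Lévy steps are small, with occasional long jumps, which lets the search refine good nests while still escaping local optima.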
3.1.4 Wrapper-Based Firefly Optimized Feature Selection
The Firefly Optimization algorithm described by Lindfield et al. [24] mirrors the way fireflies use light to attract other fireflies. A firefly's attractiveness is directly proportional to its light intensity and decreases as the distance between two fireflies grows. The attraction between any two fireflies is therefore governed by brightness: the firefly with the lower brightness moves towards the firefly with the greater brightness. When there is no brighter firefly nearby, the firefly moves randomly. The attractive force between two given fireflies is calculated using line 15 of Algorithm 3 at a distance
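The attraction rule can be sketched with the commonly used firefly formulation, in which attractiveness decays exponentially with squared distance. The parameters `beta0`, `gamma`, and `alpha` and their default values are assumptions for illustration, and the function names are hypothetical rather than taken from Algorithm 3.

```python
import math
import random
from typing import List, Optional


def attractiveness(beta0: float, gamma: float, distance: float) -> float:
    """Attractiveness beta0 * exp(-gamma * r^2): strongest at r = 0, fading with distance."""
    return beta0 * math.exp(-gamma * distance * distance)


def move_firefly(dim_pos: List[float], bright_pos: List[float], beta0: float = 1.0,
                 gamma: float = 1.0, alpha: float = 0.2,
                 rng: Optional[random.Random] = None) -> List[float]:
    """Move the dimmer firefly towards the brighter one, plus a small random step."""
    rng = rng or random.Random()
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(dim_pos, bright_pos)))
    beta = attractiveness(beta0, gamma, distance)
    return [a + beta * (b - a) + alpha * (rng.random() - 0.5)
            for a, b in zip(dim_pos, bright_pos)]


if __name__ == "__main__":
    # With alpha = 0 the move is purely the attraction term.
    print(move_firefly([0.0, 0.0], [1.0, 1.0], alpha=0.0))
```

For feature selection, the continuous positions would still need to be mapped to a binary keep/drop mask per feature; that mapping is omitted here.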
The induction/classification algorithms [25,26] employed to evaluate the proposed android malware detection system are LR, DT, RF, KNN & SVM. Apart from these classifiers, a new hybrid classifier that combines artificial neural networks with an induction algorithm, called the artificial neuronal classifier, is proposed in this paper.
Artificial Neuronal Classifier
The architecture of the proposed artificial neuronal classifier is highlighted in Fig. 4. The artificial neuronal classifier is an association of artificial neural networks [27] with the induction classifier [28]. The features are given as input to the ANN, which is trained to understand the patterns among the features and the correlation between them. The knowledge acquired by the ANN is transferred to the induction classifier. The induction classifier induces the knowledge space into the learning algorithm to maximize the accuracy in separating malware from goodware.
After several experiments, the number of layers in ANN of ANC are configured as an input layer with N nodes, three fully connected hidden layers having M nodes each, another fully connected hidden layer with
where, M represents no. of nodes in hidden layer, N indicates no. of features, and
here,
The normalized mean of the gradients
The final update rule for each weight is given in Eq. (14).
After adjusting the weights of the neural network, the error between the predicted output and the actual output is calculated using a loss function. The Mean Absolute Error (MAE) is used in this model, which is given in Eq. (15).
here,
4 Experimentation Setup
All experiments are performed on a 64-bit Windows 10 operating system with a 2.30 GHz Intel® Core™ i5 processor, a 2 TB hard drive, and 8 GB RAM, using the Jupyter platform configured to support machine learning and deep learning packages. The programming language used is Python 3.7.
The API call sequence data used in the experiment is collected from IEEE DataPort. The data contains 43,876 API call sequences with 100 features each; 42,797 of them are malware and 1,079 are goodware API call sequences. The size of the dataset is 17.1 MB. The data was collected using the Cuckoo Sandbox environment and verified using VirusTotal [29].
5 Performance Analysis and Experimental Results
The proposed android malware detection system is evaluated with the classification metrics Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Precision, Recall, F1-Score, and Accuracy, which are defined as follows:
where,
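The listed metrics can be computed from predicted and true labels as in the sketch below, which uses the standard textbook definitions based on true/false positives and negatives. It assumes binary labels encoded as 0 and 1 with 1 as the positive (malware) class; the function name is hypothetical.

```python
import math
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                           positive: int = 1) -> Dict[str, float]:
    """Compute accuracy, precision, recall, F1, MSE and RMSE for binary labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # With 0/1 labels the squared error is 1 for each misclassification,
    # so MSE equals the misclassification rate.
    mse = sum((t - p) ** 2 for t, p in pairs) / len(pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mse": mse,
        "rmse": math.sqrt(mse),
    }


if __name__ == "__main__":
    print(classification_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1]))
```

Because the dataset is heavily skewed towards malware, precision, recall, and F1 are worth reading alongside accuracy.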
In experiment no. 1, the ALO, CSO & FO algorithms wrapped with LR, DT, RF, SVM, and KNN are evaluated for their performance on the API call sequence dataset. All the experiments are run for 10 iterations with 10 agents. The results of the analysis are listed in Tab. 1 and illustrated graphically in Fig. 5. The FO optimizer wrapped with the KNN classifier obtained the best results, with an 88% reduced feature set and an accuracy of 98.29%.
In experiment no. 2, all the features are first passed to the auto-encoders to obtain a latent space vector, which is then passed to the wrapper feature selection search methods. The FO wrapper search method produced the best accuracy of 98.53% with the KNN classifier, with an 88% reduction in the dimensionality of the feature set. The experiment results are listed in Tab. 2 and illustrated graphically in Fig. 6.
Furthermore, in experiment no. 3, ALO, CSO & FO are wrapped with the proposed artificial neuronal classifier to evaluate the performance of the classifier. Of all the combinations tested within the artificial neuronal classifier, the ANN combined with RF achieved the highest accuracy of 98.87% with a 93% reduced feature space. The experimental results are listed in Tab. 3 and illustrated graphically in Fig. 7.
The experimental results indicate that the wrapper-based firefly optimized feature selection (WFOFS) algorithm reduced the dimensionality of the feature space while maintaining classification accuracy using ANC. The evaluation metrics of the artificial neuronal classifier with the WFOFS algorithm are presented in Fig. 8. The 7 features selected by the wrapper-based firefly feature selection algorithm using the artificial neuronal classifier are listed in Tab. 4.
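The reported reduction percentages follow from simple arithmetic on the 100 API call features in the dataset, as the short sketch below shows. The helper names are hypothetical.

```python
def reduction_percent(n_total: int, n_kept: int) -> float:
    """Percentage of features removed when n_kept of n_total remain."""
    return 100.0 * (n_total - n_kept) / n_total


def features_kept(n_total: int, reduction_pct: float) -> int:
    """Number of features remaining after the given percentage reduction."""
    return round(n_total * (100 - reduction_pct) / 100)


if __name__ == "__main__":
    print(reduction_percent(100, 7))   # 7 of 100 features kept
    print(features_kept(100, 88))      # features left after an 88% reduction
```

Keeping 7 of 100 features corresponds to a 93% reduction, while an 88% reduction leaves 12 features.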
The Area Under the Curve of the Receiver Operating Characteristic (AUC_ROC) curves are generated for the FO algorithm wrapped with variants of ANC, which performed better than the comparison algorithms. Compared with the complete feature set, the area under the ROC curve of the ANC classifier embedded with RF is smaller when it is employed with the reduced feature set. The AUC_ROC graphs for all the variants of the ANC classifier are shown in Figs. 9–13. According to the literature survey, no existing feature selection algorithm applies wrapper-based firefly optimization to API call sequence data. Hence, an AUC_ROC comparison with related work on API call sequence data is presented in Tab. 5.
6 Conclusion & Future Work
This work investigates a hybrid dimensionality reduction strategy for feature space using an auto-encoder and swarm optimization. Initially, auto-encoders are given the entire feature set to investigate patterns among the features. To identify the most influential features, the gathered knowledge is fed into wrapper-based feature selection approaches. The machine learning model is subsequently trained to distinguish between good and bad Android applications using the restricted feature set. In addition to focusing on dimensionality reduction, this research proposes an artificial neuronal classifier, which combines artificial neural networks and machine learning approaches.
The wrapper-based feature selection techniques WALOFS, WCSOFS & WFOFS reduced the dimensionality of the feature space to a greater extent when incorporated with auto-encoders. Among ALO, CSO & FO wrapped with ML algorithms, FO with the KNN classifier achieved the best results, reducing the feature set by 88% with 98.29% accuracy without auto-encoders, and by 92% with 98.53% accuracy using auto-encoders. When ALO, CSO & FO were embedded with auto-encoders and wrapped with the artificial neuronal classifier, WALOFS outperformed the other algorithms, reducing the feature set by 93% while maintaining an improved classification accuracy of 98.87%.
We will examine the performance of other deep learning models using a variety of feature extraction processes with dynamic datasets in the future. Furthermore, we would like to put classification systems to the test with adversarial threats based on malware imaging techniques. We would like to learn more about how neurons in each layer contribute to the feature extractor process for malware detection.
Funding Statement: The authors received no specific funding for this study.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
References
1. Q. Han, V. S. Subrahmanian and Y. Xiong, “Android malware detection,” IEEE Transactions on Information Forensics and Security, vol. 15, pp. 3511–3525, 2020.
2. A. De Lorenzo, F. Martinelli, E. Medvet, F. Mercaldo and A. Santone, “Visualizing the outcome of dynamic analysis of android malware with VizMal,” Journal of Information Security and Applications, vol. 50, pp. 1–8, 2020.
3. J. Xu, Y. Li, R. Deng and K. Xu, “SDAC: A slow-aging solution for android malware detection using semantic distance based API clustering,” IEEE Transactions on Dependable and Secure Computing, vol. 19, no. 3, pp. 1–15, 2020.
4. A. Mahindru and A. L. Sangal, “A feature selection technique to detect malware from android using machine learning techniques,” Multimedia Tools and Applications, vol. 80, pp. 13271–13323, 2021.
5. H. Hasan, B. T. Ladani and B. Zamani, “MEGDroid: A model-driven event generation framework for dynamic android malware,” Information and Software Technology, vol. 135, no. 106569, pp. 1–16, 2021.
6. X. Liu, X. Du, Q. Lei and K. Liu, “Multifamily classification of android malware with a fuzzy strategy to resist polymorphic familial variants,” IEEE Access, vol. 8, pp. 156900–156914, 2020.
7. S. I. Rani and N. M. Sahib, “Detection of malware under android mobile application,” in 2020 3rd Int. Conf. on Engineering Technology and its Applications, Najaf, Iraq, pp. 179–184, 2020.
8. J. Jiang, S. Li, M. Yu, G. Li, C. Liu et al., “Android malware family classification based on sensitive opcode,” in IEEE Symp. on Computers and Communications (ISCC), Barcelona, Spain, pp. 1–7, 2019.
9. W. Wang, Y. T. Li, T. Zou, X. Wang, J. Y. You et al., “A novel image classification approach via dense-mobile net models,” Mobile Information Systems, vol. 2020, pp. 1–8, 2020.
10. N. Daoudi, K. Allix, T. F. Bissyande and J. J. Klein, “Lessons learnt on reproducibility in machine learning based android malware detection,” Empirical Software Engineering, vol. 74, pp. 1–53, 2021.
11. Z. H. Qaisar and R. R. Li, “Multimodal information fusion for android malware detection using lazy learning,” Multimedia Tools and Applications, vol. 81, pp. 12077–12091, 2021.
12. H. Rathore, S. K. Sahay, P. Nikam and M. Sewak, “Robust android malware detection system against adversarial attacks using q-learning,” Information Systems Frontiers, vol. 23, pp. 867–882, 2021.
13. D. Tehrany and A. Rasoolzadegan, “A new machine learning-based method for android malware detection on imbalanced dataset,” Multimedia Tools and Applications, vol. 80, pp. 24533–24554, 2021.
14. V. P. Dharmalingam and P. Visalakshi, “A novel permission ranking system for android malware detection-the permission grader,” Journal of Ambient Intelligence and Humanized Computing, vol. 12, pp. 5071–5081, 2021.
15. O. Yildiz and I. A. Dogru, “A novel permission-based android malware detection system using feature selection based on linear regression,” International Journal of Software Engineering and Knowledge Engineering, vol. 29, no. 2, pp. 245–262, 2019.
16. N. A. Sarah, F. Y. Rifat, M. S. Hossain and H. S. Narman, “An efficient android malware prediction using ensemble machine learning algorithms,” Procedia Computer Science, vol. 191, pp. 184–191, 2021.
17. O. N. Elayan and A. M. Mustafa, “Android malware detection using deep learning,” Procedia Computer Science, vol. 184, pp. 847–852, 2021.
18. J. M. Arif, M. F. A. Razak, S. R. T. Mat, S. Awang and N. S. N. Ismail, “Android mobile malware detection using fuzzy AHP,” Journal of Information Security and Applications, vol. 61, pp. 1–35, 2021.
19. W. Wang, J. Wei, S. Zhang and X. Luo, “LSCDroid: Malware detection based on local sensitive API invocation sequences,” IEEE Transactions on Reliability, vol. 69, no. 1, pp. 174–187, 2019.
20. H. Gao, S. Cheng and W. Zhang, “GDroid: Android malware detection and classification with graph convolutional network,” Computers & Security, vol. 106, no. 102264, pp. 1–14, 2021.
21. A. A. Taha and S. S. J. Malebary, “Hybrid classification of android malware based on fuzzy clustering and the gradient boosting machine,” Neural Computing and Applications, vol. 33, pp. 6721–6732, 2021.
22. S. Mirjalili, “The ant lion optimizer,” Advances in Engineering Software, vol. 83, pp. 80–98, 2015.
23. X. S. Yang and S. Deb, “Cuckoo search via Lévy flights,” in World Congress on Nature & Biologically Inspired Computing, Coimbatore, India, pp. 210–214, 2009.
24. G. Lindfield and J. Penny, Nature-inspired Optimization Algorithms. Academic Press, Elsevier, pp. 85–100, 2017.
25. A. Mahindru and A. L. Sangal, “MLDroid—framework for android malware detection using machine learning techniques,” Neural Computing & Applications, pp. 5183–5240, 2020.
26. P. Ravi Kiran Varma, M. S. K. Reddy, K. Santosh Jhansi and D. Pushpa Latha, “Bat optimization algorithm for wrapper based feature selection and performance improvement of android malware detection,” IET Networks, vol. 10, no. 3, pp. 131–140, 2021.
27. M. Kinkead, S. Millar, N. M. Laughlin and P. O. Kane, “Towards explainable CNNs for android malware detection,” Procedia Computer Science, vol. 184, pp. 959–965, 2021.
28. A. Ananya, A. Aswathy, T. R. Amal, P. G. Swathy, P. Vinod et al., “SysDroid: A dynamic ML-based android malware analyzer using system call traces,” Cluster Computing, vol. 23, pp. 2789–2808, 2020.
29. O. Angelo and S. R. Jose, “Behavioral malware detection using deep graph convolutional neural networks,” International Journal of Computer Applications, vol. 174, no. 29, pp. 1–8, 2019.
30. A. Cannarile, V. Dentamaro, S. S. Galantucci, A. A. Iannacone, D. Impedovo et al., “Comparing deep learning and shallow learning techniques for API calls malware prediction: A study,” Applied Sciences, vol. 12, no. 3, pp. 1–16, 2022.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.