One aspect of cybersecurity is the study of malice in Portable Executable (PE) files. Artificial Intelligence (AI) can be employed in such studies, since AI is able to discriminate benign from malicious files. In this study, a set of 29 features was collected from trusted implementations and used as a baseline for the work presented in this research. Two algorithms were utilized: a Decision Tree (DT) and a Neural Network Multi-Layer Perceptron (NN_MLPC), chosen after testing several diverse procedures. This work implements a method of subgrouping features to answer questions such as: which features have a positive impact on accuracy when added? Is it possible to determine a reliable feature set to distinguish a malicious PE file from a benign one? Does combining features affect the accuracy of malware detection in a PE file? Results obtained using the proposed method were improved and yielded several observations. The obtained results have a practical part and a numerical part. For the practical part, the number of features used, the specific features included and the combination of features are the main factors impacting the calculated accuracy. For the numerical part, enhanced accuracy values were found; for example, NN_MLPC attained 0.979 and 0.981, while DT attained 0.9825 and 0.986.
Many implementations have used Portable Executable (PE) header features to search for malware in such files. More than one implementation has shown that some or many of these features can be employed to differentiate between benign and malicious files. Each work implements its own strategy for studying and extracting the features that may affect malware detection accuracy. Previous implementations raised observations that needed to be studied and verified: the number of features used, the combination of the feature set and the specific features included may all affect the accuracy level. This work studies those observations.
This work assumed, based on previous observations, that the selected features have the required positive impact on accuracy; selecting the features was not a matter of collecting a random feature set. It was also assumed that the number of features can be reduced while accuracy is improved, and that the combination of features and the inclusion of particular features could affect the accuracy positively. Again, these were not random assumptions; they were based on previous indications that needed to be proved.
Malware is software that causes intentional harm by changing, adding, or removing code in order to destroy a program's intended function. Significant limitations can be found in malware detection technology, despite the numerous studies conducted [
A variety of techniques has been implemented by modern malware detectors [
Machine learning utilizes the behavioral and structural features of both benign and malicious files when building a malware classification model. This model identifies samples as benign or infected [
The Portable Executable (PE) is originally the native Win32 executable format. The PE specification was derived from the UNIX Common Object File Format (COFF); the necessary information is encapsulated inside the data structure that represents the PE. That information is used by the MS-Windows operating system loader so that management of the executable code can be carried out [
According to [
Many previous implementations have discussed the differences between traditional signature-based malware detection and machine-learning-based techniques; it is mentioned in [
A third technique utilized today that incorporates machine learning is the heuristic technique, which has proven successful in many fields [
The authors of [
According to [
On the other hand, an information-hiding system is presented by [
According to [
In [
Previous observations showed that PE header features, when incorporated in detecting malware, can have either a positive or a negative effect on detection accuracy. This work therefore studied the different circumstances that could affect the accuracy: the number of features, the collection of features used, whether some particular features have more impact on accuracy than others, and whether it is possible to reach an optimized set of features that can always be used to discriminate between benign and malware files. This required a strategy for selecting the features in each run. Subgrouping the features according to previous observations, and setting a criterion that defined high accuracy, good accuracy and low accuracy, were the main factors in reaching the aims of this work.
The remainder of this paper is organized as follows: Section 2 introduces the related work. Section 3 explains why portable executables are targeted in this work. Section 4 explains the utilized algorithms. Section 5 presents the work criteria. Section 6 illustrates the work methodology, presenting the subgrouping method. Section 7 discusses the results with charts. Section 8 presents a discussion that includes the work's observations and a comparison to previous work. Finally, conclusions are presented in Section 9.
Many implementations have aimed at avoiding hazards due to the misuse of PE files, using different aspects to detect malware.
The authors of [
The authors of [
An approach to differentiate malware and benign .exe files was introduced in [
The important role of features in a PE file is presented in [
To detect malware, J. Bai et al. [
The authors of [
Obtaining accurate decisions for malware classification and intrusion attacks through feature selection and extraction was emphasized by [
The authors of [
Raff et al. [
Deep learning together with an Artificial Immune System (AIS) was used in [
Maleki et al. [
The authors of [
The authors of [
The authors of [
The authors of [
Reference [
A histogram of instruction opcodes was used in [
The authors of [
A packing detection framework (PDF) is presented by [
KDD’99 data-set was used in [
The work of [
Long Short Term Memory (LSTM) has been used by [
Both the attacker's point of view and the defender's point of view were considered by [
Detecting malicious code using deep learning models was done by [
An investigation into the classification accuracy of malware was done by [
The Portable Executable format defines both the section layout and the executable form. All .dll, .exe and .sys files are portable executables, which in the MS-Windows system take the form of a binary data structure. Portable executables include the file-description information that is used by the MS-Windows loader; in addition, the program code is hosted by the portable executable [
A Portable Executable (PE) has three different headers: the DOS header, the file header and the optional header; each header includes a set of structures. According to previous implementations, some of these structures are considered to have an impact on accuracy, and those structures are used as features. This work collected the previously employed structures and used them to carry out its experiments.
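As an illustration of how such header structures can be read as numeric features, the sketch below uses the open-source pefile library; this is not necessarily the tool used in this work, the selection of fields is only a sample of those listed in Appendix A, and the file path is a placeholder.

```python
# Illustrative sketch (not the authors' extraction code): reading a few
# of the PE header fields listed in Appendix A with the pefile library.
import pefile

def extract_header_features(path):
    pe = pefile.PE(path)  # parses the DOS, File and Optional headers
    return {
        "DllCharacteristics": pe.OPTIONAL_HEADER.DllCharacteristics,
        "AddressOfEntryPoint": pe.OPTIONAL_HEADER.AddressOfEntryPoint,
        "SizeOfCode": pe.OPTIONAL_HEADER.SizeOfCode,
        "Machine": pe.FILE_HEADER.Machine,
        "NumberOfSections": pe.FILE_HEADER.NumberOfSections,
    }

# features = extract_header_features("sample.exe")  # placeholder path
```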
Some or many portable executable header features (structures) can be a source of threat to the operating system if they are used to host malware.
Accordingly, several works were presented in the field of studying those portable executable header features, and different methods were utilized to extract the most sensitive and effective features. Each work suggested a set of features to study; some features, such as DllCharacteristics, were common to more than one study.
The authors of [
The authors of [
Liao [
This work employed a wide range of features: the majority of the features used by previous implementations, except six. One of them does not apply to the current work, and four of them did not appear in the control dataset [
Appendix A includes the list of features used in this work.
The algorithms used in this work were decided after many tests on different algorithms; some of the tested algorithms were clustering algorithms, others were classification algorithms. Among the clustering algorithms, K-Means clustering was tested; among the classification algorithms, Random Forest, Naïve Bayes and Support Vector Machine (SVM) were tested. It is stated in [
Besides the enhanced results obtained from these two approaches, there were other reasons that led to selecting them. A neural network can exhibit human-like intelligence; it has the ability to simulate the human brain [
With a Decision Tree (DT), predictions are built by representing the values in categorical and numerical forms [
The performance metric used in this work was the accuracy, defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where:
TP is the True Positive count: a sample with a positive label that is predicted to be positive.
TN is the True Negative count: a sample with a negative label that is predicted to be negative.
FP is the False Positive count: a sample with a negative label that is mistakenly predicted to be positive.
FN is the False Negative count: a sample with a positive label that is mistakenly predicted to be negative.
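As a minimal sketch with scikit-learn (the package used in this work), the snippet below shows that the formula above matches the library's accuracy_score; the label arrays are placeholders.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # placeholder labels: 1 = malware, 0 = benign
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # placeholder predictions

# For binary 0/1 labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = (tp + tn) / (tp + tn + fp + fn)
assert acc == accuracy_score(y_true, y_pred)
print(acc)  # 0.75 for the placeholder arrays above
```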
Many previous implementations have studied the PE file header features; each implementation employed different strategies to extract the effective and sensitive features, and each extracted a different number of features. Some of the extracted features were common between implementations, such as DllCharacteristics. As an example, [
Twenty-nine features were used in this work. The source dataset used to extract the required features was obtained from [
The algorithms used in this work were the Decision Tree and the Neural Network Multi-Layer Perceptron. The (test_size, random_state) combinations used were (0.3, 10) and (0.15, 3) for both algorithms. These values were selected after tests and recommendations from previous implementations [
Accordingly, four runs are implemented for each case study. These runs are referred to as follows:
- NN_MLPC_0.3_10 refers to (test_size 0.3, random_state 10) in the Neural Network Multi-Layer Perceptron.
- NN_MLPC_0.15_3 refers to (test_size 0.15, random_state 3) in the Neural Network Multi-Layer Perceptron.
- DT_0.3_10 refers to (test_size 0.3, random_state 10) in the Decision Tree.
- DT_0.15_3 refers to (test_size 0.15, random_state 3) in the Decision Tree.
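A sketch of how these four runs map onto scikit-learn is given below. Hyperparameters other than test_size and random_state are not reported here, so library defaults are assumed; X and y stand for the feature matrix and labels loaded beforehand.

```python
# Sketch of the four run configurations (model hyperparameters beyond
# test_size and random_state are assumed to be scikit-learn defaults).
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def run_once(X, y, model, test_size, random_state):
    """One run: split, fit, and return the test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

def four_runs(X, y):
    """Evaluate both algorithms under both (test_size, random_state) pairs."""
    results = {}
    for ts, rs in [(0.3, 10), (0.15, 3)]:
        results[f"NN_MLPC_{ts}_{rs}"] = run_once(X, y, MLPClassifier(), ts, rs)
        results[f"DT_{ts}_{rs}"] = run_once(X, y, DecisionTreeClassifier(), ts, rs)
    return results  # keys mirror the run names above, e.g. "NN_MLPC_0.3_10"
```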
File features are considered very important in malware [
In this work, a new method was implemented to recognize the positive and/or negative effects of features on accuracy. Since other implementations selected certain features, many of which were common, it was necessary to find out whether those features could be studied further to check their effectiveness, whether there is a possibility of grouping them in a way that yields the optimal accuracy, and, in addition, whether the lowest number of features that produces the maximum accuracy can be reached. Hence, many runs were selected for each type with a (test_size, random_state) pair. The runs were selected according to previous results in [
The main aims behind feature subgrouping were: first, to determine which added features have a positive effect on accuracy (previous work stated that adding features improved accuracy, but do all the features have this effect?); secondly, to determine whether it is possible to find a reliable set of features that can always be used to distinguish malware from benign in a portable executable file; thirdly, to determine whether the combination of the features affects the accuracy of detecting malware in that file.
It is computationally infeasible to try all combinations of the 29 features, since this would need a tremendous number of runs (there are 2^29, more than 500 million, possible subsets). Therefore, it is important to find a way to reduce the amount of work required to reach the highest accuracy values. One method is to start from initial results established previously.
The Python 3.8 scikit-learn package was used to implement the algorithms for this work, taking into consideration the specific requirements of the problem; the general flow chart is presented by
To discriminate between benign and malware files, this work combined different tools to work together as a security system.
It is important to mention that the computer used in this work is an Acer with an Intel Core i7-7500U, 2.7 GHz with Turbo Boost up to 3.5 GHz, an NVIDIA GeForce MX130 with 2 GB VRAM, and 8 GB of DDR3L memory.
Studying the data collected from previous work was necessary to differentiate useful data from the rest.
Accordingly, it was necessary to indicate what number of features produced maximum accuracy, good accuracy, or low accuracy so that indicators can be defined for the subgrouping method. According to previous observations and results obtained, it is possible to define three levels of accuracy as seen by the researchers:
- High accuracy: in NN_MLPC, 0.97 is considered high accuracy, and the maximum accuracy obtained was within this range. In the Decision Tree, things are different: DT gave accuracies within 0.97 in many cases, but also within 0.98 in many other cases; so, in DT both 0.97 and 0.98 are considered high accuracy and both are taken into consideration in this work.
- Good accuracy: in NN_MLPC, obtaining 0.95 and 0.96 is considered good accuracy.
- Low accuracy: 0.94 and below is considered low accuracy, since a low number of runs produced such accuracy or less. In DT, 0.94 or less is also considered low accuracy in comparison with the number of runs that gave higher accuracies.
To avoid running all possible combinations of the 29 features, a smaller number of features is selected for each of the four run types, i.e., the 29 features are subgrouped according to previously obtained results. For each type, the following algorithm is implemented (a sketch of this loop follows the list).
1. Define the high accuracy, good accuracy and low accuracy ranges.
2. For each classification algorithm used (i.e., NN_MLPC and DT), and according to previous observations, select the smallest number of features that produced an accuracy within the high accuracy range, creating a new set; let us name it set x.
3. According to previous observations, nominate several new features to participate in the experiments.
4. Do several runs on set x, replacing some features by others from the nominated features while keeping the number of features the same; each run replaces one feature only.
5. Add more features from the nominated set in steps and re-run the experiment.
6. Using the nominated features, replace features in steps while keeping the number of features fixed, and re-run the experiment.
7. Halt the process when the number of features reaches the number of features that previously produced the highest accuracy.
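A compact sketch of this loop is given below, under stated assumptions: evaluate() stands for one train/test run on a feature subset (e.g. run_once above fixed to one configuration), and the nominated list and halting size come from previous observations; the greedy swap-then-grow structure mirrors steps 4 to 7.

```python
def subgroup_search(base_set, nominated, max_size, evaluate):
    """Sketch of the subgrouping procedure: swap and then grow the feature
    set, keeping the best accuracy seen; `evaluate` runs one experiment."""
    best_set, best_acc = list(base_set), evaluate(base_set)

    # Steps 4 and 6: replace one feature at a time, keeping the size fixed.
    current = list(base_set)
    for new_f in nominated:
        for i in range(len(current)):
            trial = current[:i] + [new_f] + current[i + 1:]
            acc = evaluate(trial)
            if acc > best_acc:
                best_set, best_acc = trial, acc

    # Step 5: add nominated features in steps; step 7: halt at max_size,
    # the feature count that produced the highest accuracy previously.
    current = list(best_set)
    for new_f in nominated:
        if len(current) >= max_size:
            break
        if new_f not in current:
            current = current + [new_f]
            acc = evaluate(current)
            if acc > best_acc:
                best_set, best_acc = list(current), acc

    return best_set, best_acc
```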
The number of features in these sets is decided by studying previous results, i.e., those sets of features that gave higher accuracy than others. The choice of exchanged features is based on the same idea.
The procedure is halted when the number of features used in a set reaches the number of features that gave maximum accuracy in previous work, since this number is sufficient. For some types, tests were done adding more features after reaching that number; it was notable that accuracy decreased. Sometimes the decrease was slight, but the result is still less than the maximum accuracy.
Implementation of the work required the following steps:
1. Creating the set of features in an independent .csv file, to prepare it for the run.
2. Importing the .csv file into the program.
3. Making runs for both algorithms using the two combinations of test_size and random_state, totaling four runs for each .csv file.
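A sketch of these steps is shown below, assuming each feature subset is stored as a .csv whose label column is named "label" (that column name, the example file name, and the reuse of four_runs from the earlier sketch are assumptions, not details from the paper).

```python
import pandas as pd

def evaluate_csv(csv_path):
    """Steps 1-3: load one feature-subset .csv and run the four
    configurations (four_runs is the helper from the earlier sketch)."""
    df = pd.read_csv(csv_path)
    X = df.drop(columns=["label"]).values  # "label" column name is assumed
    y = df["label"].values
    return four_runs(X, y)

# e.g. evaluate_csv("set_x_11_features.csv")  # illustrative file name
```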
According to previous results in [
Note that for NN_MLPC, runs with combinations of 15, 21, 25 and 28 features were used in addition to the combinations of 11 features; combinations with fewer than 11 features were also tested to check their impact.
These steps in the number of features follow from the observations in previous work [
According to previous results [
Using
Subsequent sets of runs included 15, 18, 20, 21, 22 and 23 features, to obtain the highest accuracy value.
Previous results stated that
The total number of runs in NN_MLPC was 77; for DT it was 70.
In this work, each type has a different number of runs and different added features, because observations from previous work were used as a guide for selecting the number of features each time. Each type produced different observations with a different number of features, so the steps used in each type differ from the others.
Bar charts are used in this work to present the obtained results. This type of chart is self-explanatory and gives a better understanding. The charts are meant to show the range of accuracies obtained in the runs for each type; each chart shows how the accuracy changes in each case study. Before drawing a chart, all the accuracies obtained from the runs were sorted from the smallest to the highest value. Each chart thus shows the variation of accuracies across case studies, where each case study represents a set of features and a number of features. In all types, the last case study represents the maximum accuracy and the number of features used to attain it.
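A minimal matplotlib sketch of this chart preparation is shown below; the sample accuracies are rounded values taken from the DT_0.15_3 tables in this section, and the selection of runs is illustrative.

```python
import matplotlib.pyplot as plt

# A handful of DT_0.15_3 accuracies from the tables above (rounded).
accuracies = {"Run132": 0.7661, "Run133": 0.8307, "Run151": 0.8406,
              "Run135": 0.8409, "Run160": 0.9823}

# Sort from the smallest to the highest value, as done before drawing
# each chart; the last case study then shows the maximum accuracy.
items = sorted(accuracies.items(), key=lambda kv: kv[1])
labels, values = zip(*items)

plt.bar(labels, values)
plt.xlabel("Case study")
plt.ylabel("Accuracy")
plt.title("DT_0.15_3 (illustrative)")
plt.show()
```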
A. The lowest accuracy was obtained when features 19 to 29 were used (
B. The maximum accuracy was obtained when using 15 features (0.97875595). Note that 0.978 was also obtained with 11 features, 21 features, 25 features and 28 features, but all of those runs got a value less than the maximum.
C. Using 11 features, the first 0.97 accuracy was obtained. Not all sets of 11 features produced an accuracy of 0.97; other combinations of 11 features produced the lowest accuracy, and the same applies to sets of more than 11 features. In [
As an example, in another run where 15 features were used, the accuracy obtained was 0.95292318, whereas point B stated that the highest accuracy was obtained when a certain combination of 15 features was used.
A. The lowest accuracy was obtained using 15 features, 29 down to 15 (
B. The highest accuracy obtained was 0.98130523, using 22 features. This was the only run that gave an accuracy in the range of 0.98 among all the NN_MLPC runs; other runs with 22 features produced a lower accuracy. Even in [
C. Using 13 features, the first 0.97 was obtained; as in NN_MLPC_0.3_10, not all combinations of 13 features produced 0.97, and only one set of 13 features did.
D. In general, if most of the features used are from 20 to 29, a lower accuracy is obtained. For example, when 13 features were used (14 to 26), the accuracy obtained was 0.93847723; when 18 features were used (1 to 5 and 17 to 29), the accuracy obtained was 0.94323589; and when 23 features were used, ten of them 9 to 13 and 1 to 5 and the remaining 13 features 17 to 29, the accuracy obtained was 0.96057104.
A. The lowest accuracy was obtained using 6 features, 20 to 25 (
B. The highest accuracy, 0.98249490, was obtained using 25 features; the second highest accuracy was 0.98164514, also with 25 features.
C. The high accuracy range was between 0.971 and 0.982, and the accuracy of 0.971 was obtained using only 6 features.
D. Using 9 features, a low accuracy was obtained, but this was because the features used were selected from 20 to 29.
E. It was notable that 0.98 was repeated many times. Starting from 14 features, an accuracy of 0.98113528 was obtained, and another combination of 14 features produced an accuracy of 0.98028552. With 25 features an accuracy of 0.97824609 was obtained, while three runs before, one with 18 features and two with 22 features, obtained an accuracy of 0.98.
A. The lowest accuracy was obtained with five features, feature 21 to feature 25 (
B. The highest accuracy was obtained when using 20 features (1–7, 11–19 and 21–24, see
C. Only five features gave 0.97 accuracy; eight features gave the first accuracy at the 0.98 level. Sets of 10, 15 and 18 features also obtained 0.98, and twenty features produced the maximum accuracy.
D. Incorporating a higher number of features does not guarantee a high accuracy: while eight features produced an accuracy as high as 0.98, increasing the number in many cases reduced the accuracy level.
Run no. | No. of features | Features used | Accuracy |
---|---|---|---|
Run132 | 5 | 21–25 | 0.76614548 |
Run133 | 5 | 25–29 | 0.83072739 |
Run139 | 7 | 21–27 | 0.84058464 |
Run151 | 10 | 20–29 | 0.84058464 |
Run135 | 6 | 21–23, 26–28 | 0.840924541 |
Run147 | 9 | 21–29 | 0.84092454 |
Run no. | No. of features | Features used | Accuracy |
---|---|---|---|
Run144 | 8 | 17–24 | 0.92624065 |
Run146 | 8 | 1–4, 21–24 | 0.95955133 |
Run136 | 6 | 1–3, 21–23 | 0.96057104 |
Run140 | 8 | 1–3, 21–25 | 0.96057104 |
To give an example of how the analysis of the results of this work is done; in addition to the charts in
NN_MLPC_0.15_3 and DT_0.15_3 were chosen because these types produced the maximum accuracies for their classification algorithms.
The charts show that the accuracy changes with the features used in the set and not only with the number of features. For example, in
Observations in NN_MLPC_0.15_3 showed that a certain combination of 18 features got an accuracy of 0.9765; 20 features produced 0.976, while another combination of 20 features produced 0.978; 21 features gave 0.9769; 22 features gave 0.9786; and 23 features gave 0.9765, 0.978, 0.976 and 0.979. Most of these results are better than the maximum accuracy given by previous implementations. Also, other combinations with the same numbers of features gave less accuracy, even less than the good accuracy. This confirms the importance of the feature combination used in producing an improved accuracy.
From what is mentioned in Section 7 (results) for NN_MLPC_0.3_10 and NN_MLPC_0.15_3, point C, in addition to what was obtained from previous implementations, the first 0.97 is obtained using 11 features in NN_MLPC_0.3_10 and 13 features in NN_MLPC_0.15_3. This means that at least 11 features are needed to get such accuracy in NN_MLPC_0.3_10, and 13 features in NN_MLPC_0.15_3, which contributes to the question about the impact of the number of features used on accuracy. The same applies to all the remaining types: each type required a different number of features to get a good accuracy.
Obtaining an accuracy below 0.97 when using fewer than eleven or thirteen features for these two types indicates that the number of features used is an important factor. However, not all features contribute positively to increasing the accuracy; the features required to improve the accuracy need to be recognized, even among sets with the same number of features.
The maximum accuracy obtained in NN_MLPC was 0.9813, using 22 features in NN_MLPC_0.15_3; notably, this was the only set of this size that gave the maximum accuracy. Other sets with the same number of features gave lower accuracy values. This shows the importance of finding the correct combination to use; note that this was the only combination of features that produced an accuracy in the range of 0.98, an accuracy that no previous implementation obtained. This also applies to the NN_MLPC_0.3_10 type.
The results showed that DT reached the 0.97 accuracy range using only 5 or 6 features, while NN_MLPC needed 11 features in NN_MLPC_0.3_10 and 13 features in NN_MLPC_0.15_3 to reach 0.97. All the statistics in DT differ from those in NN_MLPC. However, the facts common to both algorithms are those answering the questions on the effects of the number of features, the specific features used and the combinations of features.
In both algorithms and all the runs implemented, features 20 to 29 reduced the accuracy when used alone or as a majority in a bigger set. On the other hand, in more than one situation, individual features, small sets or even the whole range from 20 to 29 showed a positive impact on accuracy when included in a larger set drawn from other ranges. For example, when feature number 26 was included in one of the cases, the highest accuracy was obtained; other cases showed positive contributions of individual features from 20 to 29. In general, though, using a majority or all of the features from this range gave bad accuracies.
It is important to emphasize here that features 20–29 behave differently when included in a bigger set of features: it was notable that many results in the highest accuracy range were obtained when including some or even all of features 20 to 29, but this occurred in runs where the total number of features was high and more of the features were drawn from the ranges 1 to 10 and 11 to 19. This can be observed in Section 7
The results also showed that several individual features improve the accuracy, and that some sets of features, when included within larger sets, have a positive impact on the accuracy in both NN_MLPC and DT. At the same time, the effect of these sets differs between NN_MLPC and DT. In general, these sets have the best impact; the difference between the two algorithms lies in which features to add to these sets to obtain the best accuracy.
Accordingly, it can be concluded that some individual features or sets of features may play a positive role whatever the algorithm.
In both algorithms, the region of high accuracy was entered with certain numbers of features; however, other features need to be added to reach the highest accuracy.
Run type | No. of features | Test_size | Random_state | Accuracy |
---|---|---|---|---|
NN_MLPC_0.3_10 | 15 | 0.3 | 10 | 0.978756 |
NN_MLPC_0.15_3 | 22 | 0.15 | 3 | 0.981305 |
DT_0.3_10 | 25 | 0.3 | 10 | 0.982495 |
DT_0.15_3 | 20 | 0.15 | 3 | 0.986404 |
Starting with the procedure used in this work: subgrouping the features for testing is the way this work reached the highest accuracy with a smaller number of runs and a lower time cost. Other implementations used other strategies, discussed in the Related Work section. The procedure used in this work performed well in obtaining answers to the questions mentioned previously. For the numerical part of the results, what was obtained was greater than, the same as, or slightly less than the results of previous implementations: in NN_MLPC, enhanced results were obtained, while in DT slightly lower results were obtained. Most importantly, this approach obtained the required results.
Previous implementations used different strategies to collect and study the portable executable features and achieved good accuracies. The features used in those previous works are the features collected in this work.
Run type | Accuracy (current work) | No. of features (current work) | Accuracy (previous work) | No. of features (previous work) |
---|---|---|---|---|
NN_MLPC_0.3_10 | 0.979 | 15 | 0.978 | 21 and 28 |
NN_MLPC_0.15_3 | 0.981 | 22 | 0.979 | 23 |
DT_0.3_10 | 0.982 | 25 | 0.984 | 25 |
DT_0.15_3 | 0.986 | 20 | 0.987 | 19 and 27 |
In DT_0.15_3, the 20 features used in the current work gave an accuracy lower by 0.001, while using 19 or 27 features in [
According to
Run (NN or DT with run number) | No. of features | Features used | Test_size | Random_state | Accuracy |
---|---|---|---|---|---|
NN_Run 33 | 11 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 28 | 0.3 | 10 | 0.9781 |
NN_Run 35_1 | 11 | 1,2, 3, 4, 5, 10, 11, 12, 13, 21, 28 | 0.3 | 10 | 0.9507 |
NN_Run 36_1 | 11 | 1, 2, 3, 4, 5, 20, 21, 22, 23, 24, 28 | 0.3 | 10 | 0.9198 |
NN_Run 37 | 15 | 1, 2, 3, 4, 5, 15, 16, 17, 18, 19, 21, 23, 27, 28, 29 | 0.3 | 10 | 0.9618 |
NN_Run 39 | 15 | 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 21 | 0.3 | 10 | 0.9788 |
NN_Run 40 | 15 | 15,16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 | 0.3 | 10 | 0.9172 |
NN_Run 42 | 21 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 28 | 0.3 | 10 | 0.9728 |
NN_Run 44 | 21 | 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 1, 2, 3, 5, 13,14, 15, 16, 17, 18 | 0.3 | 10 | 0.9782 |
NN_Run 46 | 21 | 1, 2, 3, 5, 6, 7, 9, 10, 11, 13,14, 15, 17, 18, 19, 12, 22, 23, 25, 26, 27 | 0.3 | 10 | 0.9758 |
NN_Run 48 | 25 | 1, 2, 3, 5, 21, 22, 23, 24, 25, 26, 27, 28, 29, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 | 0.3 | 10 | 0.9786 |
NN_Run 50 | 25 | 11, 12, 13, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 1, 2, 3, 5, 6, 7, 8, 9 | 0.3 | 10 | 0.9720 |
NN_Run 52 | 28 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 29 | 0.3 | 10 | 0.9777 |
NN_Run 56 | 13 | 25, 26, 27, 28, 29, 10, 11, 12, 13, 14, 1, 2, 3 | 0.15 | 3 | 0.9670 |
NN_Run 61 | 13 | 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25 | 0.15 | 3 | 0.9463 |
NN_Run 65 | 15 | 1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 25, 26, 27, 28, 29 | 0.15 | 3 | 0.9755 |
NN_Run 68 | 15 | 21, 22, 23, 24, 25, 11, 12, 13, 14, 15, 1, 2, 3, 4, 5 | 0.15 | 3 | 0.9718 |
NN_Run 71 | 18 | 1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 21, 22, 23, 24, 25, 9, 17, 29 | 0.15 | 3 | 0.9745 |
NN_Run 76 | 18 | 1, 2, 3, 4, 5, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 | 0.15 | 3 | 0.9432 |
NN_Run 79 | 20 | 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 25, 26, 27, 28, 29 | 0.15 | 3 | 0.9738 |
NN_Run 83 | 20 | 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22 | 0.15 | 3 | 0.9782 |
NN_Run 85 | 21 | 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 25, 26, 27, 28, 29, 9 | 0.15 | 3 | 0.9735 |
NN_Run 88 | 22 | 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 1, 2 | 0.15 | 3 | 0.9742 |
NN_Run 93 | 23 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 22, 23, 24, 25, 26, 11, 13, 28, 15, 16, 17, 18, 19, 20 | 0.15 | 3 | 0.9725 |
NN_Run 100 | 23 | 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, | 0.15 | 3 | 0.9606 |
DT_Run 102 | 6 | 10, 11, 12, 13, 14, 15 | 0.3 | 10 | 0.9786 |
DT_Run 105 | 9 | 1, 2, 3, 7, 8, 9, 15, 20, 25 | 0.3 | 10 | 0.9718 |
DT_Run 108 | 9 | 24, 25, 26, 14, 15, 16, 4, 5, 6 | 0.3 | 10 | 0.9762 |
DT_Run 111 | 14 | 1, 2, 3, 7, 8, 9, 10, 27, 28, 29, 17, 18, 19, 20 | 0.3 | 10 | 0.9740 |
DT_Run 112 | 14 | 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18 | 0.3 | 10 | 0.9803 |
DT_Run 117 | 16 | 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 | 0.3 | 10 | 0.9805 |
DT_Run 118 | 16 | 1, 2, 3, 4, 5, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 11 | 0.3 | 10 | 0.9726 |
DT_Run 120 | 18 | 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 | 0.3 | 10 | 0.9811 |
DT_Run 122 | 18 | 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 2, 14, 24 | 0.3 | 10 | 0.9810 |
DT_Run 123 | 22 | 1,2, 3, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29 | 0.3 | 10 | 0.9813 |
DT_Run 124 | 22 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24 | 0.3 | 10 | 0.9801 |
DT_Run 128 | 25 | 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 | 0.3 | 10 | 0.9805 |
DT_Run 131 | 5 | 11, 12, 13, 14, 15 | 0.15 | 3 | 0.9721 |
DT_Run 134 | 6 | 1, 2, 3, 6, 7, 8 | 0.15 | 3 | 0.9796 |
DT_Run 137 | 7 | 11, 12, 13, 15, 16, 17, 25 | 0.15 | 3 | 0.9701 |
DT_Run 141 | 8 | 1, 2, 3, 11, 12, 13, 14, 15 | 0.15 | 3 | 0.9606 |
DT_Run 149 | 9 | 1,2, 3, 11, 12, 13, 27, 28, 29 | 0.15 | 3 | 0.9745 |
DT_Run 152 | 10 | 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 | 0.15 | 3 | 0.9748 |
DT_Run 155 | 15 | 5, 6, 7, 8, 9, 15, 16, 17, 18, 19, 25, 26, 27, 28, 29 | 0.15 | 3 | 0.9748 |
DT_Run 157 | 18 | 1, 2, 3, 4, 5, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 11, 7, 29 | 0.15 | 3 | 0.9806 |
DT_Run 160 | 20 | 1, 2, 3, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25, 26, 27 | 0.15 | 3 | 0.9823 |
DT_Run 162 | 20 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 25, 26, 27, 28, 29 | 0.15 | 3 | 0.9820 |
DT_Run 166 | 22 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24 | 0.15 | 3 | 0.9810 |
DT_Run 169 | 27 | 1,2, 3, 4, 5, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 7, 9, 11, 13, 14, 6, 7 | 0.15 | 3 | 0.9837 |
It can be noticed in
Accuracies of 98% and 97% were attained in [
Column abbreviations: NN stands for NN_MLPC and DT for Decision Tree; NN_Run No. denotes an NN_MLPC run with its run number, and DT_Run No. a DT run with its run number. The remaining columns give the number of features, the features used, the test_size, the random_state, and the accuracy obtained.
Researchers in previous works stated their strategies and plans for studying Portable Executable files and selecting the features to be used. Some of the researchers conducted their studies on packed files, but the features of the PE file still played the main role in distinguishing benign from malicious files.
Twenty-nine features, previously collected in other work, were used in this study and examined using the subgrouping method to observe the accuracy each time. An independent dataset was prepared and saved for each run.
Two classification algorithms were used: NN_MLPC and DT. Four types of runs were carried out: NN_MLPC_0.3_10, NN_MLPC_0.15_3, DT_0.3_10 and DT_0.15_3. The numbers in each type represent the test_size and random_state, respectively.
The results obtained from this work have a practical part and a numerical part. The practical part comprises the observations and inferences, summarized as the factors impacting malware detection accuracy: the number of features used, the specific features included, and the combination of features used. For the numerical results, accuracies of 0.979 and 0.981 were obtained for NN_MLPC, and 0.9825 and 0.986 for DT.
The results obtained in this work were sometimes enhanced and at other times equal to or slightly less than those of previous work, which was in line with the main aim of the work: when a slightly lower accuracy is obtained, it means certain features have a negative impact on the accuracy, so those features can be studied further or excluded in the future.
The process of subgrouping showed high efficiency compared to previous implementations; using certain subsets of features in steps, according to previous observations, until the highest accuracy is obtained can reduce the time and produce good results.
In this work, it was shown that the number of features used, the specific features included and the combination of features play the major role in accuracy. Thus, using a high number of features is not by itself sufficient to obtain high accuracy; in addition, a high number of features may lead to the curse of dimensionality problem.
For future work, given the enhanced results obtained here, a deeper study of the features with a different methodology can be carried out, and more features can be incorporated to see their impact on accuracy. The idea includes employing more performance metrics such as sensitivity, specificity, precision, F-measure and/or G-mean.
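For reference, these additional metrics follow directly from the same confusion-matrix counts defined in Section 5; the sketch below uses their standard definitions, and the example counts are placeholders.

```python
import math

def extended_metrics(tp, tn, fp, fn):
    """Standard definitions of the metrics suggested for future work."""
    sensitivity = tp / (tp + fn)   # recall, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)
    return dict(sensitivity=sensitivity, specificity=specificity,
                precision=precision, f_measure=f_measure, g_mean=g_mean)

# e.g. extended_metrics(tp=480, tn=470, fp=20, fn=30)  # placeholder counts
```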
My gratitude and thanks go to Dr. Harith J. Al-Khshali (my father), who helped and supported me in this work. Unfortunately, he passed away before the final versions of the work were completed; it is a great loss to science.
Number | Feature name | Number | Feature name |
---|---|---|---|
1 | DllCharacteristics | 16 | SubSystem |
2 | MajorImageVersion | 17 | MinorImageVersion |
3 | MajorOperatingSystemVersion | 18 | SizeOfStackCommit |
4 | SizeOfStackReserve | 19 | e |
5 | AddressOfEntryPoint | 20 | e |
6 | Characteristics | 21 | e |
7 | SizeOfHeaders | 22 | Machine |
8 | SizeOfInitializedData | 23 | PointerToSymbolTable |
9 | SizeOfUninitializedData | 24 | NumberOfSymbols |
10 | MajorSubsystemVersion | 25 | Magic |
11 | MinorSubsystemVersion | 26 | SizeOfCode |
12 | CheckSum | 27 | BaseOfCode |
13 | ImageBase | 28 | SectionAlignment |
14 | MajorLinkerVersion | 29 | FileAlignment |
15 | NumberOfSections | | |