Android has been dominating the smartphone market for more than a decade and has managed to capture 87.8% of the market share. Such popularity of Android has drawn the attention of cybercriminals and malware developers. Malicious applications can steal sensitive information such as contacts, read personal messages, record calls, send messages to premium-rate numbers, cause financial loss, gain access to the gallery, and access the user’s geographic location. Numerous surveys on Android security have primarily focused on types of malware attacks, their propagation, and techniques to mitigate them. To the best of our knowledge, Android malware literature has never been explored using information modelling techniques. Further, contemporary research trends in Android malware research have never been promulgated from a semantic point of view. This paper intends to identify the intellectual core of the Android malware literature using Latent Semantic Analysis (LSA). An extensive
Android has been dominating the smartphone market for more than a decade and has managed to capture 87.8% of the market share [
These surveys have primarily focused on types of malware attacks, their propagation, and techniques to mitigate them. A few surveys also highlighted fundamental vulnerabilities of the Android platform. To the best of our knowledge, Android malware literature has never been explored using information modelling techniques. Further, contemporary research trends in Android malware have never been promulgated from a semantic point of view. This paper intends to identify the intellectual core of the Android malware literature using Latent Semantic Analysis (LSA). LSA mimics the way the human brain filters semantics from text, as it has been mathematically shown to model words, synonyms, and metaphors and thereby elaborate various semantic aspects of qualitative literature [
The rest of the paper is organized as follows: Section 2 introduces the materials and methods along with the procedure to deploy LSA. Section 3 explains the experimental results, the different topic solutions, research trends, core research areas, and their mapping. Section 4 concludes the findings.
Automated topic modelling techniques require minimal human intervention and can process thousands of articles in one go. In contrast, a manual review process requires human intervention at every step and can sometimes be biased [
This section details the methodology used to deploy LSA on the Android malware literature, including the keywords used to search for research articles and the inclusion-exclusion criteria adopted for selecting them. A manual search was performed across prominent research databases (Google Scholar, Mendeley, ACM DL, Hindawi, Taylor and Francis, IEEE, Wiley, and Scopus) using the search terms “malware” OR “vulnerability” OR “security” OR “privacy” OR “monitoring” OR “application” OR “smartphone” OR “android” OR “virus” OR “static” OR “dynamic” OR “detection” OR “data flow”, each with Android as a prefix, to identify quality literature related to Android malware. The inclusion and exclusion criteria followed for the selection of articles are mentioned in
S. No. | Inclusion criteria | Exclusion criteria |
---|---|---|
1. | Articles must be published in the time frame 2009-2019. | Articles that were focusing on an operating system other than Android, e.g., BlackBerry, Symbian, iOS, Windows were excluded. |
2. | Articles must have a focus on Android security, malware analysis, and malware detection and mitigation techniques. | Articles that did not focus on Android security, malware analysis, malware detection, mitigation techniques, or Android security threats were excluded. |
Initially, a total of 1289 abstracts and titles of articles published during 2009–2019 were collected by searching the previously mentioned keywords. From amongst the collected documents, 251 duplicate entries were removed. The remaining 1038 articles were accessed and evaluated as per the decided inclusion/exclusion criteria. Articles focusing on general malware (39), iOS (45), the Symbian operating system (50), and Windows (61) were excluded. Finally, we were left with 843 articles to be processed using LSA. Owing to the required brevity of the manuscript, the method to deploy LSA is elaborated below with the help of an example:
Assume Sample Doc1 and Sample Doc2 to be documents within a given document corpus.
Steps | Document No. | Result |
---|---|---|
After Tokenization | Doc1 | [‘Malware’, ‘application’, ‘reads’, ‘the’, ‘unique’, ‘device’, ‘identifier’, ‘to’, ‘track’, ‘the’, ‘user’, ‘s’, ‘device’, ‘Malware’, ‘applications’, ‘can’, ‘misuse’, ‘the’, ‘user’, ‘data’, ‘like’, ‘his’, ‘or’, ‘her’, ‘phone’, ‘numbers’, ‘contact’, ‘list’, ‘calendar’, ‘etc.’] |
Doc2 | [‘Applications’, ‘can’, ‘track’, ‘down’, ‘the’, ‘exact’, ‘location’, ‘of’, ‘the’, ‘user’, ‘by’, ‘finding’, ‘the’, ‘wifi’, ‘network’, ‘or’, ‘tower’, ‘it’, ‘is’, ‘connected’, ‘to’, ‘Various’, ‘malware’, ‘applications’, ‘can’, ‘record’, ‘your’, ‘daily’, ‘usage’, ‘data’, ‘and’, ‘send’, ‘it’, ‘to’, ‘servers’, ‘Applications’, ‘can’, ‘access’, ‘the’, ‘message’, ‘logs’, ‘and’, ‘misuse’, ‘them’] | |
After Normalization | Doc1 | Malware application reads the unique device identifier to track the user s device Malware applications can misuse the user data like his or her phone numbers contact list calendar etc. |
Doc2 | Applications can track down the exact location of the user by finding the wifi network or tower it is connected to Various malware applications can record your daily usage data and send it to servers Applications can access the message logs and misuse them | |
After Removing stop words | Doc1 | [‘Malware’, ‘application’, ‘reads’, ‘unique’, ‘device’, ‘identifier’, ‘track’, ‘user’, ‘device’, ‘Malware’, ‘applications’, ‘misuse’, ‘user’, ‘data’, ‘like’, ‘phone’, ‘numbers’, ‘contact’, ‘list’, ‘calendar’, ‘etc.’] |
Doc2 | [‘Applications’, ‘track’, ‘exact’, ‘location’, ‘user’, ‘finding’, ‘wifi’, ‘network’, ‘tower’, ‘connected’, ‘Various’, ‘malware’, ‘applications’, ‘record’, ‘daily’, ‘usage’, ‘data’, ‘send’, ‘servers’, ‘Applications’, ‘access’, ‘message’, ‘logs’, ‘misuse’] | |
After Stemming and Lemmatizing | Doc1 | [‘malwar’, ‘applic’, ‘read’, ‘uniqu’, ‘devic’, ‘identifi’, ‘track’, ‘user’, ‘devic’, ‘malwar’, ‘applic’, ‘misus’, ‘user’, ‘data’, ‘like’, ‘phone’, ‘number’, ‘contact’, ‘list’, ‘calendar’, ‘etc.’] |
Doc2 | [‘applic’, ‘track’, ‘exact’, ‘locat’, ‘user’, ‘find’, ‘wifi’, ‘network’, ‘tower’, ‘connect’, ‘various’, ‘malwar’, ‘applic’, ‘record’, ‘daili’, ‘usag’, ‘data’, ‘send’, ‘server’, ‘applic’, ‘access’, ‘messag’, ‘log’, ‘misus’] | |
Character Filtering | Doc1 | [‘malwar’, ‘applic’, ‘read’, ‘uniqu’, ‘devic’, ‘identifi’, ‘track’, ‘user’, ‘devic’, ‘malwar’, ‘applic’, ‘misus’, ‘user’, ‘data’, ‘like’, ‘phone’, ‘number’, ‘contact’, ‘list’, ‘calendar’] |
Doc2 | [‘applic’, ‘track’, ‘exact’, ‘locat’, ‘user’, ‘find’, ‘wifi’, ‘network’, ‘tower’, ‘connect’, ‘various’, ‘malwar’, ‘applic’, ‘record’, ‘daili’, ‘usag’, ‘data’, ‘send’, ‘server’, ‘applic’, ‘access’, ‘messag’, ‘misus’] |
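The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the exact toolchain used in this study: the stop-word list is a small inline subset, and a crude suffix-stripper stands in for a proper Porter stemmer (which would also produce stems such as ‘malwar’ and ‘devic’).

```python
import re

# Minimal sketch: tokenization -> normalization -> stop-word removal -> stemming.
STOP_WORDS = {"the", "to", "s", "can", "his", "or", "her", "down", "of",
              "by", "it", "is", "and", "them", "your", "etc"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop common function words that carry no topical meaning."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Crude stand-in for Porter stemming: strip a few common suffixes."""
    for suffix in ("ations", "ation", "ings", "ing", "ions", "ion", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

doc1 = ("Malware application reads the unique device identifier to track "
        "the user s device")
tokens = [stem(t) for t in remove_stop_words(tokenize(doc1))]
print(tokens)
```

Run on the first sentence of Sample Doc1, the pipeline yields `['malware', 'applic', 'read', 'unique', 'device', 'identifier', 'track', 'user', 'device']`; a full Porter stemmer would further reduce e.g. ‘malware’ to ‘malwar’.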
As the task is to mine the relevant terms that provide useful or quality information about each document, the document-term matrix has to be replaced by Term-Frequency/Inverse-Document-Frequency (TF-IDF) weights for further processing. The TF-IDF weight is a measure that interprets the importance of a term to a document within a large collection of documents.
Term | Doc1 | Doc2 | Term | Doc1 | Doc2 | Term | Doc1 | Doc2 |
---|---|---|---|---|---|---|---|---|
applic | 2 | 2 | exact | 0 | 1 | misus | 1 | 0 |
calendar | 1 | 0 | find | 0 | 1 | network | 0 | 1 |
connect | 0 | 1 | identifi | 1 | 0 | number | 1 | 0 |
contact | 1 | 0 | like | 1 | 0 | phone | 1 | 0 |
daili | 0 | 1 | list | 1 | 0 | read | 1 | 0 |
data | 1 | 1 | locat | 0 | 1 | record | 0 | 1 |
devic | 2 | 0 | malwar | 2 | 1 | send | 0 | 1 |
server | 0 | 1 | uniqu | 1 | 0 | various | 0 | 1 |
tower | 0 | 1 | usag | 0 | 1 | wifi | 0 | 1 |
track | 1 | 1 | user | 2 | 1 |
Documents | Term frequency scores |
---|---|
Doc1 | {‘malwar’: 0.1, ‘applic’: 0.1, ‘read’: 0.05, ‘uniqu’: 0.05, ‘devic’: 0.1, ‘identifi’: 0.05, ‘track’: 0.05, ‘user’: 0.1, ‘misus’: 0.05, ‘data’: 0.05, ‘like’: 0.05, ‘phone’: 0.05, ‘number’: 0.05, ‘contact’: 0.05, ‘list’: 0.05, ‘calendar’: 0.05} |
Doc2 | {‘applic’: 0.105, ‘track’: 0.053, ‘exact’: 0.053, ‘locat’: 0.053, ‘user’: 0.053, ‘find’: 0.053, ‘wifi’: 0.053, ‘network’: 0.053, ‘tower’: 0.053, ‘connect’: 0.053, ‘various’: 0.053, ‘malwar’: 0.053, ‘record’: 0.053, ‘daili’: 0.053, ‘usag’: 0.053, ‘data’: 0.053, ‘send’: 0.053, ‘server’: 0.053} |
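The term frequency scores above (e.g. 0.1 for ‘malwar’ in Doc1, which contains 20 tokens after preprocessing) follow from dividing each term's raw count by the document length; a short sketch:

```python
from collections import Counter

def term_frequency(tokens):
    """TF of each term: raw count divided by the document's token count."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# Preprocessed Sample Doc1 (20 tokens, as in the worked example above).
doc1 = ["malwar", "applic", "read", "uniqu", "devic", "identifi", "track",
        "user", "devic", "malwar", "applic", "misus", "user", "data", "like",
        "phone", "number", "contact", "list", "calendar"]
tf = term_frequency(doc1)
print(tf["malwar"], tf["read"])  # 0.1 (2/20) and 0.05 (1/20)
```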
Terms | IDF score | Terms | IDF score | Terms | IDF score | Terms | IDF score |
---|---|---|---|---|---|---|---|
applic | 1.000000 | list | 1.405465 | tower | 1.405465 | find | 1.405465 |
calendar | 1.405465 | locat | 1.405465 | track | 1.000000 | identifi | 1.405465 |
connect | 1.405465 | malwar | 1.000000 | uniqu | 1.405465 | like | 1.405465 |
contact | 1.405465 | misus | 1.405465 | usag | 1.405465 | record | 1.405465 |
daili | 1.405465 | network | 1.405465 | user | 1.000000 | send | 1.405465 |
data | 1.000000 | number | 1.405465 | various | 1.405465 | ||
devic | 1.405465 | phone | 1.405465 | wifi | 1.405465 | ||
exact | 1.405465 | read | 1.405465 | server | 1.405465 |
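The IDF scores above (1.405465 for terms occurring in only one of the two documents, 1.000000 for terms occurring in both) are consistent with the smoothed IDF formulation used by common toolkits such as scikit-learn, idf(t) = ln((1 + N)/(1 + df(t))) + 1. This identification is an inference from the reported numbers, not a formula stated explicitly in the text:

```python
import math

def smoothed_idf(df, n_docs):
    """Smoothed IDF as used by scikit-learn's TfidfVectorizer:
    idf(t) = ln((1 + N) / (1 + df(t))) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

# With N = 2 documents:
print(round(smoothed_idf(1, 2), 6))  # term in one document  -> 1.405465
print(round(smoothed_idf(2, 2), 6))  # term in both documents -> 1.0
```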
Hence, after the calculation of TF and IDF scores, the final document-term matrix with TF-IDF scores is calculated with the following formula:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \ln\left(\frac{1 + N}{1 + \mathrm{df}(t)}\right) + 1$$

where t denotes the terms; d denotes each document; df(t) denotes the number of documents containing t; N denotes the total number of documents. Consider
Here, A represents the TF-IDF matrix (t terms × d documents), U represents the term-to-concept matrix describing associations between terms and concepts, V represents the document-to-concept matrix describing associations between documents and concepts, and ∑ represents a diagonal matrix with non-negative real numbers arranged in descending order. These diagonal values represent the relative strength of each concept. The number of concepts (also known as topics) cannot exceed the total number of documents; rather, its value is adjusted to develop a latent semantic representation of the original matrix. Let d be the total number of documents, t the total number of terms across all documents, and k the hyperparameter indicating the number of topics to be extracted from the textual data. Ak, the rank-k approximation of matrix A, is produced using truncated SVD as

$$A_k = U_k \Sigma_k V_k^{T}$$

where Uk is the term-to-topic matrix (t × k), Vk is the document-to-topic matrix (d × k), and ∑k is the k × k diagonal matrix containing the k largest singular values.
Terms | Doc1 | Doc2 | Terms | Doc1 | Doc2 | Terms | Doc1 | Doc2 |
---|---|---|---|---|---|---|---|---|
applic | 0.309883 | 0.344626 | like | 0.217765 | 0.000000 | send | 0.000000 | 0.242180 |
calendar | 0.217765 | 0.000000 | list | 0.217765 | 0.000000 | server | 0.000000 | 0.242180 |
connect | 0.000000 | 0.242180 | locat | 0.000000 | 0.242180 | tower | 0.000000 | 0.242180 |
contact | 0.217765 | 0.000000 | malwar | 0.309883 | 0.172313 | track | 0.154942 | 0.172313 |
daili | 0.000000 | 0.242180 | misus | 0.217765 | 0.000000 | uniqu | 0.217765 | 0.000000 |
data | 0.154942 | 0.172313 | network | 0.000000 | 0.242180 | usag | 0.000000 | 0.242180 |
devic | 0.435530 | 0.000000 | number | 0.217765 | 0.000000 | user | 0.309883 | 0.172313 |
exact | 0.000000 | 0.242180 | phone | 0.217765 | 0.000000 | various | 0.000000 | 0.242180 |
find | 0.000000 | 0.242180 | read | 0.217765 | 0.000000 | wifi | 0.000000 | 0.242180 |
identifi | 0.217765 | 0.000000 | record | 0.000000 | 0.242180 | | | |
Terms | Topic 1 | Topic 2 | Terms | Topic 1 | Topic 2 | Terms | Topic 1 | Topic 2 |
---|---|---|---|---|---|---|---|---|
applic | 0.411164 | −0.028694 | like | 0.136800 | 0.179853 | send | 0.152138 | −0.200017 |
calendar | 0.136800 | 0.179853 | list | 0.136800 | 0.179853 | server | 0.152138 | −0.200017 |
connect | 0.152138 | −0.200017 | locat | 0.152138 | −0.200017 | tower | 0.152138 | −0.200017 |
contact | 0.136800 | 0.179853 | malwar | 0.302917 | 0.113620 | track | 0.205582 | −0.014347 |
daili | 0.152138 | −0.200017 | misus | 0.136800 | 0.179853 | uniqu | 0.136800 | 0.179853 |
data | 0.205582 | −0.014347 | network | 0.152138 | −0.200017 | usag | 0.152138 | −0.200017 |
devic | 0.273601 | 0.359705 | number | 0.136800 | 0.179853 | user | 0.302917 | 0.113620 |
exact | 0.152138 | −0.200017 | phone | 0.136800 | 0.179853 | various | 0.152138 | −0.200017 |
find | 0.152138 | −0.200017 | read | 0.136800 | 0.179853 | wifi | 0.152138 | −0.200017 |
identifi | 0.136800 | 0.179853 | record | 0.152138 | −0.200017 | | | |
Documents | Topic 1 | Topic 2 |
---|---|---|
Doc1 | 0.795922 | 0.605399 |
Doc2 | 0.795922 | −0.605399 |
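Assuming scikit-learn's default TF-IDF settings (smoothed IDF, l2 normalisation) and its TruncatedSVD, the two-document example can be reproduced end to end. Note two assumptions: scikit-learn arranges documents as rows rather than columns, and Doc2 is given here as counted in the document-term matrix above:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Preprocessed token streams from the worked example.
doc1 = ("malwar applic read uniqu devic identifi track user devic malwar "
        "applic misus user data like phone number contact list calendar")
doc2 = ("applic track exact locat user find wifi network tower connect "
        "various malwar applic record daili usag data send server")

# Smoothed, l2-normalised TF-IDF (scikit-learn defaults): 2 docs x 29 terms.
tfidf = TfidfVectorizer()
A = tfidf.fit_transform([doc1, doc2])

# Two-topic truncated SVD; transform() returns the document-to-topic
# loadings U_k * Sigma_k.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topic = svd.fit_transform(A)
print(np.round(np.abs(doc_topic), 6))
```

The magnitudes of the resulting document-to-topic loadings match the table above (≈0.795922 and ≈0.605399); the signs may differ, since an SVD is defined only up to the sign of each singular vector.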
Using procedure detailed in Section 2 and weighting scheme as given in
The weighted TF-IDF matrix obtained after the preprocessing steps was provided to SVD to perform rank lowering. Geometrically, the SVD model decomposes the matrix into three transformations:
U: Initial rotation
∑: Scaling
V: Final rotation
The expressions XXᵀ and XᵀX provide the term loadings and the document loadings, respectively. ∑∑ᵀ represents the weights of the topics in descending order. The maximum number of topics generated was equal to the number of documents in the corpus.
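These relationships can be checked numerically on a toy matrix (a generic numpy sketch, not tied to the corpus used in this study): the nonzero eigenvalues of XXᵀ and XᵀX both equal the squared singular values, i.e. the diagonal of ∑∑ᵀ, whose eigenvectors give the term and document loadings.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((6, 4))            # toy term-document matrix: 6 terms x 4 docs

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Nonzero eigenvalues of X X^T (term side) and X^T X (document side)
# both equal the squared singular values -- the topic weights.
term_side = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1][:4]
doc_side = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1][:4]
print(np.allclose(term_side, s**2), np.allclose(doc_side, s**2))
```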
Optimal topic loadings come from dimensionality reduction, which involves obtaining the k most informative dimensions from the term matrix. Selecting the optimal number of topics is difficult because it requires an understanding of the corpus and several trial procedures to arrive at a favourable value [
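One common heuristic for picking k, sketched below, keeps the smallest number of topics whose squared singular values capture a chosen fraction of the total energy. This heuristic is an illustrative assumption; the study itself compares fixed five-, ten-, and twenty-topic solutions.

```python
import numpy as np

def choose_k(singular_values, threshold=0.9):
    """Smallest k whose topics capture `threshold` of the total
    squared-singular-value 'energy' of the decomposition."""
    energy = singular_values ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    return int(np.searchsorted(cumulative, threshold) + 1)

# Hypothetical singular values in descending order.
s = np.array([10.0, 6.0, 3.0, 1.0, 0.5])
print(choose_k(s))  # the first two topics already carry >90% of the energy
```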
Topic No. | Topic label | Loading terms |
---|---|---|
T5.1 | Application Structure Analysis | component, intent, receiver, broadcast, leak, vulnerability, string, class, object, activity, privilege, developer, sensitive, analysis, content, explicit, application, action, android, permission |
T5.2 | Static Level Monitoring | signature, bytecode, graph, context, dalvik, flow, permission, component, control, library, program, service, method, object, entry, event, field, code, data, path |
T5.3 | Automatic Malware Analysis | machine, accuracy, dataset, class, performance, positive, family, false, application, similarity, proceeding, experiment, signature, pattern, dynamic, android, vector, static, score, graph |
T5.4 | Hybrid Level Monitoring | dynamic, analysis, static, cloud, taint, application, instruction, execution, component, sensitive, bytecode, android, library, program, native, object, string, dalvik, class, event |
T5.5 | Dynamic Level Monitoring | kernel, privilege, escalation, policy, control, enforcement, security, exploit, memory, vulnerability, library, native, context, component, Linux, mechanism, access, resource, sandbox, virtual |
The results in
In the
Terms | Doc1 | Doc2 | Doc3 | Doc4 | Doc5 | … | Doc843 |
---|---|---|---|---|---|---|---|
kernel | 0.000000 | 0.000000 | 0.346366 | 0.000000 | 0.589463 | … | 0.000000 |
dynamic | 0.176043 | 0.160859 | 0.170893 | 0.165134 | 0.280882 | … | 0.164157 |
machine | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | … | 0.344502 |
accuracy | 0.000000 | 0.000000 | 0.000000 | 0.346553 | 0.000000 | … | 0.000000 |
graph | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | … | 0.344502 |
The ten-topic solution as shown in
Considering the twenty-topic solution, the distribution of articles clearly shows the prominent research trends in Android security research: “Machine Learning Approach” (T20.12), “Data Flow Tracking” (T20.7), “Context Monitoring” (T20.16), and “Kernel Level Check” (T20.6) emerged as the most explored topics. This is consistent with the five- and ten-topic solutions.
The research trend “Machine Learning Approach” (T20.12) has been one of the most explored topics over the last few years: Android applications are analyzed both statically and dynamically to collect a set of features, on which the system is trained to make decisions about unknown samples of malicious applications. Machine learning methods were used in [
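As a minimal illustration of this trend, a classifier can be trained on statically extracted features. The permission flags below are hypothetical examples chosen for illustration, not data from any study cited here:

```python
from sklearn.tree import DecisionTreeClassifier

# Columns: hypothetical permission flags, e.g.
# [SEND_SMS, READ_CONTACTS, INTERNET, ACCESS_FINE_LOCATION]
X_train = [
    [1, 1, 1, 0],  # known malicious samples
    [1, 0, 1, 1],
    [0, 0, 1, 0],  # known benign samples
    [0, 1, 1, 0],
]
y_train = [1, 1, 0, 0]  # 1 = malware, 0 = benign

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

unknown_app = [[1, 0, 1, 0]]  # requests SEND_SMS and INTERNET
print(clf.predict(unknown_app))
```

In this toy data the SEND_SMS flag perfectly separates the classes, so the tree learns that single split; real systems use hundreds of static and dynamic features and far larger labelled datasets.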
Topic No. | Five-topic label | Topic No. | Twenty-topic label |
---|---|---|---|
T5.1 | Application Structure Analysis | T20.2 | App Level Analysis |
T5.2 | Static Level Monitoring | T20.5 | Permission Based Analysis |
T20.7 | Data Flow Tracking | ||
T20.11 | Dex File Study | ||
T20.14 | Component Based Study | ||
T20.15 | Syntactic and Semantic Pattern | ||
T20.16 | Context Monitoring | ||
T20.17 | Feature Based Analysis | ||
T5.3 | Automatic Malware Analysis | T20.4 | Input Matching |
T20.12 | Machine Learning Approach | ||
T20.19 | Repackaged App Identification | ||
T20.20 | Formal Analysis | ||
T5.4 | Hybrid Level Monitoring | T20.1 | Obfuscated Code Analysis |
T20.3 | Hybrid Analysis | ||
T20.9 | Dynamic Code Loading | ||
T20.10 | Emulator Based Analysis | ||
T20.13 | Flow Monitoring | ||
T20.18 | Dalvik Byte Code Analysis | ||
T5.5 | Dynamic Level Monitoring | T20.6 | Kernel Level Check |
T20.8 | Classification Based on App Behaviour |
This section details the mapping of the five-topic labels onto the twenty-topic solution. The initial mapping of Topic (T5.1) “Application Structure Analysis” has a clear overlap with “App Level Analysis” (T20.2). It uncovered the use of metadata and features of an Android application to detect and analyze Android malware. Metadata can be characterized as the information displayed before downloading and installing an Android application, e.g., required permissions, description, version, last updated, rating, number of installations, and developer information. This trend was seen in the project named WHYPER [
The results of this study revealed that “Static Level Monitoring” (T5.2) proved to be the most widely investigated topic in Android malware research. Studies related to static analysis focus mainly on network addresses, data flow tracking, control flow graphs, string matching, permissions, dex files, context, and intents. Studies also focused on behavioral and structural analysis to extend coverage to advanced malware applications; kernel-level analysis, API call monitoring, and taints were their major highlights. Researchers identified that combining structural and behavioral features produces richer and more robust analysis, and numerous studies support hybrid techniques to detect destructive payloads. Though studies on automated malware analysis and supplementary techniques were comparatively fewer in number, they are effective enough to produce promising results.
To maintain effective interpretation and comparison among the topics, the change of focus across two time frames, 2009–2013 and 2014–2019, was observed, depicting the paradigm shift from the 2009–2013 window to 2014–2019. Machine learning approaches were found to be effective among competing approaches to detect Android malware; they were well explored and promising during 2014–2019. The trend of kernel-level checks also greatly influenced the research community during 2014–2019: applications were inspected at the kernel level in order to understand their real-time behavior. Detection of piggybacked applications through sensitive graph analysis/data tracking, followed by the usage of machine learning algorithms, was widely studied during 2014–2019. Application permissions have remained the topmost static features for detecting Android malware; they were widely investigated throughout 2009–2019, as they pose the first barrier to malware authors. To activate certain events in the Android ecosystem, specific permissions must be declared in the manifest file. Because malicious applications can hide their actual behavior behind the user interface, it became cumbersome to analyze all possible paths or inputs while performing sandbox analysis. During 2014–2019, smart interaction solutions came to light, focusing on generating activity and function call graphs using static analysis and exploring paths using dynamic analysis.
Reviewing literature manually may result in biased and incomplete inferences. This work systematically analyses a large corpus of Android malware literature
This investigation also identifies new future dimensions for researchers. The results of this study will help others choose areas of interest for potential research along with the associated research trends. The most impactful aspect of this work is that researchers can apply the same methodology to other research fields with little or no change.