Malware detection using static analysis with PCA, mRMR and machine learning
Abstract
Malicious software (malware) is software written with malicious intent that harms computer systems. The volume of new malware is increasing rapidly and, despite the use of anti-malware software, timely detection remains a challenge, with consequences that can include losses valued in millions of dollars. Most anti-malware software today uses signature-based detection to protect legitimate users from malware attacks. Signatures are byte sequences that uniquely identify malicious software. However, this method fails to detect new types of malware, and new variants of existing malware, for which no signatures exist in the signature databases. To address the shortcomings of signature-based detection, researchers have proposed statistical-based detection, which uses statistical properties of program features, and dynamic-based detection, which monitors the behavior of programs during execution. These techniques are used in conjunction with machine learning models trained on the selected features. Selecting individually good features does not necessarily translate into optimal classification results, because features that are each informative can be redundant with one another. There is therefore a need to select optimal sets of features for building the machine learning models used to detect unknown malware. In this research, we evaluate two dimensionality reduction algorithms, Principal Component Analysis (PCA) and Minimum Redundancy Maximum Relevance (mRMR), for selecting optimal feature sets for building machine learning models that detect unknown malware. We evaluate different sets of features to determine the most parsimonious model with the lowest classification error. We show that the highest area under the receiver operating characteristic curve (AUC) was 91%, achieved with the Decision Tree classifier using 20 features selected with mRMR.
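The point that individually good features need not form a good set can be illustrated with a minimal sketch of greedy mRMR selection. This is not the implementation used in this research: it assumes discrete features, scores candidates with the difference form (relevance minus mean redundancy, both measured by mutual information), and runs on a tiny made-up dataset. The names `mutual_information`, `mrmr_select`, `f1`, `f1_copy` and `f2` are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(columns, labels, k):
    """Greedily pick k feature names by relevance minus mean redundancy.

    columns: dict mapping feature name -> list of discrete values.
    The difference-form score is an assumption of this sketch.
    """
    relevance = {
        name: mutual_information(col, labels) for name, col in columns.items()
    }
    selected = []
    remaining = sorted(columns)  # sorted so that ties break deterministically

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(columns[name], columns[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up toy data: 1 = malware, 0 = benign (hypothetical values).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "f1": [0, 0, 0, 0, 1, 1, 1, 0],       # strongly relevant
    "f1_copy": [0, 0, 0, 0, 1, 1, 1, 0],  # equally relevant, fully redundant
    "f2": [0, 1, 0, 0, 1, 1, 0, 1],       # weaker, but adds new information
}
selected = mrmr_select(columns, labels, 2)
print(selected)
```

In this toy case, ranking by relevance alone would pick `f1` and its duplicate, whereas the redundancy penalty leads the sketch to prefer `f2` as the second feature.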