KuafuDet    
Automated poisoning attacks and defenses in malware detection systems: An adversarial machine learning approach

Overview

Today, sophisticated attackers can sabotage machine-learning classifiers by polluting their training data, rendering recent machine-learning-based malware detection tools (such as Drebin and DroidAPIMiner) ineffective.

We explore the feasibility of constructing crafted malware samples, examine how machine-learning classifiers can be misled under three different threat models, and conclude that injecting carefully crafted data into the training set can significantly reduce detection accuracy.

We propose KuafuDet, a two-phase learning enhancing approach that detects mobile malware under adversarial conditions. KuafuDet consists of an offline training phase that selects and extracts features from the training set, and an online detection phase that applies the classifier trained in the first phase.

To further address the adversarial environment, these two phases are intertwined through a self-adaptive learning scheme, wherein an automated camouflage detector is introduced to filter suspicious false negatives and feed them back into the training phase. We show that KuafuDet significantly reduces false negatives and boosts detection accuracy by at least 15%.
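The feedback loop described above can be sketched as follows. This is a toy illustration, not KuafuDet's actual implementation: the one-dimensional "features", the threshold classifier, and the similarity-based camouflage check are all illustrative stand-ins.

```python
# Minimal sketch of the two-phase, self-adaptive detection loop.
# All names and heuristics here are illustrative, not the paper's code.

def train_classifier(samples):
    """Offline phase: learn a toy decision threshold from labeled samples
    (each sample is a (feature, label) pair, label 1 = malicious)."""
    malicious = [f for f, label in samples if label == 1]
    return sum(malicious) / len(malicious) / 2  # toy decision boundary

def classify(threshold, feature):
    """Online phase: flag a sample as malicious if it crosses the threshold."""
    return 1 if feature >= threshold else 0

def camouflage_score(feature, known_malicious):
    """Similarity check: distance to the closest known-malicious sample."""
    return min(abs(feature - m) for m in known_malicious)

def self_adaptive_round(samples, incoming, sim_cutoff=1.0):
    """One feedback iteration: detect, filter suspicious negatives
    (negatives that still look similar to known malware), retrain."""
    threshold = train_classifier(samples)
    known_malicious = [f for f, label in samples if label == 1]
    for feature in incoming:
        if classify(threshold, feature) == 0:
            if camouflage_score(feature, known_malicious) <= sim_cutoff:
                samples.append((feature, 1))  # feed back into training
    return train_classifier(samples)
```

A camouflaged sample that slips under the current threshold but remains similar to known malware is relabeled and pulled back into the training set, so the next round's classifier catches it.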

Contact: ecnuchensen AT gmail.com


Publication

Automated poisoning attacks and defenses in malware detection systems: An adversarial machine learning approach

Sen Chen, Minhui Xue, Lingling Fan, Shuang Hao, Lihua Xu, Haojin Zhu, and Bo Li   Elsevier Computers & Security (accepted)


Policy

We make the malicious Android applications used in the paper, as well as our experimental results, publicly available to other researchers.

If you are interested in getting access to our experimental results, please read the following instructions carefully.

(1) If you are currently in academia:

(a) If you are a student, please ask your advisor to send us an email requesting access. If you are a faculty member, please send us the email from your university email account.

(b) In your email, please include your name, affiliation, and homepage.

(2) If you are currently in industry:

(a) Please send us an email from your company's email account. In the email, please briefly introduce yourself (e.g., name) and your company.

Please send your request emails to Sen Chen (ecnuchensen AT gmail.com).


Dataset

Our dataset (252,900 APKs) consists of 242,500 benign applications downloaded from the Google Play Store and 10,400 malicious APK files, of which 1,260 have been validated by the Genome project and the remainder come from Drebin (4,300 APKs), Pwnzen Infotech Inc., and Contagio (340 APKs).
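A quick sanity check on the composition above; the per-source count for Pwnzen Infotech Inc. is not stated explicitly, so it is derived here by subtraction from the totals.

```python
# Dataset composition sanity check (counts taken from the text).
benign = 242_500
malicious = 10_400
genome, drebin, contagio = 1_260, 4_300, 340

# Pwnzen's share is not stated; derive it from the malicious total.
pwnzen = malicious - genome - drebin - contagio

total = benign + malicious  # should equal the stated 252,900 APKs
```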

The dataset was collected in strict accordance with the Privacy Policy of Pwnzen Infotech Inc. and conforms to its non-disclosure agreement (NDA). What we can release, however, is the malicious dataset from Contagio, which we used as a subset of our experiments.

Features

The features considered in this study are classified into two categories: syntax features (175) and semantic features (20).
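Assembling the resulting 195-dimensional feature vector can be sketched as below. The feature names are placeholders; the paper defines the concrete syntax and semantic feature sets.

```python
# Sketch of mapping an APK's extracted attributes onto the fixed
# 195-slot vector (175 syntax + 20 semantic features).
# Feature names here are placeholders, not the paper's feature list.

SYNTAX_FEATURES = [f"syntax_{i}" for i in range(175)]
SEMANTIC_FEATURES = [f"semantic_{i}" for i in range(20)]

def to_feature_vector(extracted):
    """Map extracted attributes (a name -> value dict) onto the fixed
    feature vector; features absent from the APK default to 0."""
    names = SYNTAX_FEATURES + SEMANTIC_FEATURES
    return [extracted.get(name, 0) for name in names]
```

Keeping the slot order fixed ensures that vectors from different APKs are directly comparable by the classifier.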

Training files (camouflaged)

We model poisoning attacks as mounted by three types of attackers (i.e., weak, strong, and sophisticated) in the real world. We show that our poisoning attack is able to mislead machine-learning classifiers (e.g., SVM, RF, KNN).