Big data, machine learning and malware detection

Speaker: Ling Huang, DataVisor, Inc.
Time: 2016-05-24 14:30 to 2016-05-24 15:30
Venue: FIT 1-222

We present and evaluate a big data system for large-scale malware detection that integrates machine learning with expert reviewers, treating reviewers as a limited labeling resource. The system consists of three major components: a big data behavioral analytics platform for malware feature engineering, an ensemble of supervised learning models, and a mechanism for obtaining feedback from expert reviewers. We demonstrate that reviewers, even in small numbers, can vastly improve the system’s ability to keep pace with evolving threats.
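To illustrate what treating reviewers as a limited labeling resource can look like, here is a minimal sketch that spends a fixed review budget on the binaries the model is least sure about. The abstract does not say how the system chooses which binaries to send to reviewers; the uncertainty-based rule, the function name `select_for_review`, and the 0.5 decision threshold are all assumptions made for illustration.

```python
def select_for_review(scores, budget, threshold=0.5):
    """Pick up to `budget` sample indices whose scores lie closest to `threshold`.

    Hypothetical selection rule (uncertainty sampling); not taken from the talk.
    `scores` are model outputs in [0, 1], where higher means more likely malicious.
    """
    # Rank samples by distance from the decision threshold, most uncertain first.
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i] - threshold))
    return ranked[:budget]


# Five binaries scored by the model, with a budget of two reviews.
scores = [0.02, 0.48, 0.97, 0.55, 0.30]
picked = select_for_review(scores, budget=2)
print(picked)  # indices of the two binaries sent to expert reviewers
```

The labels returned by reviewers for the selected binaries would then be added to the training data before the models are retrained.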

We conduct our evaluation on a sample of VirusTotal submissions spanning 2.5 years and containing 1.1 million binaries with 778GB of raw feature data. Without reviewer assistance, we achieve 72% detection at a 0.5% false positive rate, performing comparably to the best vendors on VirusTotal. Given a budget of 80 accurate reviews daily, we improve detection to 89% and are able to detect 42% of the malicious binaries that went undetected upon initial submission to VirusTotal.
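A figure such as "72% detection at a 0.5% false positive rate" fixes the share of benign binaries that may be flagged and then reports the share of malicious binaries caught at that operating point. The sketch below shows one way to compute such a number from model scores; the function name `detection_at_fpr` and the toy scores are hypothetical, and this is not necessarily the evaluation procedure used in the talk.

```python
def detection_at_fpr(benign_scores, malicious_scores, max_fpr):
    """Detection rate when at most `max_fpr` of benign samples are flagged.

    Hypothetical helper for illustration. A sample is flagged when its score
    is strictly greater than the chosen cutoff.
    """
    if not benign_scores or not malicious_scores:
        raise ValueError("both score lists must be non-empty")
    benign_sorted = sorted(benign_scores, reverse=True)
    # Number of benign samples we may flag without exceeding max_fpr.
    # (Floating-point rounding can matter when max_fpr * n is near an integer.)
    allowed = int(max_fpr * len(benign_sorted))
    if allowed >= len(benign_sorted):
        cutoff = float("-inf")  # every sample may be flagged
    else:
        cutoff = benign_sorted[allowed]
    detected = sum(1 for s in malicious_scores if s > cutoff)
    return detected / len(malicious_scores)


# Toy scores: four benign and four malicious binaries.
benign = [0.1, 0.2, 0.3, 0.9]
malicious = [0.95, 0.5, 0.2, 0.35]
rate = detection_at_fpr(benign, malicious, max_fpr=0.25)
print(rate)
```

With these toy scores, one of the four benign binaries is allowed to be flagged, and the detection rate is measured at the resulting cutoff.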


Ling Huang is the Director of Data Science at DataVisor, Inc. His research and engineering background is in big data, machine learning, computer vision and security analytics, with a focus on large-scale machine learning pipelines for user categorization, risk modeling, image processing, natural language processing, fake account/spam/fraud detection, and malware classification. Ling Huang was a senior research scientist affiliated with the Intel ISTC on Secure Computing from May 2011 to May 2014, and a research scientist at Intel Labs Berkeley from October 2007 to May 2011. He pursued his Ph.D. in Computer Science at the University of California at Berkeley from 2002 to 2007. During his Ph.D. studies, he was affiliated with RadLab.