The Web and online social media provide invaluable communication services to a global Internet user base. The tremendous success of these services, however, has also created valuable opportunities for criminals and other miscreants to abuse them for their own gain. As a result, detecting, monitoring, and curtailing this abuse is an important yet challenging problem. The large scale and diversity of these services, combined with the tactics used by attackers, make it difficult to find a single clear and robust signal for detecting abuse. One approach, relying on domain expertise, is to construct a small set of well-crafted heuristics, but such heuristics tend to become obsolete rapidly. In this talk, I will describe more robust approaches based on machine learning, statistical modeling, and large-scale data analytics.
First, I will describe online learning approaches for detecting malicious Web sites (those involved in criminal scams) using lexical and host-based features of the associated URLs. This application is particularly well suited to online algorithms because the training data is too large to process efficiently in batch and because the features that typify malicious URLs evolve continuously. Motivated by this application, we built a real-time system to gather URL features and analyze them against a source of labeled URLs from a large Web mail provider. Our system adapts in an online fashion to the evolving characteristics of malicious URLs, achieving daily classification accuracies up to 99% over a balanced data set.
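To make the online setting concrete, the following is a minimal sketch of one way such a classifier could be structured. It is not the system described in the talk. It assumes a plain online logistic regression trained by stochastic gradient descent, uses only hashed lexical tokens (no host-based features), and runs on a handful of made-up URLs under the reserved `.example` domain. The names `lexical_features`, `OnlineLogisticRegression`, and `run_stream` are hypothetical.

```python
import math
import re
import zlib

NUM_BUCKETS = 2 ** 18  # assumed size of the hashed feature space


def lexical_features(url):
    """Split a URL into lowercase alphanumeric tokens and hash each to a bucket."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    return {zlib.crc32(t.encode("utf-8")) % NUM_BUCKETS for t in tokens if t}


class OnlineLogisticRegression:
    """Logistic regression updated one example at a time (illustrative only)."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.weights = {}  # sparse weights: bucket index -> weight
        self.bias = 0.0

    def predict_proba(self, features):
        score = self.bias + sum(self.weights.get(i, 0.0) for i in features)
        score = max(-30.0, min(30.0, score))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, features, label):
        # Gradient of the log loss for binary features is (p - y) per active feature.
        error = self.predict_proba(features) - label
        self.bias -= self.learning_rate * error
        for i in features:
            self.weights[i] = self.weights.get(i, 0.0) - self.learning_rate * error


def run_stream(model, stream):
    """Predict each URL before learning from its label; return the mistake count."""
    mistakes = 0
    for url, label in stream:
        features = lexical_features(url)
        predicted = 1 if model.predict_proba(features) >= 0.5 else 0
        mistakes += int(predicted != label)
        model.update(features, label)
    return mistakes


# Made-up labeled stream: 1 = malicious, 0 = benign.
STREAM = [
    ("http://paypal-login.secure-update.bad.example/verify/account", 1),
    ("https://www.good.example/docs/index.html", 0),
    ("http://free-prizes.bad.example/claim/login/verify", 1),
    ("https://news.good.example/2010/article.html", 0),
]

if __name__ == "__main__":
    model = OnlineLogisticRegression()
    for day in range(5):  # each pass stands in for one day of labeled traffic
        print("pass", day, "mistakes:", run_stream(model, STREAM))
```

Because each URL is scored before its label is used for the update, the running mistake count measures how well the model tracks the stream as it evolves, and the sparse weight dictionary keeps memory proportional to the features actually seen rather than to the size of the full data set.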
Justin Ma is a postdoc in the UC Berkeley AMPLab. His primary research is in systems security, and his other interests include applications of machine learning to systems problems, systems for large-scale machine learning, and the impact of energy availability on computing. He received B.S. degrees in Computer Science and Mathematics from the University of Maryland in 2004, and he received his Ph.D. in Computer Science from UC San Diego in 2010.