Title: Priority sampling for estimating arbitrary subset sums
Speaker: Mikkel Thorup AT&T Labs--Research, Shannon Laboratory
Time: 2007-06-07 10:40-2007-06-07 11:25
Venue: Room 6A - 201, Tsinghua University

Abstract:

    From a high volume stream of weighted items, we want to create a generic sample of a certain limited size that we can later use to estimate the total weight of arbitrary subsets. Applied to internet traffic analysis, the items could be records summarizing the flows of packets streaming by a router. Subsets could be flow records from different time intervals of a worm attack whose signature is later determined. The samples taken in the past thus allow us to trace the history of the attack even though the worm was unknown at the time of sampling. 
     Estimation from the samples must be accurate even with heavy-tailed distributions where most of the weight is concentrated on a few heavy items. We want the sample to be weight sensitive, giving priority to heavy items. At the same time, we want sampling without replacement in order to avoid selecting heavy items multiple times. To fulfill these requirements we introduce priority sampling, which is the first weight-sensitive sampling scheme without replacement that is suitable for estimating subset sums. Testing priority sampling on Internet traffic analysis, we found it to perform an order of magnitude better than previous schemes.
     Priority sampling is simple to define and implement: we consider a stream of items i=0,...,n-1 with weights w_i. For each item i, we generate a random number r_i in (0,1) and create a priority q_i=w_i/r_i. The sample S consists of the k highest priority items. Let t be the (k+1)th highest priority. Each sampled item i in S gets a weight estimate W_i=max{w_i,t}, while non-sampled items get weight estimate W_i=0.
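     The scheme above can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation; the function name priority_sample is hypothetical, and a heap is used only as one convenient way to pick the k+1 highest priorities.

```python
import heapq
import random

def priority_sample(weights, k):
    """Hypothetical sketch of priority sampling: draw r_i uniform in (0,1),
    set priority q_i = w_i / r_i, keep the k highest-priority items, and let
    t be the (k+1)th highest priority. Returns {i: W_i} for sampled items."""
    assert 0 < k < len(weights)
    # 1 - random.random() lies in (0, 1], avoiding division by zero.
    prios = [(w / (1.0 - random.random()), i, w) for i, w in enumerate(weights)]
    top = heapq.nlargest(k + 1, prios)   # k sampled items plus the threshold item
    t = top[k][0]                        # the (k+1)th highest priority
    # Each sampled item gets weight estimate W_i = max(w_i, t).
    return {i: max(w, t) for _, i, w in top[:k]}
```

     Note that every sampled item's estimate satisfies W_i >= w_i, since W_i = max{w_i, t}; items outside the sample implicitly get estimate 0.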
     Magically, it turns out that the weight estimates are unbiased, that is, E[W_i]=w_i, and by linearity of expectation, we get unbiased estimators of any subset sum simply by adding the sampled weight estimates from the subset. Also, we can estimate the variance of the estimates, and surprisingly, there is no covariance between different weight estimates W_i and W_j.
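     The unbiasedness claim E[W_i]=w_i can be checked empirically. The following self-contained sketch (assumed helper names, not the authors' code) averages the subset-sum estimator over many independent samples and compares it to the true subset sum; by linearity of expectation the average should converge to the true value.

```python
import random

def _priority_estimates(weights, k):
    # Priority sampling restated for this check: q_i = w_i / r_i with r_i
    # uniform in (0, 1]; sample = k highest priorities; t = (k+1)th priority;
    # W_i = max(w_i, t) for sampled i, and W_i = 0 otherwise.
    prios = sorted(((w / (1.0 - random.random()), i, w)
                    for i, w in enumerate(weights)), reverse=True)
    t = prios[k][0]
    est = [0.0] * len(weights)
    for _, i, w in prios[:k]:
        est[i] = max(w, t)
    return est

random.seed(42)
weights = [100.0, 2.0, 3.0, 1.0, 50.0, 4.0]   # heavy-tailed toy example
subset = {0, 2, 4}
true_sum = sum(weights[i] for i in subset)     # 153.0
trials = 20000
avg = sum(sum(_priority_estimates(weights, 3)[i] for i in subset)
          for _ in range(trials)) / trials
print(true_sum, avg)   # the Monte Carlo average should be close to 153.0
```

     Note that the heavy items (weights 100 and 50) are sampled in nearly every trial, which is exactly the weight sensitivity the abstract calls for.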
     We conjecture an extremely strong near-optimality; namely that for any weight sequence, there exists no specialized scheme for sampling k items with unbiased estimators that gets smaller total variance than priority sampling with k+1 items. Recently Szegedy settled this conjecture.



Short Bio:

    Mikkel Thorup is a Lead Member of Technical Staff at AT&T Labs-Research, where he has been since 1998. He holds a PhD from Oxford University from 1993. From 1993 to 1998 he was on the faculty of the University of Copenhagen. Thorup's main work is in Algorithms and Data Structures, and he is the editor of this area for the Journal of the ACM. Currently he also serves on the editorial boards of SIAM Journal on Computing, ACM Transactions on Algorithms, and the open access journal Theory of Computing. Thorup has more than a hundred refereed publications and is a co-inventor of the Smart Sampling Technologies that lie at the heart of AT&T's Scalable Traffic Analysis Service. Thorup is a Fellow of the ACM, a Member of the Royal Danish Academy of Sciences, and an Adjunct Full Professor at the University of Copenhagen.