This talk is about index and data model designs that support efficient and effective keyword search, ranging from the micro level (fundamental operators), e.g., indexes for fast set intersection, to the macro level (systems), e.g., optimizing indexes to support taxonomy keyword queries.
In the first part of the talk, I will discuss algorithms that improve the performance of a key operation in keyword search over both structured and unstructured data: the efficient computation of set intersections. I will introduce compact data structures that represent sets so that their intersection can be computed in a worst-case efficient way. In general, given k (preprocessed) sets with n elements in total, we show how to compute their intersection in expected time O(n/sqrt(w) + kr), where r is the intersection size and w is the number of bits in a machine word. I will then discuss a much more practical version of this algorithm, which has weaker asymptotic guarantees but is much simpler and performs even better in practice; both algorithms outperform state-of-the-art techniques on both synthetic and real data sets and workloads.
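To give a flavor of word-level parallelism in set intersection, here is a minimal sketch in which sets over a small universe are packed into w-bit machine words and intersected one word at a time with bitwise AND. This is only an illustration of the idea of exploiting the word width w; it is not the O(n/sqrt(w) + kr) algorithm from the talk, and all names and the universe size are hypothetical.

```python
# Hypothetical illustration: sets packed into 64-bit words, intersected
# word by word. Not the algorithm presented in the talk.

W = 64  # assumed machine-word width in bits

def pack(elements, universe_size):
    """Pack a set of integers in [0, universe_size) into a list of W-bit words."""
    n_words = (universe_size + W - 1) // W
    words = [0] * n_words
    for e in elements:
        words[e // W] |= 1 << (e % W)
    return words

def intersect(packed_sets):
    """AND all bitmaps word by word, then decode the surviving elements."""
    result = packed_sets[0][:]
    for s in packed_sets[1:]:
        for i in range(len(result)):
            result[i] &= s[i]
    out = []
    for i, word in enumerate(result):
        while word:
            low = word & -word                      # isolate the lowest set bit
            out.append(i * W + low.bit_length() - 1)
            word ^= low                             # clear that bit
    return out

a = pack({3, 70, 128, 200}, 256)
b = pack({3, 70, 131, 201}, 256)
print(intersect([a, b]))  # [3, 70]
```

Each AND handles w candidate elements at once, which is the intuition behind the sqrt(w) speedup the talk's algorithm achieves with more sophisticated hashing and packing.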
In the second part, I will show how to augment an inverted index to efficiently process a keyword query whose substitutes are defined by a given taxonomy of words (terms). This problem is challenging because each term in a query can have a large number of substitutes, and the original query can be rewritten into any combination of them. We propose building an additional index (besides the inverted index) to process such queries efficiently. For a given query workload, we formulate an optimization problem that chooses this additional index structure so as to minimize the query evaluation cost under a given index space constraint. We show that the problem is NP-hard, and propose a pseudo-polynomial-time dynamic programming algorithm as well as a (1−1/e)/4-approximation algorithm to solve it. Experimental results show that, with only 10% additional index space, our approach greatly reduces the query evaluation cost.

Finally, I will briefly introduce our Text Cube-TopCell project, which aims to support keyword queries and text analytics over structured data with both structural dimensions and text dimensions (or attributes). In the text cube model, text data is aggregated over different subsets of the structural dimensions, forming a hierarchy of entities called cube cells. Given a keyword query, our goal is to find the top-k most relevant cells in the text cube, which correspond to hidden entities at different granularities. We propose efficient algorithms for finding relevant cube cells, and a real system is being implemented (supported by NASA).
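As a back-of-the-envelope illustration of the rewriting blow-up in the second part: if each term in a query can be replaced by any of its taxonomy substitutes, the set of rewritten queries is the Cartesian product of the substitute lists, so it grows multiplicatively with query length. The toy taxonomy and function names below are hypothetical, not from the talk's system.

```python
# Hypothetical sketch: enumerating all taxonomy-based query rewrites.
from itertools import product

# Toy taxonomy: term -> list of substitutes (including the term itself).
taxonomy = {
    "laptop": ["laptop", "notebook", "computer"],
    "cheap":  ["cheap", "inexpensive", "budget"],
}

def rewrites(query_terms):
    """Enumerate every rewritten query allowed by the taxonomy."""
    choices = [taxonomy.get(t, [t]) for t in query_terms]
    return [" ".join(combo) for combo in product(*choices)]

qs = rewrites(["cheap", "laptop"])
print(len(qs))  # 9 rewrites already, for just two terms with 3 substitutes each
```

Evaluating every rewrite against a plain inverted index quickly becomes infeasible, which motivates precomputing an additional index for selected term combinations under a space budget.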