This talk is about index and data model designs that support efficient and effective keyword search, ranging from the micro level (fundamental operators), e.g., indexes for fast set intersection, to the macro level (systems), e.g., optimizing indexes to support taxonomy keyword queries.
In the first part of the talk, I will discuss algorithms that improve the performance of a key operation in keyword search over both structured and unstructured data: the efficient computation of set intersections. I will introduce compact data structures that represent sets so that their intersection can be computed in a worst-case efficient way. In general, given k (preprocessed) sets with n elements in total, we show how to compute their intersection in expected time O(n/sqrt(w) + kr), where r is the intersection size and w is the number of bits in a machine word. I will then discuss a much more practical version of this algorithm, which has weaker asymptotic guarantees but is much simpler and performs even better in practice; both algorithms outperform state-of-the-art techniques on both synthetic and real data sets and workloads.
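To give a flavor of word-level parallelism in set intersection, here is a minimal sketch in which sets over a small universe are packed into w-bit machine words and intersected one word at a time with bitwise AND. This is only an illustration of the idea of exploiting the word width w; it is not the O(n/sqrt(w) + kr) algorithm from the talk, and all names and the universe size are hypothetical.

```python
# Hypothetical illustration: sets packed into 64-bit words, intersected
# word by word. Not the algorithm presented in the talk.

W = 64  # assumed machine-word width in bits

def pack(elements, universe_size):
    """Pack a set of integers in [0, universe_size) into a list of W-bit words."""
    n_words = (universe_size + W - 1) // W
    words = [0] * n_words
    for e in elements:
        words[e // W] |= 1 << (e % W)
    return words

def intersect(packed_sets):
    """AND all bitmaps word by word, then decode the surviving elements."""
    result = packed_sets[0][:]
    for s in packed_sets[1:]:
        for i in range(len(result)):
            result[i] &= s[i]
    out = []
    for i, word in enumerate(result):
        while word:
            low = word & -word                      # isolate the lowest set bit
            out.append(i * W + low.bit_length() - 1)
            word ^= low                             # clear that bit
    return out

a = pack({3, 70, 128, 200}, 256)
b = pack({3, 70, 131, 201}, 256)
print(intersect([a, b]))  # [3, 70]
```

Each AND handles w candidate elements at once, which is the intuition behind the sqrt(w) speedup the talk's algorithm achieves with more sophisticated hashing and packing.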
In the second part, I will show how to augment an inverted index to efficiently process a keyword query whose substitutes are defined by a given taxonomy of words (terms). This problem is challenging because each term in a query can have a large number of substitutes, and the original query can be rewritten into any combination of them. We propose building an additional index (besides the inverted index) to process such queries efficiently. For a given query workload, we formulate an optimization problem that chooses this additional index structure so as to minimize the query evaluation cost under a given index space constraint. We show that the problem is NP-hard, and propose a pseudo-polynomial-time dynamic programming algorithm as well as a (1−1/e)/4-approximation algorithm to solve it. Experimental results show that, with only 10% additional index space, our approach greatly reduces the query evaluation cost.

Finally, I will briefly introduce our Text Cube-TopCell project, which aims to support keyword queries and text analytics over structured data with both structural dimensions and text dimensions (or attributes). In the text cube model, text data is aggregated over different subsets of the structural dimensions, forming a hierarchy of entities called cube cells. Given a keyword query, our goal is to find the top-k most relevant cells in the text cube, which correspond to hidden entities at different granularities. We propose efficient algorithms for finding relevant cube cells, and a real system is being implemented (supported by NASA).
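As a back-of-the-envelope illustration of the rewriting blow-up in the second part: if each term in a query can be replaced by any of its taxonomy substitutes, the set of rewritten queries is the Cartesian product of the substitute lists, so it grows multiplicatively with query length. The toy taxonomy and function names below are hypothetical, not from the talk's system.

```python
# Hypothetical sketch: enumerating all taxonomy-based query rewrites.
from itertools import product

# Toy taxonomy: term -> list of substitutes (including the term itself).
taxonomy = {
    "laptop": ["laptop", "notebook", "computer"],
    "cheap":  ["cheap", "inexpensive", "budget"],
}

def rewrites(query_terms):
    """Enumerate every rewritten query allowed by the taxonomy."""
    choices = [taxonomy.get(t, [t]) for t in query_terms]
    return [" ".join(combo) for combo in product(*choices)]

qs = rewrites(["cheap", "laptop"])
print(len(qs))  # 9 rewrites already, for just two terms with 3 substitutes each
```

Evaluating every rewrite against a plain inverted index quickly becomes infeasible, which motivates precomputing an additional index for selected term combinations under a space budget.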