Publications:
( by dates)
Under Preparation / Submitted
Improved Algorithms For Structured Sparse Recovery. Lingxiao Huang, Yifei Jin, Jian Li, Haitao Wang. [Full version in ArXiv] [ Show Abstract ]
It is known that certain structures of the signal in addition to the standard notion of sparsity (called structured sparsity) can improve the sample complexity in several compressive sensing applications. Recently, Hegde et al. proposed a framework, called approximation-tolerant model-based compressive sensing, for recovering signals with structured sparsity. Their framework requires two oracles, the head- and the tail-approximation projection oracles. The two oracles should return approximate solutions in the model which is closest to the query signal. In this paper, we consider two structured sparsity models and obtain improved projection algorithms. The first one is the tree sparsity model, which captures the support structure in the wavelet decomposition of piecewise-smooth signals. We propose a linear time (1−ϵ)-approximation algorithm for head-approximation projection and a linear time (1+eps)-approximation algorithm for tail-approximation projection. The best previous result is an O(nlogn) time bicriterion approximation algorithm (meaning that their algorithm may return a solution of sparsity larger than k) by Hegde et al. Our result provides an affirmative answer to the open problem mentioned in the survey of Hegde and Indyk. As a corollary, we can recover a constant approximate k-sparse signal. The other is the Constrained Earth Mover Distance (CEMD) model, which is useful to model the situation where the positions of the nonzero coefficients of a signal do not change significantly as a function of spatial (or temporal) locations. We obtain the first single criterion constant factor approximation algorithm for the head-approximation projection. The previous best known algorithm is a bicriterion approximation. Using this result, we can get a faster constant approximation algorithm with fewer measurements for the recovery problem in CEMD model.
Gradient Boosting With Piece-Wise Linear Regression Trees. Yu Shi, Jian Li, Zhize Li. [ArXiv] [ Show Abstract ]
Gradient boosting using decision trees as base learners, so called Gradient Boosted Decision Trees (GBDT), is a very successful ensemble learning algorithm widely used across a variety of applications. Recently, various GDBT construction algorithms and implementation have been designed and heavily optimized in some very popular open sourced toolkits such as Xgboost and LightGBM. In this paper, we show that both the accuracy and efficiency of GBDT can be further enhanced by using more complex base learners. Specifically, we extend gradient boosting to use \textit{piecewise linear regression trees} (PL Trees), instead of \textit{piecewise constant regression trees}. We show PL Trees can accelerate convergence of GBDT. Moreover, our new algorithm fits better to modern computer architectures with powerful Single Instruction Multiple Data (SIMD) parallelism. We propose optimization techniques to speedup our algorithm. The experimental results show that GBDT with PL Trees can provide very competitive testing accuracy with comparable or less training time. Our algorithm also produces much concise tree ensembles, thus can often reduce testing time costs.
Survey
[Survey] Approximation Algorithms for Stochastic Combinatorial Optimization Problems. Jian Li and Yu Liu. Journal of the Operations Research Society of China. (invited article) [paper] [ Show Abstract ]
Stochastic optimization has established itself as a major method to handle uncertainty in various optimization problems, by modeling the uncertainty by a probability distribution over possible realizations. Traditionally, the main focus in stochastic optimization has been various stochastic mathematical programming (such as linear programming, convex programming). In recent years, there has been a surge of interest in stochastic combinatorial optimization problems from the theoretical computer science community. In this article, we survey some of the recent results on various stochastic versions of classical combinatorial optimization problems. Since most problems in this domain are NP-hard (or \#P-hard, or even PSPACE-hard), we focus on the results which provide polynomial time approximation algorithms, with provable approximation guarantees. Our discussions are centered around a few representative problems, such as stochastic knapsack, stochastic matching, multi-armed bandit etc. We use these examples to introduce several popular stochastic models, such as the fixed set model, 2-stage stochastic optimization model, stochastic adaptive probing model etc, as well as some useful techniques for designing approximation algorithms for stochastic combinatorial optimization problems, including the linear programming relaxation approach, boosted sampling, content resolution schemes, Poisson approximation etc. We also provide some open research questions along the way. Our purpose is to provide the readers a quick glimpse to the models, problems and techniques in this area, and hopefully inspire new contributions.
Refereed Conference/Journal Papers
BRITS: Bidirectional Recurrent Imputation for Time Series. Wei Cao, Dong Wang,Jian Li, Hao Zhou, Lei Li, Yitan Li. Thirty-second Conference on Neural Information Processing Systems (NIPS 2018) [paper] [ Show Abstract ]
Time series are widely used as signals in many classification/regression tasks. It is ubiquitous that time series contains many missing values. Given multiple correlated time series data, how to fill in missing values and to predict their class labels? Existing imputation methods often impose strong assumptions of the underlying data generating process, such as linear dynamics in the state space. In this paper, we propose BRITS, a novel method based on recurrent neural networks for missing value imputation in time series data. Our proposed method directly learns the missing values in a bidirectional recurrent dynamical system, without any specific assumption. The imputed values are treated as variables of RNN graph and can be effectively updated during the backpropagation. BRITS has three advantages: (a) it can handle multiple correlated missing values in time series; (b) it generalizes to time series with nonlinear dynamics underlying; (c) it provides a data-driven imputation procedure and applies to general settings with missing data. We evaluate our model on three real-world datasets, including an air quality dataset, a health-care data, and a localization data for human activity. Experiments show that our model outperforms the state-of-the-art methods in both imputation and classification/regression accuracies.
A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization. Zhize Li, Jian Li. Thirty-second Conference on Neural Information Processing Systems (NIPS 2018 spotlight) [ArXiv] [ Show Abstract ]
We analyze stochastic gradient algorithms for optimizing nonconvex, nonsmooth finite-sum problems. In particular, the objective function is given by the summation of a differentiable (possibly nonconvex) component, together with a possibly non-differentiable but convex component. We propose a proximal stochastic gradient algorithm based on variance reduction, called ProxSVRG+. The algorithm is a slight variant of the ProxSVRG algorithm [Reddi et al. NIPS2016]. Our main contribution lies in the analysis of ProxSVRG+. It recovers several existing convergence results (in terms of the number of stochastic gradient oracle calls and proximal operations), and improves/generalizes some others. In particular, ProxSVRG+ generalizes the best results given by the SCSG algorithm, recently proposed by [Lei at al. NIPS2017] for the smooth nonconvex case. ProxSVRG+ is more straightforward than SCSG and yields simpler analysis. Moreover, ProxSVRG+ outperforms the deterministic proximal gradient descent (ProxGD) for a wide range of minibatch sizes, which partially solves an open problem proposed in \cite{reddi2016proximal}. Finally, for nonconvex functions satisfied Polyak-\L{}ojasiewicz condition, we show that ProxSVRG+ achieves global linear convergence rate without restart. ProxSVRG+ is always no worse than ProxGD and ProxSVRG/SAGA, and sometimes outperforms them (and generalizes the results of SCSG) in this case.
eps-Coresets for Clustering (with Outliers) in Doubling Metrics. Lingxiao Huang, Shaofeng H.-C. Jiang, Jian Li, Xuan Wu. The 59th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2018) [Full version in ArXiv] [ Show Abstract ]
We present the first efficient algorithm that constructs an eps-coreset for the a general class of clustering problems in a doubling metric.
To this end, we establish the first relation between the doubling dimension of and the shattering dimension (or VC-dimension) of the range space induced by the distance. Such a relation is not known before, since one can easily construct instances in which neither one can be bounded by (some function of) the other. Surprisingly, we show that if we allow a small (1+ -eps)-distortion of the distance function $d$ (the distorted distance is called the smoothed distance function), the shattering dimension can be upper bounded. We also introduce the notion of $\tau$-error {\em probabilistic shattering dimension}, and prove a (drastically better) upper bound for the probabilistic shattering dimension for weighted doubling metrics. We believe the new relation between doubling and shattering dimensions is of independent interest and may find other applications.
Furthermore, we study robust coresets for (k,z)-clustering with outliers in a doubling metric. We show an improved connection between $\alpha$-approximation and robust coreset. This also leads to improvement upon the previous best known bound of the size of robust coreset for Euclidean space [Feldman and Langberg, STOC 11]. The new bound entails a few new results in clustering and property testing.
As another application, we show constant-sized (\eps, k, z)-centroid sets in doubling metrics can be constructed by extending our coreset construction. Prior to our result, constant-sized centroid sets for general clustering problems were only known for Euclidean spaces. We can apply our centroid set to accelerate the local search algorithm (studied in [Friggstad et al., FOCS 2016]) for the (k, z)-clustering problem in doubling metrics.
Optimal Two-Stage Mechanism for Ordinal Peer Assessment. Zhize Li, Le Zhang, Zhixuan Fang and Jian Li. Symposium on Algorithmic Game Theory (SAGT 2018). [Paper] [ Show Abstract ]
A PTAS for a Class of Stochastic Dynamic Programs. Hao Fu, Jian Li and Pan Xu. The 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018) [ArXiv] [ Show Abstract ]
We develop a framework for obtaining polynomial time approximation schemes (PTAS) for a class of stochastic dynamic programs. Using our framework, we obtain the first PTAS for several stochastic combinatorial optimization problems. In particular, we obtain PTAS for the ProbeMax Probemax: We are given a set of n items, each item i has a value Xi which is an independent random variable with a known (discrete) distribution. We can probe a subset P of items sequentially. Each time after probing an item i, we observe its value realization, which follows the distribution of Xi. We can adaptively probe at most m items and each item can be probed at most once. The reward is the maximum among the m realized values. Our goal is to design an adaptive probing policy such that the expected value of the reward is maximized. To the best of our knowledge, the best known approximation ratio is 1-1/e, due to Asadpour et al. We also obtain PTAS for several other problems: Committed Pandora¡¯s Box, Stochastic Target, Stochastic Blackjack Knapsack.
Odd Yao-Yao Graphs are Not Spanners. Yifei Jin, Jian Li, Wei Zhan. In 34th International Symposium on Computational Geometry (SoCG 2018). [ArXiv] [ Show Abstract ]
It is a long standing open problem whether Yao-Yao graphs YYk are all spanners. Bauer and Damian showed that all YY6k for k \geq 6 are spanners. Li and Zhan generalized their result and proved that all even Yao-Yao graphs YY2k are spanners (for k \geq 42). However, their technique cannot be extended to odd Yao-Yao graphs, and whether they are spanners are still elusive. In this paper, we show that, surprisingly, for any integer k \geq 1, there exist odd Yao-Yao graph YY2k+1 instances, which are not spanners. This provides a negative answer to (Problem 70) in [http://cs.smith.edu/~orourke/TOPP/P70.html]
SVM via Saddle Point Optimization: New Bounds and Distributed Algorithms. Lingxiao Huang, Yifei Jin and Jian Li. The Scandinavian Symposium and Workshops on Algorithm Theory (SWAT 2018). [ArXiv] [ Show Abstract ]
We study two important SVM variants: hard-margin SVM (for linearly separable cases) and v-SVM (for linearly non-separable cases). We propose new algorithms from the perspective of saddle point optimization. To the best of our knowledge, the current best algorithm for v-SVM is based on quadratic programming approach which requires Omega(n^2d) time in worst case~\cite{joachims1998making,platt199912}. In the paper, we provide the first nearly linear time algorithm for v-SVM. Moreover, our algorithms can be implemented in the distributed settings naturally. We prove that our algorithms almost matches the theoretical lower bound of the communication complexity.
Efficient k-Regret Query Algorithm with Restriction-free Bound for any Dimensionality. Min Xie, Raymond Chi-Wing Wong, Jian Li,Cheng Long, Ashwill Lall. 2018 ACM International Conference on Management of Data (SIGMOD 2018). [paper] [ Show Abstract ]
Extracting interesting tuples from a large database is an important problem in multi-criteria decision making. Two representative queries were proposed in the literature: top-k queries and skyline queries. A top-k query requires users to specify their utility functions beforehand and then returns k tuples to the users. A skyline query does not require any utility function from users but it puts no control on the number of tuples returned to users. Recently, a k-regret query was proposed and received attention from the community because it does not require any utility function from users and the output size is controllable, and thus it avoids those defficiencies of top-k queries and skyline queries. Speciffically, it returns k tuples that minimize a criterion called the maximum regret ratio. In this paper, we present the lower bound of the maximum regret ratio for the k-regret query. Besides, we propose a novel algorithm, called Sphere, whose upper bound on the maximum regret ratio is asymptotically optimal and restriction-free for any dimensionality, the best-known result in the literature. We conducted extensive experiments to show that Sphere performs better than the state-of-the-art methods for the k-regret query.
When Will You Arrive? Estimating Travel Time Based on Deep Neural Networks. Dong Wang, Junbo Zhang, Wei Cao, Jian Li, Yu Zheng. The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI 2018) [Paper] [Code and Data] [ Show Abstract ]
Estimating the travel time of any path (denoted by a sequence of connected road segments) in a city is of great importance to traffic monitoring, route planning, ridesharing, taxi/Uber dispatching, etc. However, it is a very challenging problem, affected by diverse complex factors, including spatial correlations, temporal dependencies, external conditions (e.g. weather, traffic lights). Prior work usually focuses on estimating the travel times of individual road segments or sub-paths and then summing up these times, which leads to an inaccurate estimation because such approaches do not consider road intersections/ traffic lights, and local errors may accumulate. To address these issues, we propose an end-to-end Deep learning framework for Travel Time Estimation (called DeepTTE) that estimates the travel time of the whole path directly. More specifically, we present a geo-convolution operation by integrating the geographic information into the classical convolution, capable of capturing spatial correlations. By stacking recurrent unit on the geo-convoluton layer, our DeepTTE can capture the temporal dependencies as well. A multi-task learning component is given on the top of DeepTTE, that learns to estimate the travel time of both the entire path and each local path simultaneously during the training phase. Extensive experiments on two trajectory datasets show our DeepTTE significantly outperforms the state-of-the-art methods.
Optimal In-Place Suffix Sorting. Zhize Li, Jian Li, Hongwei Huo. In 25th Symposium on String Processing and Information Retrieval (SPIRE 2018). Extended Abstract in Data Compression Conference (DCC 2018). [Full version in ArXiv] [ Show Abstract ]
The suffix array is a fundamental data structure for many applications that involve string searching and data compression. We obtain the \emph{first} in-place suffix array construction algorithms that are optimal both in time and space for (read-only) integer alphabets. We provide the first linear time in-place algorithm for read-only integer alphabets with |Sigma|=O(n) (i.e., the input string cannot be modified). This algorithm settles the open problem posed by [Franceschini and Muthukrishnan, ICALP'07]. For the read-only general alphabets (i.e., only comparisons are allowed), we present an optimal O(nlogn) time in-place suffix sorting algorithm, recovering the result obtained by Franceschini and Muthukrishnan, which was an open problem posed by [Manzini and Ferragina, ESA'02].
Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec. Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, Jie Tang. The 11th ACM International Conference on Web Search and Data Mining (WSDM 2018). [ArXiv] [ Show Abstract ]
Since the invention of word2vec, the skip-gram model has significantly advanced the research of network embedding, such as the recent emergence of the DeepWalk, LINE, PTE, and node2vec approaches. In this work, we show that all of the aforementioned models with negative sampling can be unified into the matrix factorization framework with closed forms. Our analysis and proofs reveal that: (1) DeepWalk empirically produces a low-rank transformation of a network's normalized Laplacian matrix; (2) LINE, in theory, is a special case of DeepWalk when the size of vertices' context is set to one; (3) As an extension of LINE, PTE can be viewed as the joint factorization of multiple networks' Laplacians; (4) node2vec is factorizing a matrix related to the stationary distribution and transition probability tensor of a 2nd-order random walk. We further provide the theoretical connections between skip-gram based network embedding algorithms and the theory of graph Laplacian. Finally, we present the NetMF method as well as its approximation algorithm for computing network embedding. Our method offers significant improvements over DeepWalk and LINE for conventional network mining tasks. This work lays the theoretical foundation for skip-gram based network embedding methods, leading to a better understanding of latent network representation learning.
Learning Gradient Descent: Better Generalization and Longer Horizons. Kaifeng Lv, Shunhua Jiang, Jian Li. The 34th International Conference on Machine Learning (ICML 2017). [ArXiv] [ Show Abstract ][Code]
Training deep neural networks is a highly nontrivial task, involving carefully selecting appropriate training algorithms, scheduling step sizes and tuning other hyperparameters. Trying different combinations can be quite labor-intensive and time consuming. Recently, researchers have tried to use deep learning algorithms to exploit the landscape of the loss function of the training problem of interest, and learn how to optimize over it in an automatic way. In this paper, we propose a new learning-to-learn model and some useful and practical tricks. Our optimizer outperforms generic, hand-crafted optimization algorithms and state-of-the-art learning-to-learn optimizers by DeepMind in many tasks. We demonstrate the effectiveness of our algorithms on a number of tasks, including deep MLPs, CNNs, and simple LSTMs.
Nearly Optimal Sampling Algorithms for Combinatorial Pure Exploration. Lijie Chen, Anupam Gupta, Jian Li, Mingda Qiao, Ruosong Wang. In the 30th Annual Conference on Learning Theory (COLT 2017) [Paper] [ Show Abstract ]
We study the combinatorial pure exploration problem CPE in a stochastic multi-armed bandit game. In a CPE instance, we are given $n$ stochastic arms with unknown reward distributions, as well as a family $\subsetfam$ of feasible subsets over the arms. Let the weight of an arm be the mean of its reward distribution. Our goal is to identify the feasible subset in $\subsetfam$ with the maximum total weight, using as few samples as possible. The problem generalizes the classical best arm identification problem and the top-$k$ arm identification problem, both of which have attracted significant attention in recent years. We provide a novel \emph{instance-wise} lower bound for the sample complexity of the problem, as well as a nontrivial sampling algorithm, matching the lower bound up to a factor of $\ln|\subsetfam|$. For an important class of combinatorial families (including spanning trees, matchings, and path constraints), we also provide polynomial time implementation of the sampling algorithm, using the equivalence of separation and optimization for convex program, and the notion of approximate Pareto curves in multi-objective optimization (note that $|\subsetfam|$ can be exponential in $n$). We also show that the $\ln|\subsetfam|$ factor is inevitable in general, through a nontrivial lower bound construction utilizing a combinatorial structure resembling the Nisan-Wigderson design. Our results significantly improve several previous results for several important combinatorial constraints, and provide a tighter understanding of the general \combibandit problem. We further introduce an even more general problem, formulated in geometric terms. We are given $n$ Gaussian arms with unknown means and unit variance. Consider the $n$-dimensional Euclidean space $\mathbb{R}^n$, and a collection $\anssetcol$ of disjoint subsets. Our goal is to determine the subset in $\anssetcol$ that contains the mean profile (which is the $n$-dimensional vector of the means), using as few samples as possible. The problem generalizes most pure exploration bandit problems studied in the literature. We provide the first nearly optimal sample complexity upper and lower bounds for the problem.
Towards Instance Optimal Bounds for Best Arm Identification. Lijie Chen, Jian Li, Mingda Qiao. In the 30th Annual Conference on Learning Theory (COLT 2017) [ArXiv] [ Show Abstract ]
We study the best arm identification (BEST-1-ARM) problem, which is defined as follows. We are given n stochastic bandit arms. The ith arm has a reward distribution Di with an unknown mean \mu_i. Upon each play of the ith arm, we can get a reward, sampled i.i.d. from Di. We would like to identify the arm with largest mean with probability at least 1-\delta, using as few samples as possible. We obtain nearly instance optimal sample complexity for best arm, thereby making significant progress towards a complete resolution of the ( gap-entropy conjecture), from both the upper and lower bound sides. The paper is a followup of our early paper: On the Optimal Sample Complexity for Best Arm Identification. Lijie Chen, Jian Li. [ArXiv] The gap entropy conjecture - concerning the instance optimality of the best-1-arm problem. See COLT16 open problem
CDB: A Crowd-Powered Database. Guolinag Li, Chengliang Chai, Ju Fan, Jian Li, Yudian Zheng etc. 2017 ACM International Conference on Management of Data (SIGMOD 2017). [paper][ Show Abstract ]
Crowdsourcing database systems have been proposed to lever- age crowd-powered operations to encapsulate the complex- ities of interacting with the crowd. Existing systems suffer from two major limitations. Firstly, in order to optimize a query, they often adopt the traditional tree model to se- lect an optimized table-level join order. However, the tree model provides a coarse-grained optimization, which gener- ates the same order for different joined tuples and limits the optimization potential that different joined tuples can be optimized by different orders. Secondly, they mainly focus on optimizing the monetary cost. In fact, there are three optimization goals (i.e., smaller monetary cost, lower laten- cy, and higher quality) in crowdsourcing, and it calls for a system to enable multi-goal optimization. To address these limitations, we develop a crowd-powered database system CDB that supports crowd-based query op- timizations. CDB has the following fundamental differences compared with the existing systems. Firstly, CDB employs a graph-based query model that provides more fine-grained query optimization. Secondly, CDB adopts a unifieded frame- work to perform the multi-goal optimization based on the graph model. We have implemented our system and deployed it on AMT and CrowdFlower. We have also created a benchmark for evaluating crowd-powered databases. We have conducted both simulated and real experiments, and the experimental results demonstrate the performance su- periority of CDB on cost, latency and quality.
Capacitated Center Problems with Two-Sided Bounds and Outliers, Hu Ding, Lunjia Hu, Huang Lingxiao and Jian Li. Workshop on Algorithms and Data Structures (WADS 2017) [ArXiv][ Show Abstract ]
In recent years, the capacitated center problems have at- tracted a lot of research interest. Given a set of vertices V , we want to find a subset of vertices S, called centers, such that the maximum cluster radius is minimized. Moreover, each center in S should satisfy some capacity constraint, which could be an upper or lower bound on the number of vertices it can serve. Capacitated k-center problems with one-sided bounds (upper or lower) have been well studied in previous work, and a constant factor approximation was obtained. We are the first to study the capacitated center problem with both ca- pacity lower and upper bounds (with or without outliers). We assume each vertex has a uniform lower bound and a non-uniform upper bound. For the case of opening exactly k centers, we note that a generaliza- tion of a recent LP approach can achieve constant factor approximation algorithms for our problems. Our main contribution is a simple combi- natorial algorithm for the case where there is no cardinality constraint on the number of open centers. Our combinatorial algorithm is simpler and achieves better constant approximation factor compared to the LP approach.
DeepSD: Supply-Demand Prediction for Online Car-hailing Services using Deep Neural Networks. Dong Wang, Wei Cao, Jian Li, Jieping Ye. The 33th IEEE International Conference on Data Engineering (ICDE 2017). [paper] [ Show Abstract ]
The online car-hailing service has gained great popularity all over the world. As more passengers and more drivers use the service, it becomes increasingly more important for the the car-hailing service providers to effectively schedule the drivers to minimize the waiting time of passengers and maximize the driver utilization, thus to improve the overall user experience. In this paper, we study the problem of predicting the real-time car-hailing supply-demand, which is one of the most important component of an effective scheduling system. Our objective is to predict the gap between the car-hailing supply and demand in a certain area in the next few minutes. Based on the prediction, we can balance the supply-demands by scheduling the drivers in advance. We present an end-to-end framework called Deep Supply-Demand (DeepSD) using a novel deep neural network structure. Our approach can automatically discover complicated supply-demand patterns from the car-hailing service data while only requires a minimal amount hand-crafted features. Moreover, our framework is highly flexible and extendable. Based on our framework, it is very easy to utilize multiple data sources (e.g., car-hailing orders, weather and traffic data) to achieve a high accuracy. We conduct extensive experimental evaluations, which show that our framework provides more accurate prediction results than the existing methods.
Cleaning Relations using Knowledge Bases. Shuang Hao, Nan Tang, Guoliang Li, Jian Li. The 33th IEEE International Conference on Data Engineering (ICDE 2017). Journal version in VLDB Journal 2018 [paper] [ Show Abstract ] [Journal link]
We study the data cleaning problem of detecting and repairing wrong relational data, as well as marking correct data, using well curated knowledge bases (KBs). We propose detective rules (DRs), a new type of data cleaning rules that can make actionable decisions on relational data, by building connections between a relation and a KB. The main invention is that, a DR simultaneously models two opposite semantics of a relation using types and relationships in a KB: the positive semantics that explains how attribute values are linked to each other in correct tuples, and the negative semantics that indicates how wrong attribute values are connected to other correct attribute values within the same tuples. Naturally, a DR can mark correct values in a tuple if it matches the positive semantics. Meanwhile, a DR can detect/repair an error if it matches the negative semantics. We study fundamental problems associated with DRs, e.g., rule generation and rule consistency. We present efficient algorithms to apply DRs to clean a relation, based on rule order selection and inverted indexes. Extensive experiments, using both real-world and synthetic datasets, verify the effectiveness and efficiency of applying DRs in practice.
Nearly Instance Optimal Sample Complexity Bounds for Top-k Arm Selection. Lijie Chen, Jian Li, Mingda Qiao. The 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017). [full version in ArXiv] [ Show Abstract ]
In the Best-k-Arm problem, we are given n stochastic bandit arms, each associated with an unknown reward distribution. We are required to identify the k arms with the largest means by taking as few samples as possible. In this paper, we make progress towards a complete characterization of the instance-wise sample complexity bounds for the Best-k-Arm problem. On the lower bound side, we obtain a novel complexity term to measure the sample complexity that every Best-k-Arm instance requires. This is derived by an interesting and nontrivial reduction from the Best-1-Arm problem. We also provide an elimination based algorithm that matches the instancewise lower bound within doubly-logarithmic factors. The sample complexity of our algorithm is strictly better than the state-of-the-art for Best-k-Arm (module constant factors).
k-Regret Minimizing Set: Efficient Algorithms and Hardness. Wei Cao, Jian Li, Haitao Wang, Kangning Wang, Ruosong Wang, Raymond Chi-Wing Wong and Wei Zhan. The 20th International Conference on Database Theory (ICDT 2017), Venice, Italy. (best newcomer award) [paper] [full version] [ Show Abstract ]
We study the k-regret minimizing query (k-RMS), which is a useful operator for supporting multi-criteria decision-making. Given two integers k and r, a k-RMS returns r tuples from the database which minimize the k-regret ratio, defined as one minus the worst ratio between the k-th maximum utility score among all tuples in the database and the maximum utility score of the r tuples returned. A solution set contains only r tuples, enjoying the benefits of both top-k queries and skyline queries. Proposed in 2012, the query has been studied extensively in recent years. In this paper, we advance the theory and the practice of k-RMS in the following aspects. First, we develop efficient algorithms for k-RMS (and its decision version) when the dimensionality is 2. The running time of our algorithms outperforms those of previous ones. Our experimental results show that our algorithms are more efficient than previous ones on both synthetic and real datasets up to three orders of magnitude. Second, we show that k-RMS is NP-hard even when the dimensionality is 3. This provides a complete characterization of the complexity of k-RMS, and answers an open question in previous studies. In addition, we present approximation algorithms for the problem when the dimensionality is 3 or larger.
Stochastic k-Center and j-Flat-Center Problems. Lingxiao Huang, Jian Li. ACM-SIAM Symposium on Discrete Algorithms (SODA17). [ArXiv] [ Show Abstract ]
Solving geometric optimization problems over uncertain data have become increasingly important in many applications and have attracted a lot of attentions in recent years. In this paper, we study two important geometric optimization problems, the k-center problem and the j-flat-center problem, over stochastic/uncertain data points in Euclidean spaces. For the stochastic k-center problem, we would like to find k points in a fixed dimensional Euclidean space, such that the expected value of the k-center objective is minimized. For the stochastic j-flat-center problem, we seek a j-flat (i.e., a j-dimensional affine subspace) such that the expected value of the maximum distance from any point to the j-flat is minimized. We consider both problems under two popular stochastic geometric models, the existential uncertainty model, where the existence of each point may be uncertain, and the locational uncertainty model, where the location of each point may be uncertain. We provide the first PTAS (Polynomial Time Approximation Scheme) for both problems under the two models. Our results generalize the previous results for stochastic minimum enclosing ball and stochastic enclosing cylinder.
Combinatorial Multi-Armed Bandit with General Reward Functions, Wei Chen, Wei Hu, Fu Li, Jian Li, Yu Liu, Pinyan Lu. Neural Information Processing Systems (NIPS), 2016. [Full version in ArXiv] [ Show Abstract ]
In this paper, we study the stochastic combinatorial multi-armed bandit (CMAB) framework that allows a general nonlinear reward function, whose expected value may not depend only on the means of the input random variables but possibly on the entire distributions of these variables. Our framework enables a much larger class of reward functions such as the $\max()$ function and nonlinear utility functions. Existing techniques relying on accurate estimations of the means of random variables, such as the upper confidence bound (UCB) technique, do not work directly on these functions. We propose a new algorithm called stochastically dominant confidence bound (SDCB), which estimates the distributions of underlying random variables and their stochastically dominant confidence bounds. We prove that if the underlying variables have known finite supports, SDCB can achieve $O(\log T)$ distribution-dependent regret and $\tilde{O}(\sqrt{T})$ distribution-independent regret, where $T$ is the time horizon. For general arbitrary distributions, we further use a discretization technique and show an $\tilde{O}(\sqrt{T})$ regret bound. We apply our results to the $K$-MAX problem and the expected utility maximization problems. In particular, for $K$-MAX, we provide the first polynomial-time approximation scheme (PTAS) for its offline problem, and give the first $\tilde{O}(\sqrt{T})$ bound on the $(1-\epsilon)$-approximation regret of its online problem, for any $\epsilon > 0$.
Demand Driven Store Placement via Multiple Spatial-temporal Data. Mengwen Xu, Tianyi Wang, Zhengwei Wu, Jingbo Zhou, Jian Li and Haishan Wu. ACM SIGSPATIAL 2016. [paper][ Show Abstract ]
Choosing a good location when opening a new store is crucial for the future success of a business. Traditional methods in- clude online manual survey, analytic models based on census data, which are either unable to adapt to the dynamic mar- ket or very time consuming. The rapid increase of the avail- ability of big data from various types of mobile devices, such as online query data and online positioning data, provides us with the possibility to develop automatic and accurate data-driven prediction models for business store placemen- t. In this paper, we propose a Demand Distribution Driven Store Placement (D3SP) framework for business store place- ment by mining search query data from Baidu Maps. D3SP detects the spatial-temporal distributions of customer demands on different business services via query data from Baidu Maps, the largest online map search engine in China, and detects the gaps between demand and supply. Then we determine candidate locations via clustering such gaps. In the final stage, we solve the location optimization problem by predicting and ranking the number of customers. We not only deploy supervised regression models to predict the num- ber of customers, but also use learn-to-rank model to directly rank the locations. We evaluate our framework on various types of businesses in real-world cases, and the experiments results demonstrate the effectiveness of our methods. D3SP as the core function for store placement has already been implemented as a core component of our business analyt- ics platform and could be potentially used by chain store merchants on Baidu Nuomi.
DESTPRE : A Data-Driven Approach to Destination Prediction for Taxi Rides. Mengwen Xu, Dong Wang, Jian Li. UbiComp 2016. [ Paper ] [ Show Abstract ]
With the wide use of mobile devices, predicting the destination of moving vehicles has become an increasingly important problem for location based recommendation systems and destination-based advertising. Most existing approaches are based on various Markov chain models, in which the historical trajectories are used to train the model and the top-k most probable destinations are returned. We identify certain limitations of the previous approaches. Instead, we propose a new data-driven framework, called DESTPRE, which is not based on a probabilistic model, but directly operates on the trajectories and makes the prediction. We make use of only historic trajectories, without individual identity information. Our design of DESTPRE, although simple, is a result of several useful observations from the real trajectory data. DESTPRE involves an index based on Bucket PR Quadtree and Minwise hashing, for efficiently retrieving similar trajectories, and a clustering on destinations for predictions. By incorporating some additional ideas, we show that the prediction accuracy can be further improved. We have conducted extensive experiments on real Beijing Taxi dataset. The experimental results demonstrate the effectiveness of DESTPRE.
epsilon-Kernel Coresets for Stochastic Points. Lingxiao Huang, Jian Li, Jeff Phillips and Haitao Wang. The 24rd Annual European Symposium on Algorithms (ESA 2016). [ArXiv] [ Show Abstract ]
We consider the standard stochastic geometry model in which the existence or the location of each point may be stochastic. We show there exists constant-size kernel coreset (a kernel can approximate either the expected value or the distribution of the width of the point set in every direction) and we can construct such kernel coresets in nearly linear time. We show its applications to several function extent problems, tracking moving points, and shape fitting problems in the stochastic setting.
Almost All Even Yao-Yao Graphs Are Spanners. Jian Li, Wei Zhan. The 24rd Annual European Symposium on Algorithms (ESA 2016). [ArXiv] [ Show Abstract ]
It is an open problem whether Yao-Yao graphs YYk (also known as sparse-Yao graphs) are all spanners when the integer parameter k is large enough. In this paper we show that, for any integer k\geq 42, the Yao-Yao graph YY2k is a tk-spanner, with stretch factor tk=4.27+O(k−1) when k tends to infinity. Our result generalizes the best known result which asserts that all YY6k are spanners for k large enough [Bauer and Damian, SODA'13]. Our proof is also somewhat simpler.
Pure Exploration of Multi-armed Bandit Under Matroid Constraints. Lijie Chen, Anupum Gupta, Jian Li. Conference on Learning Theory (COLT 2016). [Full version] [ Show Abstract ]
We study the pure exploration problem subject to a matroid constraint (BEST-BASIS) in a stochastic multi-armed bandit game. In a BEST-BASIS instance, we are given n stochastic arms with unknown reward distributions, as well as a matroid M over the arms. Let the weight of an arm be the mean of its reward distribution. Our goal is to identify a basis of M with the maximum total weight, using as few samples as possible. The problem is a significant generalization of the best arm identification problem and the top-k arm identification problem, which have attracted significant attentions in recent years. We study both the exact and PAC versions of BEST-BASIS, and provide algorithms with nearly-optimal sample complexities for these versions. Our results generalize and/or improve on several previous results for the top-k arm identification problem and the combinatorial pure exploration problem (when the combinatorial constraint is a matroid).
K-Means Clustering with Distributed Dimension. Hu Ding, Yu Liu, Lingxiao Huang, Jian Li. The 33rd International Conference on Machine Learning (ICML 2016). [Paper] [Supplementary] [ Show Abstract ]
Distributed clustering has attracted significant attention in recent years. In this paper, we study the k-means problem in the distributed dimension setting, where the dimensions of the data are partitioned across multiple machines. We provide new approximation algorithms, which incur low communication costs and achieve constant approximation factors. The communication complexity of our algorithms significantly improve on existing algorithms. We also provide the first communication lower bound, which nearly matches our upper bound in a wide range of parameter setting. Our experimental results show that our algorithms outperform existing algorithms on real data sets in the distributed dimension setting.
Cost-Effective Crowdsourced Entity Resolution: A Partial-Order Approach, Chengliang Chai, Guoliang Li, Jian Li, Dong Deng, Jianhua Feng. The annual ACM SIGMOD conference 2016. Journal version in VLDB Journal 2018. [Paper] [ Show Abstract ][Journal link]
Crowdsourced entity resolution has recently attracted a significant attention because it can harness the wisdom of crowds to improve the quality of entity resolution. However existing techniques either cannot achieve perfect quality or incur huge monetary costs. To address these problems, we propose a cost-effective crowdsourced entity resolution framework, which can significantly reduce the monetary cost while keeping high quality. We first define a partial order on the pairs of records. Then we select a pair as a question and ask the crowd to check whether the records in the pair refer to the same entity. After getting the answer of this pair, we infer the answers of other pairs based on the partial order. Next we iteratively select pairs without answers to ask until all pairs have answers. We devise effective algorithms to judiciously select the pairs to ask in order to reduce the number of asked pairs. To further reduce the cost, we propose a grouping technique to group the pairs such that we only ask one pair instead of all pairs in each group. We develop error-tolerant techniques to tolerate the errors introduced by the partial order and the crowd. Experimental results show that our method reduces the cost to 1.25\% of existing approaches (or existing approaches take more than 80 times money of our method) while not sacrificing the quality.
ETCPS: An Effective and Scalable Traffic Condition Prediction System, Dong Wang, Wei Cao, Mengwen Xu and Jian Li The 21st International Conference on Database Systems for Advanced Applications (DASFAA 2016). [paper] [ Show Abstract ]
Real-time prediction of the traffic condition is an important ingredient for a variety of applications. In this paper, we propose an Ensemble based Traffic Condition Prediction System (ETCPS) for predicting the traffic conditions of any roads in a city based on the current and historical GPS data collected from floating vehicles. We have observed two useful correlations in the traffic condition time series, which are the bases of our design. In order to exploit these two correlations for prediction, we propose two different models called Predictive Regression Tree (PR-Tree) and Spatial Temporal Probabilistic Graphical Model (STPGM). Our best quality prediction is achieved by a careful ensemble of the two models. Our system provides high-quality prediction and can easily scale to very large datasets. We conduct extensive experimental evaluations with a large GPS data set collected from more than 12,000 taxis in Beijing during two months. The experimental results demonstrate the effectiveness, efficiency, and scalability of our system.
Automatic User Identification Method across Heterogeneous Mobility Data Souces. Wei Cao, Zhengwei Wu, Jian Li, Haishan Wu. In International Conference on Data Engineering (ICDE2016). [ Show Abstract ]
On Top-k Selection in Multi-Armed Bandits and Hidden Bipartite Graphs. Wei Cao, Jian Li, Yufei Tao, Zhize Li. Neural Information Processing Systems (NIPS), 2015. [full paper] [ Show Abstract ]
This paper discusses how to efficiently choose from n unknown distributions the k ones whose means are the greatest by a certain metric, up to a small relative error. We study the topic under two standard settingsâ€”multi-armed bandits and hidden bipartite graphs. For both settings, we prove lower bounds on the total number of samples needed, and propose optimal algorithms whose sample complexities match those lower bounds.
Stochastic Online Greedy Learning with Semi-bandit Feedbacks. Tian Lin, Jian Li, Wei Chen. Neural Information Processing Systems (NIPS), 2015. [full paper] [ Show Abstract ]
We study the online learning problem when the input to the greedy algorithm is stochastic with unknown parameters that have to be learned over time. We first propose the greedy regret and eps-quasi greedy regret as learning metrics comparing with the performance of offline greedy algorithm. We propose two online greedy learning algorithms with semi-bandit feedbacks, which achieve O(log T) problem-dependent regret bound for a general class of combinatorial structures and reward functions that allow greedy solutions.
Efficient Algorithms for One-Dimensional k-Center Problems. Danny Z. Chen, Jian Li, Haitao Wang. Theoretical Computer Science (TCS), 2015 [ArXiv] [ Show Abstract ]
We consider the problem of finding k centers for n weighted points on a real line. This (weighted) k-center problem was solved in O(n log n) time previously by using Cole's parametric search and other complicated approaches. In this paper, we present an simpler O(n log n) time algorithm that avoids the parametric search, and in certain special cases our algorithm solves the problem in O(n) time.
A PTAS for the Weighted Unit Disk Cover Problem, Jian Li, Yifei Jin. The 42nd International Colloquium on Automata, Languages, and Programming (ICALP 2015) [ArXiv] [ Show Abstract ]
We are given a set of weighted unit disks and a set of points in Euclidean plane. The minimum weight unit disk cover (\UDC) problem asks for a subset of disks of minimum total weight that covers all given points. For the weighted \UDC\ problem, several constant approximations have been developed. However, whether the problem admits a PTAS has been an open question. We present the first PTAS for \UDC. Our result implies the first PTAS for the minimum weight dominating set problem in unit disk graphs, the first PTAS for the maximum lifetime coverage problem and an improved constant approximation ratio for the connected dominating set problem in unit disk graphs.
Approximating the Expected Values for Combinatorial Optimization Problems over Stochastic Points. Lingxiao Huang, Jian Li. The 42nd International Colloquium on Automata, Languages, and Programming (ICALP 2015). [ArXiv] This version subsumes a previous draft: Minimum Spanning Trees, Perfect Matchings and Cycle Covers Over Stochastic Points in Metric Spaces. [ Show Abstract ]
We consider the stochastic geometry model where the location of each node is a random point in a given metric space, or the existence of each node is uncertain. We study the problems of computing the expected lengths of several combinatorial or geometric optimization problems over stochastic points, including closest pair, minimum spanning tree, k-clustering, minimum perfect matching, and minimum cycle cover. We also consider the problem of estimating the probability that the length of the closest pair, or the diameter, is at most, or at least, a given threshold. Most of the above problems are known to be #P-hard. We obtain FPRAS for most of them in both the existential and the locational uncertainty models.
Learning Arbitrary Statistical Mixtures of Discrete Distributions. Jian Li, Yuval Rabani, Leonard J. Schulman, Chaitanya Swamy, In ACM Symposium on the Theory of Computing (STOC 2015). [ArXiv] [ Show Abstract ]
We study the problem of learning from unlabeled samples very general statistical mixture models on large finite sets. Specifically, the model to be learned, \mix, is a probability distribution over probability distributions p, where each such p is a probability distribution over [n] ={1,2,...,n}. When we sample from \mix, we do not observe p directly, but only indirectly and in very noisy fashion, by sampling from [n] repeatedly, independently K times from the distribution p. The problem is to infer \mix to high accuracy in transportation (earthmover) distance. We give the first efficient algorithms for learning this mixture model without making any restricting assumptions on the structure of the distribution. We bound the quality of the solution as a function of the size of the samples K and the number of samples used. Our model and results have applications to a variety of unsupervised learning scenarios, including learning topic models and collaborative filtering.
Efficient Similarity Search and Join on Multi-Attribute Data. Guoliang Li, Jian He, Deng Dong, Jian Li, Jianhua Feng. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2015).
Approximation Algorithms for the Connected Sensor Cover Problem. Lingxiao Huang, Jian Li, Qicai Shi. The 21st Annual International Computing and Combinatorics Conference (COCOON'15) (in the conference version, the construction of the steiner tree LP and lemma 4 have some problems. please read the new version here) [paper] [ Show Abstract ]
We are given a set of sensors and a set of target points in the Euclidean plane. In MIN-ConnectSensorCover, our goal is to find a set of sensors of minimum cardinality, such that all target points are covered, and all sensors can communicate with each other (i.e., the communication graph is connected). In Budgeted-CSC problem, our goal is to choose a set of B sensors, such that the number of targets covered by the chosen sensors is maximized and the communication graph is connected. We obtain constant approximations for both problems.
Linear Time Approximation Schemes for Geometric Maximum Coverage, Jian Li, Haitao Wang, Bowei Zhang, Ningye Zhang. The 21st Annual International Computing and Combinatorics Conference (COCOON'15). [conference version] [ Show Abstract ] [full version] (full version authors: Kai Jin, Jian Li, Haitao Wang, Bowei Zhang, Ningye Zhang.)
We study approximation algorithms for the following geometric version of the maximum coverage problem: Let P be a set of n weighted point in the plane. We would like to place m rectangles such that total weights of covered points in P is maximized. For any m, we present linear time approximation schemes that can find a 1+eps approximation to the optimal solution.
Balanced Splitting on Weighted Intervals. Wei Cao, Jian Li, Shimin Li, and Haitao Wang. Operation Research Letters, 2015 (accepted).
Range Queries on Uncertain Data, Jian Li, Haitao Wang International Symposium on Algorithms and Computation (ISAAC 2014). Journal version accepted in TCS, 2015 [arXiv]. [ Show Abstract ]
We are given a set of stochastic points on the real line, we consider the problem of building data structures on P to answer range queries: find the top-k points that lie in the given interval with the highest probability.
On the Energy Efficiency of Device Discovery in Mobile Opportunistic Networks: A Systematic Approach. Bo Han, Jian Li, Aravind Srinivasan. IEEE Transactions on Mobile Computing (TMC), 2014 [Paper]
Optimal PAC Multiple Arm Identification with Applications to Crowdsourcing, Yuan Zhou, Xi Chen, Jian Li. International Conference on Machine Learning. ICML 2014. [full version] [ Show Abstract ]
We study the problem of selecting K arms with the highest expected rewards in a stochastic N-armed bandit game. We propose a new PAC algorithm, which, with probability at least 1-delta , identifies a set of K arms with average reward at most \epsilon away from the top-k arm. Naive uniform sampling requires O(nlnn) samples. We show it is possible to achieve linear sample complexity. We also establish a matching lower bound (meaning our upper bound is worse-case optimal).
The Multi-shop Ski Rental Problem, Lingqing Ai, Xian Wu, Lingxiao Huang, Longbo Huang, Pingzhong Tang, and Jian Li, Proceedings of ACM SIGMETRICS, 2014. [paper]
We consider the multi-shop ski rental problem. This problem generalizes the classic ski rental problem to a multi-shop setting, in which each shop has different prices for renting and purchasing a pair of skis, and a consumer has to make decisions on when and where to buy. We given optimal close-form online competitive strategy from the consumer's perspective and a linear time algorithm for computing the optimal strategy. The problem finds applications in cost management in IaaS cloud and scheduling in distributed computing.
Fully Polynomial Approximation Scheme for Approximating a Sum of Random Variables, Jian Li and Tianlin Shi. In Operation Research Letters (ORL), 2014 [ArXiv] [Code (by Tianlin)] [ Show Abstract ]
We show there is an FPTAS for approximating Pr[âˆ‘X_iâ‰?] for a set of independent (not necessarily identically distributed) random variables X_i.
A Pruned Exhaustive Search Algorithm for Nearly Optimal Diversified Result Ranking, Fei Chen, Yiqun Liu, Jian Li, Min Zhang and Shaoping Ma, 23rd International World Wide Web Conference, Seoul, Korea, April 7-11, 2014 (Poster)
Egalitarian Pairwise Kidney Exchange: Fast Algorithms via Linear Programming and Parametric Flow. Jian Li, Yicheng Liu, Lingxiao Huang, Pingzhong Tang. In the 13th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2014) [paper] [ Show Abstract ]
We obtain an efficient poly-time algorithm for the pairwise kidney exchange problem proposed by Roth et al. [J. Econ.Th. 05]. Their original algorithm runs in exponential time. We also provide an alternative short proof of the fact that there exists a majorizing vector in a polymatroid (original proof due to Tamir [MOR95]).
A Constant Factor Approximation Algorithm for Fault-Tolerant k-Median. Mohammadtaghi Hajiaghayi, Wei Hu, Jian Li, Shi Li, Barna Saha. In the ACM-SIAM Symposium on Discrete Algorithms(SODA 2014), Portland, Oregon, USA. Journal version accepted by ACM Transcations on Algorithms, 2016. [ArXiv] [ Show Abstract ]
(1) We present the first constant approximation for the fault-tolerant k-median problem even when the demands are nonuniform (each demand point should be assigned to a given number of facilities). This generalizes a previous constant approximation for the uniform case and improves on a previous logarithmic approximation for the general nonuniform case. (2) We present the first constant approximation for the fault-tolerant facility location problem even when the weight function are non-monotone. This generalizes a previous constant approximation for the monotone case.
Optimal Allocation for Chunked-Reward Advertising. Weihao Kong, Jian Li, Tao Qin, Tie-Yan Liu. In The 9th Conference on Web and Internet Economics (WINE 2013), Cambridge, MA, USA [ArXiv]
Ranking with Diverse Intents and Correlated Contents. Jian Li and Zeyu Zhang. In the 33rd Foundations of Software Technology and Theoretical Computer Science conference (FSTTCS 2013), IIT Guwahati, India.[ArXiv]
Your Friends Have More Friends Than You Do: Identifying Influential Mobile Users Through Random Walks. Bo Han, Jian Li and Aravind Srinivasan. IEEE/ACM Transactions on Networking (TON), 2013 [Paper] [ Show Abstract ]
we investigate the problem of identifying influential users in mobile social networks. Influential users are individuals with high centrality in their social-contact graphs. Traditional approaches find these users through centralized algorithms. However, the computational complexity of these algorithms is known to be very high, making them unsuitable for large-scale networks. We propose a lightweight and distributed protocol, iWander, to identify influential users through fixed length random-walk sampling. We prove that random-walk sampling with O(log n) steps, where n is the number of nodes in a graph, comes quite close to sampling vertices approximately according to their degrees. The most attractive feature of iWander is its extremely low control-message overhead, which lends itself well to mobile applications.
Scalable Column Concept Determination for Web Tables Using Large Knowledge Bases. Dong Deng, Yu Jiang, Guoliang Li, Jian Li, Cong Yu. In the 39th International Conference on Very Large Data Bases (VLDB 2013), Italy, 2013 [Paper][Full version].
Trinary-Projection Trees for Approximate Nearest Neighbor Search. Jingdong Wang, Naiyan Wang, You Jia; Jian Li, Gang Zeng, Hongbin Zha, Xian-Sheng Hua. The IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2013 [Paper]
Stochastic Combinatorial Optimization via Poisson Approximation. Jian Li and Wen Yuan. In the 45th ACM Symposium on the Theory of Computing (STOC 2013), USA,2013 [Paper] [Full version in ArXiv] [ Show Abstract ]
We propose a new technique, called Poisson approximation, for approximating/optimizing the sum of a set of random variables. In many cases, it reduces the stochastic problem to a deterministic multidimensional packing/covering problem. We can apply it to several problems, such as (1) Expected utility maximization (we can reproduce and generalize one result in our FOCS11 paper.) (2) Stochastic knapsack (proposed in [Dean et al. MOR08]): we are given a set items. The size and/or the profit of an item may not be fixed values and only their probability distributions are known to us in advance. The actual size and profit of an item are revealed to us as soon as it is inserted into the knapsack. We want to find a policy to insert the items such that the total expected profit is maximized. We obtain a bicriterion PTAS (i.e., 1+eps approximation with 1+eps capacity), even we allow correlation and cancellation of jobs. (3) Bayesian Online Selection with knapsack constraint: it is a variant of stochastic knapsack where the size and the profit of an item are revealed before the decision whether to select the item is made.
Matroid and Knapsack Center Problems. Danny Z. Chen, Jian Li, Hongyu Liang, and Haitao Wang. In The 16th Conference on Integer Programming and Combinatorial Optimization (IPCO 2013), Chile, 2013 [Paper] [Full version in ArXiv]. Journal version accepted in Algorithmica, 2015. [ Show Abstract ]
We present the first constant approximation for the fault-tolerant k-median problem (each demand point should be assigned to a given number of facilities).
Algorithms on Minimizing the Maximum Sensor Movement for Barrier Coverage of a Linear Domain; Danny Z. Chen, Yan Gu, Jian Li and Haitao Wang; In 13th Scandinavian Symposium and Workshops on Algorithm Theory (SWAT 2012), Helsinki, Finland, 2012 [ArXiv]. Journal version in Discrete Computational Geometry (DCG), 2013 [Journal doi] [ Show Abstract ]
Given a set of small intervals, and a target interval I, we would like to move the small intervals to cover the target interval. Our goal is to minimize the maximum movement of any interval. We obtain an O(n^2 log n) algorithm for this problem.
The load-distance balancing problem. Edward Bortnikov, Samir Khuller, Jian Li, Yishay Mansour and Seffi Naor. Networks, 2012. [Paper] [doi] [ Show Abstract ]
We study a problem referred to as the load-distance balancing (LDB) problem, where the objective is assigning a set of clients to a set of given servers. Each client suffers a delay, that is, the sum of the network delay (which is proportional to the distance to its server) and the congestion delay at this server, a nondecreasing function of the number of clients assigned to the server. We address two flavors of LDBâ€”the first one seeking to minimize the maximum incurred delay, and the second one targeted for minimizing the average delay. We also provide bounds for the price of anarchy for the game theoretic version of the problem.
Maximizing Expected Utility for Stochastic Combinatorial Optimization Problems; Jian Li, and Amol Deshpande; In Proceedings of the 52nd Annual IEEE Symposium on Foundations of Computer Science (FOCS 2011), Palm Springs, California, 2011. [Paper] [Full Version in ArXiv]. [ Show Abstract ]
We consider the problem of maximizing the expected utility E[u(X(S))] where X(S)=\sum_{i \in S}X_i and S is a feasible set for some combinatorial optimization problems (such as shortest path, minimum spanning tree, knapsack, etc.). (1) We present an additive PTAS for bounded Lipshitz utility function. This has implications in the VaR problem such as Pr[X(S)\leq 1](which corresponds to a step utility function). (2) We give PTAS for increasing concave functions (which are widely used to model risk-averse behaviors) and increasing function with bounded derivative.
DataSynth: Generating Synthetic Data using Declarative Constraints. Arvind Arasu, Raghav Kaushik, and Jian Li. In the 37th International Conference on Very Large Data Bases (VLDB 2011), Seattle, Wasington, 2011. (Demo)
Data Generation using Declarative Constraints. Arvind Arasu, Raghav Kaushik, and Jian Li. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2011), Athens, Greece, 2011. [Paper] [ Show Abstract ]
We study the problem of generating synthetic databases having declaratively specified characteristics. This problem is motivated by database system and application testing, data masking, and benchmarking. We argue that a natural, expressive, and declarative mechanism for specifying data characteristics is through cardinality constraints; a cardinality constraint specifies that the output of a query over the generated database have a certain cardinality. While the data generation problem is in tractable in general, we present efficient algorithms that can handle a large and useful class of constraints.
Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases. Bhargav Kanagal, Jian Li, Amol Deshpande. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2011), Athens, Greece, 2011. [Paper] [ Show Abstract ]
We extend our VLDB09 paper to continuous distributions. We give exact algorithm for polynomial PDFs and approximation algorithms for arbitrary continuous distributions using techniques in approximation theory, such as splines, quadratures, with convergence guarantees better than naive Monte Carlo.
Generalized Machine Activation Problems. Jian Li and Samir Khuller. In the ACM-SIAM Symposium on Discrete Algorithms (SODA 2011), San Francisco, USA, 2011. [Paper][slides] [ Show Abstract ]
We obtain tight approximation results for the machine activation problem proposed in our SODA10 work. We also obtain the first lnn-approximation for the universal facility location problem in non-metric spaces.
Ranking Continuous Probabilistic Datasets. Jian Li and Amol Deshpande. In the 36th International Conference on Very Large Data Bases (VLDB 2010), Singapore, 2010. [Paper] [slides] [ Show Abstract ]
We extend our VLDB09 paper to continuous distributions. We give exact algorithm for polynomial PDFs and approximation algorithms for arbitrary continuous distributions using techniques in approximation theory, such as splines, quadratures, with convergence guarantees better than naive Monte Carlo.
Densest $k$-Subgraph Approximation on Intersection Graphs. Danny Z. Chen, Rudolf Fleischer, Jian Li. In the 8th Workshop on Approximation and Online Algorithms (WAOA 2010). [Paper][slides] [ Show Abstract ]
We provide constant factor approximation algorithm for the densest k-subgraph problem on several graph classes, such as chordal graphs, disk graphs etc.
New Models and Algorithms for Throughput Maximization in Broadcast Scheduling. Chandra Chekuri, Avigdor Gal, Sungjin Im, Samir Khuller, Jian Li, Richard McCutchen, Benjamin Moseley, Louiqa Raschid. In the 8th Workshop on Approximation and Online Algorithms (WAOA 2010). [slides] [full version]
When LP is the Cure for Your Matching Woes: Improved Bounds for Stochastic Matchings. Nikhil Bansal, Anupam Gupta, Jian Li, Julian Mestre, Viswanath Nagarajan, Atri Rudra. In the 18th Annual European Symposium on Algorithms (ESA 2010). (Best Paper Award) [Paper] [Slides] Journal version in Algorithimca, 2011.[Journal Version] [ Show Abstract ]
We study the stochastic matching problem, which finds applications in kidney exchange, online dating and online ads. Consider a random graph model where each possible edge e is present independently with some probability pe. We are given these numbers pe, and want to build a large/heavy matching in the randomly generated graph. However, the only way we can find out whether an edge is present or not is to query it, and if the edge is indeed present in the graph, we are forced to add it to our matching. Further, each vertex i is allowed to be queried at most ti times. How should we adaptively query the edges to maximize the expected weight of the matching? Our main result is the first constant approximation for the weighted stochastic matching.
Clustering with Diversity. Jian Li, Ke Yi, Qin Zhang. In the 37th International Colloquium on Automata, Languages and Programming (ICALP 2010),July 5-10, 2010. [full version in arXiv] [ Show Abstract ]
We consider the clustering with diversity problem: given a set of colored points in a metric space, partition them into clusters such that each cluster has at least ell points, all of which have distinct colors. We give a 2-approximation to this problem for any ell when the objective is to minimize the maximum radius of any cluster. We show that the approximation ratio is optimal unless P\ne NP by providing a matching lower bound. Several extensions to our algorithm have also been developed for handling outliers. This problem can be considered as a metric variant of the l-diversity problem, a popular problem for privacy-preserving data publication
On Computing Compression Trees for Data Collection in Wireless Sensor Networks. Jian Li, Amol Deshpande and Samir Khuller. In the 29th Conference on Computer Communications (INFOCOM 2010), San Diego, USA, 2010 [Paper] [slides]
Energy Efficient Scheduling via Partial Shutdown. Samir Khuller, Jian Li, Barna Saha. In the ACM-SIAM Symposium on Discrete Algorithms (SODA 2010), Austin, USA, 2010. [Paper] [ Show Abstract ]
The central framework we introduce considers a collection of m machines (unrelated or related) with each machine i having an activation cost of ai. There is also a collection of n jobs that need to be performed, and pi,j is the processing time of job j on machine i. Standard scheduling models assume that the set of machines is fixed and all machines are available. However, in our setting, we assume that there is an activation cost budget of A â€?we would like to select a subset S of the machines to activate with total cost a(S) A and find a schedule for the n jobs on the machines in S minimizing the makespan (or any other metric).
A Unified Approach to Ranking in Probabilistic Databases. Jian Li, Barna Saha and Amol Deshpande. In the 35th International Conference on Very Large Data Bases (VLDB 2009), Lyon, France, 2009. (Best Paper Award) [Paper] [Slides long short] Journal version: The VLDB Journal, 2011. [Journal Version] [ Show Abstract ]
We propose two parameterized ranking functions, called PRF and PRFe, for ranking probabilistic datasets. PRF and PRFe generalize or can approximate many of the previously proposed ranking functions. We present novel generating functions-based algorithms for efficiently ranking large datasets according to these ranking functions
Consensus Answers for Queries over Probabilistic Databases. Jian Li and Amol Deshpande. In the 28th ACM Symposium on Principles of Database Systems (PODS 2009). Providence, USA, 2009 [Paper][slides] [ Show Abstract ]
We address the problem of finding a â€œbestâ€?deterministic query answer to a query over a probabilistic database. We propose the notion of a consensus world (or a consensus answer) which is a deterministic world (answer) that minimizes the expected distance to the possible worlds (answers). We consider this problem for various types of queries including SPJ queries, Top-k ranking queries, group-by aggregate queries, and clustering. For different distance metrics, we obtain polynomial time optimal or approximation algorithms for computing the consensus answers (or prove NP-hardness).
Minimizing communication cost in distributed multi-query processing. Jian Li, Amol Deshpande and Samir Khuller. In International Conference on Data Engineering (ICDE2009), Shanghai, China, 2009 [Paper] [slides] [ Show Abstract ]
We are also given a set of queries, Q1.....Qm, with the query Qi requiring access to a subset of relations distributed in different nodes of the network. For each query, a query plan is provided, in the form of a rooted tree, which specifies the operations to be performed on the relations and the order in which to perform them. Given this input, our goal is to find a data movement plan that minimizes the total communication cost incurred while executing the queries. We provide efficient exact algorithm when the communication network is a tree and approximation algorithm for more general networks.
An $O({logn\over loglogn})$ Upper Bound on the Price of Stability for Undirected Shapley Network Design Games. Jian Li.In Information Processing Letter (IPL). 2009. [Paper][slides] [ Show Abstract ]
We have an edge weighted undirected network G(V, E) and n selfish players where player i wants to choose a low cost path from source vertex si to destination vertex ti . The cost of each edge is equally split among players who pass it. The price of stability is defined as the ratio of the cost of the best Nash equilibrium to that of the optimal solution. We present an O(logn/ log logn) upper bound on price of stability for the single sink case, i.e., ti =t for all i.
More Efficient Algorithms and Analyses for Unequal Letter Cost Prefix-Free Coding. Mordecai Golin, Jian Li. In Proceedings of the 18th International Symposium on Algorithms and Computation (ISAAC 2007), Sendai, Japan, 2007. Journal version In IEEE Transactions on Information Theory, Volume 54, Issue 8, Aug. Page(s):3412 - 3424, 2008 [Journal Version] [ Show Abstract ]
There is a large literature
devoted to the problem of
finding an optimal (min-cost) prefix-free
code with an unequal
letter-cost encoding alphabet of size. While
there is no known
polynomial time algorithm for solving it
optimally, there are many
good heuristics that all provide additive
errors to optimal. The additive error in these algorithms usually
depends linearly upon
the largest encoding letter size.
This paper was motivated by the problem of
finding optimal codes
when the encoding alphabet is infinite. Because
the largest letter
cost is infinite, the previous analyses could
give infinite error
bounds. We provide a new algorithm that works
with infinite encoding
alphabets. When restricted to the finite
alphabet case, our
algorithm often provides better error bounds
than the best previous
ones known.
Approximating the Maximum Sharing Problem. Amitabh Chaudhary, Danny Z. Chen, Rudolf Fleischer, Jian Li, Xiaobo S. Hu, Michael T. Niemier, Zhiyi Xie, Hong Zhu. In Proceedings of 10th Workshop on Algorithms and Data Structures(WADS 2007) Halifax, Canada, August 15-17, 2007. [Paper]
Algorithms for core stability, core largeness, exactness, and extendability of flow games. Qizhi Fang, R. Fleischer, Jian Li, and Xiaoxun Sun. In Proceedings of the 13th Annual International Computing and Combinatorics Conference (COCOON 2007), Banff, Canada, July 16-19, 2007. Journal version in Frontier of Math in China, vol 5(1), pp.47-63, 2010. [Journal Version](This journal paper combines a previous paper on this topic and our cocoon07 paper. We fixed some bugs. Now, the thms and pfs should be correct.) [ Show Abstract ]
We study core stability and some related properties of flow games defined on simple networks (all edge capacities are equal) from an algorithmic point of view. We first present a sufficient and necessary condition that can be tested efficiently for a simple flow game to have a stable core. We also prove the equivalence of the properties of core largeness, extendability, and exactness of simple flow games and provide an equivalent graph theoretic characterization which allows us to decide these properties in polynomial time.
Efficient Algorithms for $k$-Disjoint Paths Problems on DAGs. Rudolf Fleischer, Qi Ge, Jian Li, Hong Zhu. In Proceedings of the 3nd International Symposium on Algorithmic Aspects in Information and Management (AAIM 2007), Portland, USA, 2007.[Paper]
On approximating the maximum simple sharing problem. Danny Z. Chen, R. Fleischer, Jian Li, Zhiyi Xie, and Hong Zhu. In Proceedings of the 17th International Symposium on Algorithms and Computation (ISAAC 2006), Kolkata, December 18-20, 2006. [slides][Paper]
Traversing the machining graph. Danny Z. Chen, R. Fleischer, Jian Li, Haitao Wang, and Hong Zhu. In Proceedings of the 14th Annual International Symposium on Algorithms (ESA 2006), Zurich, Switzerland, Sep 11-13, 2006. Springer LNCS. [full version] [ Show Abstract ]
We obtain the first poly-time exact algorithm for traversing the machining graph in simple pockets, which improves upon a previous constant approximation. We also present a PTAS for general pockets with holes.
Non-metric multicommodity and multilevel facility location. R. Fleischer, Jian Li, Shijun Tian, and Hong Zhu. In Proceedings of the 2nd International Symposium on Algorithmic Aspects in Information and Management (AAIM 2006), Hong Kong, Jun 20-22, 2006. Springer LNCS 4041, 2006, pp. 138-148. [Paper]
Approximating spanning trees with inner nodes cost. R. Fleischer, Qi Ge, Jian Li, Shijun Tian, and Haitao Wang. In Proceedings of the 6th International Conference on Parallel and Distributed Computing (PDCAT 2005), Dalian, China, Dec 5-8, 2005, pp. 600-664. [full version] [ Show Abstract ]
We study the problem of finding a minimum spanning tree with both edge weights and inner node (non-leaf node) weights.