Computer Science Department,
Indiana University Bloomington,
Luddy Hall, RM 3044,
700 North Woodlawn Avenue,
Bloomington, IN 47408-3901, USA
Before joining IU, I spent a couple of great years at Theory Group, IBM Almaden Research Center,
and Center for Massive Data Algorithmics, Aarhus University.
I obtained my PhD at Department of Computer Science and Engineering, HKUST.
[Home] [CV] [Publication] [Activities]
Our Algorithm Group is recruiting!
For prospective students: If you are interested in working with me as a PhD student,
please send an email to email@example.com. See here for the recruit advertisement.
I am organizing Theory Seminar. If you wish to give a talk at our seminar, or to join our mailing list please contact me.
The 67th Midwest Theory Day at IU Bloomington (April 15-16, 2017).
Theoretical Foundations for Big Data: communication-efficient distributed computation/monitoring, streaming/sketching algorithms, lower bounds.
Applied Areas: algorithms for fundamental problems in databases, data mining, machine learning and bioinformatics.
- Fall 2018: B669 Sublinear Algorithms for Big Data, at IUB
- Spring 2018: B403 Introduction to Algorithm Design and Analysis, at IUB
- Fall 2017: B669 Sublinear Algorithms for Big Data, at IUB
- Spring 2017: B403 Introduction to Algorithm Design and Analysis, at IUB
- Fall 2015: B561 Advanced Database Concepts, at IUB
- Fall 2014: B561 Advanced Database Concepts, at IUB
- Spring 2014: B490 Mining the Big Data, at IUB
- Fall 2013: B669 Sublinear Algorithms for Big Data, at IUB
- Summer 2012: Linear Sketch and Its Applications in Data Streams and Compressive Sensing, at HKUST
- Spring 2011, Q4: Streaming Algorithms, at Aarhus University
- Program Committee:
NIPS 2016 ,
Some Papers [ Full List ][ DBLP ]
Collaborative Learning with Limited Interaction:
Tight Bounds for Distributed Exploration in Multi-Armed Bandits (preliminary full version, 34 pages)
with C. Tao and Y. Zhou
- We give a systematic study of the best arm identification problem in the setting of collaborative learning with limited interaction.
- We obtain almost tight round-speedup tradeoffs for both fixed-time and fixed-confidence settings, and show a complexity separation for the two variants.
- We develop two new techniques for proving round lower bounds for multi-agent collaborative learning.
Distributed Partial Clustering (preliminary full version, 20 pages) [conf. talk]
with S. Guha and Y. Li
Proc. ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 17), pages 143-152. Washington D.C., U.S.A., July 2017.
Invited to special issue for SPAA 2017 papers in ACM Transactions on Parallel Computing (TOPC)
Best Paper Award
- We give the first systematic study of communication/round-efficient distributed partial clustering, by providing almost tight bounds for partial k-center, k-median, and k-means in both deterministic and stochastic settings.
- As a byproduct, we develop the first algorithms for the partial k-median and k-means objectives that run in subquadratic time.
This follow-up work (NIPS 18) gives a more practical algorithm for distributed clustering with outliers.
Edit Distance: Sketching, Streaming and Document Exchange (preliminary full version, 30 pages) [conf. talk]
with D. Belazzougui
Proc. IEEE Symposium on Foundations of Computer Science (FOCS 16), pages 51-60. New Brunswick, NJ, U.S.A., October, 2016.
- We achieve nearly optimal communication with almost linear time recovery for document exchange.
- We obtain the first exact sketching and streaming algorithms for edit distance with sublinear size/space under small distance thresholds.
- Our technique of aligning strings using multiple random walks may be of independent interest.
Communication-Efficient Computation on Distributed Noisy Datasets [conf. talk]
Proc. ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 15), pages 313-322. Portland, Oregon, U.S.A., June 2015.
- We give a first theoretical investigation to the general question: How can we process distributed noisy datasets communication-efficiently?.
This paper (SIGMOD 16) and this paper (PODS 18) extend the study to the data stream model.
The Communication Complexity of Distributed Set-Joins with Applications to Matrix Multiplication
with D. Van Gucht, R. Williams and D. P. Woodruff
Proc. ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 15), pages 199-212. Melbourne, VIC, Australia, May-June 2015.
- We settle the communication complexity of a set of basic distributed set-join operations.
- Our new algorithmic techniques also lead to an improved algorithm for computing transitive closure of a directed graph in terms of running time.
Subspace Embeddings and Lp Regression Using Exponential Random Variables (preliminary full version, 27 pages) [conf. talk]
with D. P. Woodruff
Proc. Conference on Learning Theory (COLT 13), pages 546-567. Princeton, NJ, U.S.A., June 2013.
- We propose a unified algorithmic framework that improves the running time of Lp-regression for all p = [1, ∞)\2.
Tight Bounds for Distributed Functional Monitoring (preliminary full version, 50 pages) [conf. talk]
with D. P. Woodruff
Proc. ACM Symposium on Theory of Computing (STOC 12), pages 941-960. New York, NY, U.S.A., May 2012.
- We resolve several fundamental questions in the area of distributed functional monitoring (a.k.a. distributed streaming).
- Surprisingly, the total communication required to keep monitoring a function is often similar to the corresponding one-shot computation!
- A new technique called "composition" is proposed for randomized multiparty communication complexity.
This follow-up work (SODA 14) resolves the communication complexity of distinct elements in all parameters.
Lower Bounds for Number-in-Hand Multiparty Communication Complexity, Made Easy (preliminary full version, 22 pages) [conf. talk]
with J. M. Phillips and E. Verbin
Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA 12), pages 486-501. Kyoto, Japan, January 2012.
Journal version in SIAM Journal of Computing (SICOMP), volume 45, issue 1, pages 174-196, February 2016
- A new technique called "symmetrization" is proposed for randomized multiparty communication complexity.
Optimal Sampling From Distributed Streams (preliminary full version, 25 pages) [talk]
with G. Cormode, S. Muthukrishnan and K. Yi
Proc. ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 10), pages 77-86. Indianapolis, IN, U.S.A., June 2010.
Invited to Journal of the ACM (JACM), volume 59, issue 2, pages 10:1-10:25, April 2012 [Link].
- In the distributed streaming model, random sampling is even easier than counting.
The Limits of Buffering: A Tight Lower Bound for Dynamic Membership in the External Memory Model (preliminary full version, 18 pages) [talk]
with E. Verbin
Proc. ACM Symposium on Theory of Computing (STOC 10), pages 447-456. Cambridge, MA, U.S.A., June 2010.
Journal version in SIAM Journal of Computing (SICOMP), volume 42, issue 1, pages 212-229, January 2013
- We resolve several fundamental problems in the area of external memory data structures.
- For many basic problems, buffering is impossible to achieve in the external memory model with sublogarithmic query time.
- The external memory model is clearner than the RAM model in certain perspectives.
Multi-Dimensional Online Tracking (preliminary full version, 18 pages) [conf. talk]
with K. Yi
Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA 09), pages 1098-1107. New York, NY, U.S.A., January 2009.
Journal version in ACM Transactions on Algorithms (TALG), volume 8, issue 2, pages 12:1-12:16, April 2012 [Link].
Note: In all papers above, authors are ordered alphabetically.
- A new and very natural class of online problems is studied.