Computer Science Department,
Indiana University Bloomington,
Luddy Hall, RM 3044,
700 North Woodlawn Avenue,
Bloomington, IN 47408-3901, USA
Before joining IU, I spent a couple of great years at Theory Group, IBM Almaden Research Center,
and Center for Massive Data Algorithmics, Aarhus University.
I obtained my PhD at Department of Computer Science and Engineering, HKUST.
[Home] [CV] [Publication] [Activities]
We are recruiting! For prospective students: If you are interested in working with me as a PhD student, in Algorithms, Databases, and/or Machine Learning,
please send an email to firstname.lastname@example.org. Fully funded RA positions available.
Internship for international undergraduate students. Please contact me first by email.
I am organizing Theory Seminar. If you wish to give a talk at our seminar, or to join our mailing list please contact me.
The 67th Midwest Theory Day at IU Bloomington (April 15-16, 2017).
Theoretical Foundations for Big Data: communication-efficient distributed computation/monitoring, streaming/sketching algorithms, lower bounds.
Applied Areas: algorithms for fundamental problems in databases, data mining, machine learning and bioinformatics.
- Program Committees (selected):
- Yan Song (PhD candidate since 2018)
- Boli Fang (PhD candidate since 2017; co-advising with Prof. David Crandall)
- Nikolai Karpov (PhD candidate since 2017; co-advising with Prof. Cenk Sahinalp)
- Chao Tao (PhD candidate since 2015; co-advising with Prof. Yuan Zhou)
- Haoyu Zhang (PhD candidate since 2015)
- Jiecao Chen (PhD 2019; first employment Google AI)
Some Papers [ Full List ][ DBLP ]
Collaborative Learning with Limited Interaction:
Tight Bounds for Distributed Exploration in Multi-Armed Bandits (preliminary full version, 37 pages) [conf. talk]
with C. Tao and Y. Zhou
Proc. IEEE Symposium on Foundations of Computer Science (FOCS 19), to appear. Baltimore, MD, U.S.A., November 2019.
- We conduct a systematic study of the best arm identification problem in the setting of collaborative learning with limited interaction.
- We obtain almost tight round-speedup tradeoffs for both fixed-time and fixed-confidence settings, and show a complexity separation for the two variants.
- We develop two new techniques for proving round lower bounds for multi-agent collaborative learning.
Distributed Partial Clustering (preliminary full version, 20 pages) [conf. talk]
with S. Guha and Y. Li
Proc. ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 17), pages 143-152. Washington D.C., U.S.A., July 2017.
Invited to special issue for SPAA 2017 papers in ACM Transactions on Parallel Computing (TOPC)
Best Paper Award
- We give the first systematic study of communication/round-efficient distributed partial clustering, by providing almost tight bounds for partial k-center, k-median, and k-means in both deterministic and stochastic settings.
- As a byproduct, we develop the first algorithms for the partial k-median and k-means objectives that run in subquadratic time.
This follow-up work (NIPS 18) gives a more practical algorithm for distributed clustering with outliers.
Edit Distance: Sketching, Streaming and Document Exchange (preliminary full version, 30 pages) [conf. talk]
with D. Belazzougui
Proc. IEEE Symposium on Foundations of Computer Science (FOCS 16), pages 51-60. New Brunswick, NJ, U.S.A., October, 2016.
- We achieve nearly optimal communication with almost linear time recovery for document exchange.
- We obtain the first exact sketching and streaming algorithms for edit distance with sublinear size/space under small distance thresholds.
- Our technique of aligning strings using multiple random walks may be of independent interest.
Subspace Embeddings and Lp Regression Using Exponential Random Variables (preliminary full version, 27 pages) [conf. talk]
with D. P. Woodruff
Proc. Conference on Learning Theory (COLT 13), pages 546-567. Princeton, NJ, U.S.A., June 2013.
- We propose a unified algorithmic framework that improves the running time of Lp-regression for all p = [1, ∞)\2.
Tight Bounds for Distributed Functional Monitoring (preliminary full version, 50 pages) [conf. talk]
with D. P. Woodruff
Proc. ACM Symposium on Theory of Computing (STOC 12), pages 941-960. New York, NY, U.S.A., May 2012.
- We resolve several fundamental questions in the area of distributed functional monitoring (a.k.a. distributed streaming).
- Surprisingly, the total communication required to keep monitoring a function is often similar to the corresponding one-shot computation!
- A new technique called "composition" is proposed for randomized multiparty communication complexity.
This follow-up work (SODA 14) resolves the communication complexity of distinct elements in all parameters.
Lower Bounds for Number-in-Hand Multiparty Communication Complexity, Made Easy (preliminary full version, 22 pages) [conf. talk]
with J. M. Phillips and E. Verbin
Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA 12), pages 486-501. Kyoto, Japan, January 2012.
Journal version in SIAM Journal of Computing (SICOMP), volume 45, issue 1, pages 174-196, February 2016
- A new technique called "symmetrization" is proposed for randomized multiparty communication complexity.
Optimal Sampling From Distributed Streams (preliminary full version, 25 pages) [talk]
with G. Cormode, S. Muthukrishnan and K. Yi
Proc. ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 10), pages 77-86. Indianapolis, IN, U.S.A., June 2010.
Invited to Journal of the ACM (JACM), volume 59, issue 2, pages 10:1-10:25, April 2012 [Link].
- In the distributed streaming model, random sampling is even easier than counting.
The Limits of Buffering: A Tight Lower Bound for Dynamic Membership in the External Memory Model (preliminary full version, 18 pages) [talk]
with E. Verbin
Proc. ACM Symposium on Theory of Computing (STOC 10), pages 447-456. Cambridge, MA, U.S.A., June 2010.
Journal version in SIAM Journal of Computing (SICOMP), volume 42, issue 1, pages 212-229, January 2013
Note: In all papers above, authors are ordered alphabetically.
- We resolve several fundamental problems in the area of external memory data structures.
- For many basic problems, buffering is impossible to achieve in the external memory model with sublogarithmic query time.
- The external memory model is clearner than the RAM model in certain perspectives.