PI : Qin Zhang
NSF CCF-1525024 (June 2015 - May 2019)
This project aims to study the design of efficient query algorithms for noisy datasets in distributed and streaming applications. Noisy data is universal in today's world. Imprecise and varying references to the same real-world entities are ubiquitous in scientific and commercial databases. This noise poses significant obstructions to accurate data analytics. As an example of "noisy data," consider YouTube videos. YouTube tracks the views of individual videos. However, there are frequently many similar versions of the same event and answering a basic question such as "How many people viewed this event?" is challenging using current techniques. This project will provide new techniques and insights to combat the noisy nature of large datasets, and hence will enhance our ability to process the ever-increasing quantity of business and scientific data. The products of this project will be integrated into a trilogy of graduate and undergraduate courses on algorithms, databases, and data mining. The PI will disseminate research outcomes by giving talks at conferences/workshops, universities, industrial labs, as well as online media.
More technically, this project tries to answer the following question: can we run distributed and streaming algorithms directly on the noisy datasets, resolve the noise "on the fly", and retain communication and space efficiency compared with the noise-free setting? The PI plans to study statistical, relational and graph problems. This project has the potential to impact a wide range of active research areas in theoretical computer science, including distributed and streaming algorithms, group testing, compressed sensing, communication complexity, clustering, and locality sensitive hashing.
A Practical Algorithm for Distributed Clustering and Outlier Detection
with J. Chen and E. Sadeqi-Azer
Proc. Annual Conference on Neural Information Processing Systems (NeurIPS 18), pages 2253-2262. Montreal, Canada, December 2018.
Smooth q-Gram, and Its Applications to Detection of Overlaps among Long, Error-Prone Sequencing Reads
H. Zhang, Q. Zhang and H. Tang (by contribution)
Proc. ACM International Conference on Information and Knowledge Management (CIKM 18), pages 267-276. Turin, Italy, October 2018.
Distinct Sampling on Streaming Data with Near-Duplicates
with J. Chen
Proc. ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 18), pages 369-382. Houston, TX, U.S.A., June 2018.
Distributed Statistical Estimation of Matrix Products with Applications
with D. P. Woodruff
Proc. ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 18), pages 383-394. Houston, TX, U.S.A., June 2018.
EmbedJoin: Efficient Edit Similarity Joins via Embeddings
with H. Zhang
Proc. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 17) (oral presentation), pages 585-594. Halifax, Nova Scotia, Canada, August 2017.
Distributed Partial Clustering (preliminary full version, 19 pages) [conf. talk]
with S. Guha and Y. Li
Proc. ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 17), pages 143-152. Washington D.C., U.S.A., July 2017.
Best Paper Award
Invited to special issue for SPAA 2017 papers in ACM Transactions on Parallel Computing (TOPC)
with J. Chen
Proc. International Conference on Very Large Databases (VLDB 17), pages 961-972. Munich, Germany, August-September 2017.
Communication-Optimal Distributed Clustering
with J. Chen, H. Sun and D. Woodruff
Proc. Annual Conference on Neural Information Processing Systems (NIPS 16), pages 3720-3728. Barcelona, Spain, December 2016.
Edit Distance: Sketching, Streaming and Document Exchange (preliminary full version, 30 pages) [conf. talk]
with D. Belazzougui
Proc. IEEE Symposium on Foundations of Computer Science (FOCS 16), pages 51-60. New Brunswick, NJ, U.S.A., October, 2016.
Streaming Algorithms for Robust Distinct Elements
with D. Chen
Proc. ACM SIGMOD International Conference on Management of Data (SIGMOD 16), pages 1433-1447. San Francisco, CA, U.S.A., June 2016.
On Sketching Quadratic Forms (preliminary full version, 47 pages)
with A. Andoni, J. Chen, R. Krauthgamer, B. Qin and D. P. Woodruff
Proc. Innovations in Theoretical Computer Science (ITCS 16), pages 311-319. Cambridge, MA, USA, January 2016.
Lower Bounds for Number-in-Hand Multiparty Communication Complexity, Made Easy
with J. M. Phillips, E. Verbin
SIAM Journal of Computing (SICOMP), volume 45, issue 1, pages 174-196, February 2016
Communication-Efficient Computation on Distributed Noisy Datasets [conf. talk]
Proc. ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 15), pages 313-322. Portland, Oregon, U.S.A., June 2015.
- Efficient Algorithms for Streaming Datasets with Near-Duplicates
At Theory and Applications of Hashing , Dagstuhl Seminar, Germany. May 2017.
- Some New Questions in Communication Complexity
At Communication Complexity and Applications, II , Banff, Alberta, Canada. March 2017.
- The Communication Complexity of Distributed Set-Joins
At Computer Science Colloquium, University of Houston, Houston, TX, USA. November 2016.
- Edit Distance: Sketching, Streaming and Document Exchange
At Aarhus University , Aarhus, Denmark. September 2016.
And at FOCS 16, New Brunswick, NJ, U.S.A., October, 2016.
- Lower Bound Techniques for Multiparty Communication Complexity
At Nexus of Information and Computation Theories , Paris, France. February 2016.
- Streaming Algorithms for Robust Distinct Elements
At Workshop on Multi-dimensional Proximity Problems, University of Maryland, College Park, MD, USA. January 2016.
- Communication-Efficient Computation on Distributed Noisy Datasets.
At SPAA 15, Portland, OR, USA. June, 2015
And at Sublinear AlgorithmsWorkshop, Johns Hopkins University, Baltimore, MD, USA. January 2016.
Educational and Other Development
A Trilogy of Courses:
Code libraries for streaming/distributed algorithms: