Efficient Algorithms for Querying Noisy Distributed/Streaming Datasets


PI : Qin Zhang
NSF CCF-1525024 (June 2015 - May 2018)


Abstract

This project aims to study the design of efficient query algorithms for noisy datasets in distributed and streaming applications. Noisy data is universal in today's world. Imprecise and varying references to the same real-world entities are ubiquitous in scientific and commercial databases. This noise poses significant obstructions to accurate data analytics. As an example of "noisy data," consider YouTube videos. YouTube tracks the views of individual videos. However, there are frequently many similar versions of the same event and answering a basic question such as "How many people viewed this event?" is challenging using current techniques. This project will provide new techniques and insights to combat the noisy nature of large datasets, and hence will enhance our ability to process the ever-increasing quantity of business and scientific data. The products of this project will be integrated into a trilogy of graduate and undergraduate courses on algorithms, databases, and data mining. The PI will disseminate research outcomes by giving talks at conferences/workshops, universities, industrial labs, as well as online media.

More technically, this project tries to answer the following question: can we run distributed and streaming algorithms directly on the noisy datasets, resolve the noise "on the fly", and retain communication and space efficiency compared with the noise-free setting? The PI plans to study statistical, relational and graph problems. This project has the potential to impact a wide range of active research areas in theoretical computer science, including distributed and streaming algorithms, group testing, compressed sensing, communication complexity, clustering, and locality sensitive hashing.


Publications

  1. Distinct Sampling on Streaming Data with Near-Duplicates
    with J. Chen
    Proc. ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 18), to appear. Houston, TX, U.S.A., June 2018.

  2. Distributed Statistical Estimation of Matrix Products with Applications
    with D. P. Woodruff
    Proc. ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 18), to appear. Houston, TX, U.S.A., June 2018.

  3. EmbedJoin: Efficient Edit Similarity Joins via Embeddings
    with H. Zhang
    Proc. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 17) (oral presentation), pages 585-594. Halifax, Nova Scotia, Canada, August 2017.

  4. Distributed Partial Clustering (preliminary full version, 19 pages) [conf. talk]
    with S. Guha and Y. Li
    Proc. ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 17), pages 143-152. Washington D.C., U.S.A., July 2017.
    Best Paper Award
    Invited to special issue for SPAA 2017 papers in ACM Transactions on Parallel Computing (TOPC)

  5. Bias-Aware Sketches
    with J. Chen
    Proc. International Conference on Very Large Databases (VLDB 17), pages 961-972. Munich, Germany, August-September 2017.

  6. Communication-Optimal Distributed Clustering
    with J. Chen, H. Sun and D. Woodruff
    Proc. Annual Conference on Neural Information Processing Systems (NIPS 16), pages 3720-3728. Barcelona, Spain, December 2016.

  7. Edit Distance: Sketching, Streaming and Document Exchange (preliminary full version, 30 pages) [conf. talk]
    with D. Belazzougui
    Proc. IEEE Symposium on Foundations of Computer Science (FOCS 16), pages 51-60. New Brunswick, NJ, U.S.A., October, 2016.

  8. Streaming Algorithms for Robust Distinct Elements
    with D. Chen
    Proc. ACM SIGMOD International Conference on Management of Data (SIGMOD 16), pages 1433-1447. San Francisco, CA, U.S.A., June 2016.

  9. On Sketching Quadratic Forms (preliminary full version, 47 pages)
    with A. Andoni, J. Chen, R. Krauthgamer, B. Qin and D. P. Woodruff
    Proc. Innovations in Theoretical Computer Science (ITCS 16), pages 311-319. Cambridge, MA, USA, January 2016.

  10. Lower Bounds for Number-in-Hand Multiparty Communication Complexity, Made Easy
    with J. M. Phillips, E. Verbin
    SIAM Journal of Computing (SICOMP), volume 45, issue 1, pages 174-196, February 2016 [Link].

  11. Communication-Efficient Computation on Distributed Noisy Datasets [conf. talk]
    Q. Zhang
    Proc. ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 15), pages 313-322. Portland, Oregon, U.S.A., June 2015.

Invited/Conference Talks


Educational and Other Development

A Trilogy of Courses: Code libraries for streaming/distributed algorithms:

Personnel