Messages

  • July. 5, 2018: Webpage created

Course summary

In this course we will talk about sublinear algorithms, which has its roots in the study of Big Data that occur more and more frequently in various applications, e.g., analyses of financial transactions, internet traffic, social networks, genome sequences, etc. Concretely, we will talk about:
1. Sublinear space algorithms. In particular, data stream algorithms, namely, algorithms that solve a problem by making one pass over the data set while using small memory. These algorithms are important in many application areas such as databases and networking, where data arrives at a high speed and there is no time and/or need to store it for offline processing.
2. Sublinear time algorithms, that is, algorithms that do not even read the whole input when outputting the answers.
3. Sublinear communication algorithms. The data is stored in multiple machines, who want to jointly compute functions defined on the union of the data sets via communication.
4. Random topics.
Participants are expected to have a good background in algorithm design and probability, and have good programming skills.
The evaluation will be based on homework assignments and individual project/presentation. The list of questions will be handed out in the middle of the course.
Detailed list of topics is available in the course plan below.

Lecturer

Qin Zhang
Email: qzhangcs@indiana.edu
Office hours: Wed. 4-5pm at Luddy 3128

Associate Instructor:
TBD
Email: TBD
Office hours: TBD

Time and place

2:30pm - 3:45pm Monday/Wednesday
Luddy Hall 002.

Textbooks

  • There is no textbook for the class.
    Lectures are based on this notes by Chakrabarti, and recent papers.

  • Background on Randomized Algorithms:
    • [MR] Randomized Algorithms by Motwani and Raghavan
    • [MU] Probability and Computing by Mitzenmacher and Upfal
  • Similar courses:

    Course schedule

     Week   Date   Section   Content   Literature   Slides   Comments 
      1   Aug. 20     Part 0:  
      Introduction  
      New models for Big Data     slides
      1   Aug. 22         New models for Big Data (cont.)     same
      2   Aug. 27     Interesting problems     slides    
      2   Aug. 29     Basic probabilistic tools   slides   Read Chapter 3, 4 of [MU] 
      3   Sep. 3   Labor day. No class
      3   Sep. 5   Part 1:  
      Sublinear in space  
      Poiny Queries       slides   Read Lecture 1 of these notes 
      4   Sep. 10       Misra-Gries, Space-Saving     [MAA]   same   Space-saving in this paper 
      4   Sep. 12       Space-Saving, FM-sketch     [FM]   same   Read Lecture 2 of these notes 
      5   Sep. 17     Improvement on FM sketch   [BYJKST]     same   Read Section 3 of these notes
      5   Sep. 19   Linear sketch     same   Read these slides
      6   Sep. 24   Linear sketch       same   Read these slides
      6   Sep. 26   Count-Min, Count-sketch   [CM] [CCF]     same   Read Section 4 of these notes
      7   Oct. 1   Alternative for L2 point query   [GKMS]   same   Read these slides
      7   Oct. 3     L0 sampling   [JST]   same   Read these slides
      8   Oct. 8     Presentation I        

    Grading

    The final grade will be curved.

    • Assignments 40%

      There will be two homework assignments. Assignments will be posted in the middle of the course. The answers should be typeset in LaTeX and submitted via Canvas; here is a template to start with.

    • Projects 60%

      There are two programming assignments. One for topics in Part 1 of this course, and one for topics in Part 2 and Part 3. For each assignment one needs to pick a problem and implement relevant algorithms taught in class, and then make a presentation in class. At the end of this course, one needs to pick one of the two assignments and write a report.

      Each programming assignment + presentation counts 20% of the grade. The final report counts another 20%.

      Here is a list of datasets that you can use:
      Stanford Large Network Dataset Collection
      US Census
      Data dumps
      UCI datasets
      SuiteSparse Matrix Collection
      Yahoo Webscope Program
      StatLib

    Course policies

    For assignments, students may discuss answers with anyone, including problem approaches and proofs. But all students must write their own proofs, and write-ups. The names of all people that you have talked to should be listed at the beginning of the first page. If a solution comes from existing papers/web/books, they must be properly cited, and you must write the solution in a way that demonstrates your understanding (simply copying the solution will be considered as plagiarism). All deadlines are firm. No late assignments will be accepted unless there are legitimate circumstances.
    For projects, you may discuss your project with anyone as well, but if this contributes to your final product, they must be acknowledged. Any outside materials used must be referenced appropriately.
    For more details, see Indiana University Code of Student Rights, Responsibilities, and Conduct.

    Prerequisites

      One is expected to know basics on algorithm design and analysis as well as probability. E.g., have taken B403 ``Introduction to Algorithm Design and Analysis" or equivalent courses.