In this course we will discuss stateoftheart algorithmic tools for data mining, including:
1. Finding Similar items (e.g., Locality Sensitive Hashing)
2. Mining Frequent items (e.g., CountMin Sketch)
3. Clustering (e.g., Spectral Clustering)
4. Link Analysis (e.g., PageRank)
We will also discuss how to implement solutions in modern computational models for Big Data, including:
1. Data streams
2. MapReduce
3. ActiveDHT (DHT: distributed hash table)
4. Distributed monitoring*
Detailed list of topics is available in the course schedule below.
This is an advanced undergraduate course. Participants are expected to have a background in basic probability, bigO analysis, and programming (no specific language requirement).
The evaluation will be based on a set of exercises and individual project/presentation.
Qin Zhang
Email: qzhangcs@indiana.edu
Office hours: By email appointment
AI: Prasanth Velamala
Email: prasvela@umail.iu.edu
Office hours: Thursdays, 2pm3pm, LH112
4:00PM5:15PM Mon, Wed
Lindley Hall, Room 008.
Week  Date  Section  Content  Slides  Comments 

1  Jan. 13  0. Introduction  Big Data, Data Mining, Problems and Computational Models, Course Plan 
slides  
1  Jan. 15  Basic Probability  same  A note  
2  Jan. 22  Basic Probability (cont.)  same  same  
3  Jan. 27  1. Finding Similar Items  Jaccard Similarty MinHashing 
slides  Phillips' note1 note2 
3  Jan. 29  Locality Sensitive Hashing (LSH)  same  Phillips' note  
4  Feb. 3  LSH (cont.)  same  
4  Feb. 5  LSH (cont.)  same  
5  Feb. 10  Distance Functions  same  Phillips' note  
5  Feb. 12  2. Clustering  Hierachical Clustering  slides  Phillips' note 
6  Feb. 17  Assignmentbased Clustering: kcenter  same  Some proofs  
6  Feb. 19  kmean  same  Phillips' note  
7  Feb. 24  kmedian  same  
7  Feb. 26  Spectural Clustering  same  Phillips' note  
8  Mar. 3  Clustering Social Networks  same  
8  Mar. 5  3. Mining Frequent Items  Frequent Items in Data Stream MisraGries 
slides  
9  Mar. 10  Spacesaving  same  
9  Mar. 12  CountMin Compressive Sensing 
same  
10  Mar. 17  Spring break, no class  
10  Mar. 19  Spring break, no class  
11  Mar. 24  Frequent Itemsets  same  Phillips' note  
11  Mar. 26  4. Link Analysis  Markov Chain Basics  slides  Phillips' note 
12  Mar. 31  Class cancelled, instructor out of the town 

12  Apr. 2  Metropolis Algorithm, Webpage Similarity  same  
13  Apr. 7  PageRank, HITS, Salsa  same  Leskovec's slides  
13  Apr. 9  5. Models for Big Data  MapReduce, PageRank Implementation 
slides  Phillips' slides (modified) 
14  Apr. 14  ActiveDHT, PageRank Implementation  same  Stoica's slides about DHT Goel's slides about Active DHT 

14  April. 16  6. Student Presentations  Chen (Online Advertising), Chen (LSH for NN Search), Jazayeri (Clustering for Finding Communities) 

15  April. 21  Anderson, Biggs, Damico  
15  April. 23  Denison, Kutkoski, Read  
16  April. 28  Lowden (Spectual Clustering for Brain Data), Monaco, Tadros 

16  April. 30 
Schnaitman (Movie Recommendation Sys.), Tanner 
For assignments, students may discuss answers with anyone, including problem approaches and proofs. But all students must write their own proofs, and writeups. The names of all people that you have talked to should be listed at the beginning of the first page. If a solution comes from existing papers/web/books, they must be properly cited, and you must write the solution in a way that demonstrates your understanding (simply copying the solution will be considered as plagiarism). All deadlines are firm. No late assignments will be accepted unless there are legitimate circumstances.
For projects, you may discuss your project with anyone as well, but if this contributes to your final product, they must be acknowledged. Any outside materials used must be referenced appropriately.
For more details, see Indiana University Code of Student Rights, Responsibilities, and Conduct.