Data too large to fit into memory
Big data
Data too large to be loaded into main memory---the information must reside on hard disk (or arrive as a stream) and be brought into main memory selectively for processing. This note gives you a brief introduction to the data structures & algorithms for working with big data, using simple applications.
(note: where the term hard disk is used, you may also consider most any other persistent-storage device)
Check out a collection of links for streaming algorithms and data structures on GitHub.
(Most) Data stays on the hard disk --- Disk-based applications
- Working with files: Logical file & physical file
- A programmer views a random access file stored on hard disk as a contiguous series of bytes (logical file).
- The physical file actually stored on hard disk is usually not a contiguous series of bytes---it could be in blocks and pieces spread all over the hard disk.
- The file manager, a part of the operating system, is responsible for taking requests for retrieving data from (or saving data to) a logical file and mapping those requests to the physical location of the data on hard disk. Disk drives are often referred to as direct access storage devices---it takes roughly equal time to access any record in the file.
- Data structures and algorithms for hard disk-based applications.
The key is to minimize the number of hard disk accesses!
(main memory access is much faster than access to data stored on hard disk or other persistent-storage devices)
- Use an efficient data structure (file structure) to arrange the information so that few accesses are needed.
- Save information previously retrieved so that it can be used to satisfy future requests without going back to disk. This requires the ability to guess accurately what information will be needed later and store it in primary memory now (which is referred to as caching).
- An example of hard disk-based application:
External mergesort algorithm (wikipedia)
We have previously seen mergesort that works with data in RAM. The external mergesort algorithm sorts chunks that each fit in RAM, then merges the sorted chunks together. For example, to sort 900 megabytes of data using only 100 megabytes of RAM:
- Read 100 MB of the data into main memory and sort it by some conventional sorting method, like quicksort.
- Write the sorted data to hard disk.
- Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks (there are 900MB / 100MB = 9 chunks), which now need to be merged into one single output file.
- Read the first 10 MB of each sorted chunk (of 100 MB) into input buffers in main memory and allocate the remaining 10 MB for an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)
- Perform a 9-way merge and store the result in the output buffer. Whenever the output buffer fills, write it to the final sorted file and empty it. Whenever any of the 9 input buffers empties, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available. This is the key step that makes external merge sort work externally -- because the merge algorithm only makes one pass sequentially through each of the chunks, each chunk does not have to be loaded completely; rather, sequential parts of the chunk can be loaded as needed.
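The steps above can be sketched in Python. This is a minimal sketch, not a production implementation: it assumes the input is a text file with one integer per line, uses a line count (`chunk_size`) as a stand-in for the 100 MB memory budget, and relies on `heapq.merge` for the k-way merge, which reads each sorted run lazily---so each chunk streams through memory instead of being loaded completely, just as described above. The function and file names are illustrative.

```python
import heapq
import os
import tempfile

def external_mergesort(input_path, output_path, chunk_size=1000):
    # Phase 1: read chunks that fit in "memory", sort each one
    # with a conventional in-memory sort, and write it to disk.
    chunk_paths = []
    with open(input_path) as f:
        while True:
            chunk = [int(line) for _, line in zip(range(chunk_size), f)]
            if not chunk:
                break
            chunk.sort()  # in-memory sort (Timsort here)
            tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".chunk")
            tmp.write("\n".join(map(str, chunk)) + "\n")
            tmp.close()
            chunk_paths.append(tmp.name)

    # Phase 2: k-way merge of the sorted runs. heapq.merge pulls one
    # item at a time from each run, so only sequential parts of each
    # chunk are resident in memory at any moment.
    files = [open(p) for p in chunk_paths]
    runs = [(int(line) for line in fh) for fh in files]
    with open(output_path, "w") as out:
        for value in heapq.merge(*runs):
            out.write(f"{value}\n")
    for fh in files:
        fh.close()
    for p in chunk_paths:
        os.remove(p)
```

With `chunk_size=500` and 5000 input lines, phase 1 produces 10 sorted runs, mirroring the 9-chunk example above.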
Streaming algorithms
- In the data stream model, some or all of the input data that are to be operated on are not available for random access from hard disk or memory, but rather arrive as one or more continuous data streams. Streams can be denoted as an ordered sequence of points/items/updates that must be accessed in order and can be read only once or a small number of times. Examples of streaming data include network traffic data, database transactions, sensor networks, and satellite data feeds.
- Streaming algorithms are algorithms for processing data streams
in which the input is presented as a sequence of items and can be
examined in only a few passes (typically just one). These algorithms
have limited memory available to them (much less than the input size)
and also limited processing time per item. A streaming algorithm,
therefore, often needs to produce an approximate answer based on a
summary or "sketch" of the data stream in memory. Sampling &
sketching are two basic techniques seen in streaming algorithms.
- Sampling is a general technique for tackling massive amounts of data. E.g., to compute the median packet size of some IP packets, we could just sample some and use the median of the sample as an estimate for the true median.
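The packet-size example can be illustrated with a few lines of Python. This is a toy demonstration under assumptions not in the notes: the "stream" is a synthetic list of uniformly random packet sizes, and a 1% random sample stands in for whatever sampling scheme a real system would use (a one-pass stream would use reservoir sampling instead).

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible
packets = [random.randrange(40, 1500) for _ in range(100_000)]  # stand-in stream

sample = random.sample(packets, 1000)       # keep only 1% of the items
estimate = statistics.median(sample)        # median of the sample...
true_median = statistics.median(packets)    # ...approximates the true median
print(estimate, true_median)
```

The two medians land within a few percent of each other, while the sample uses a hundredth of the memory.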
- Sketching is another general technique for processing streams. The basic idea is to apply a linear projection "on the fly" that projects high-dimensional data to a smaller dimensional space. The lower dimensional data is then used to estimate the quantities of interest.
- A deterministic algorithm that approximates frequencies for top k items.
- Maintain set $S$ of $k$ counters, initialize to 0. For each element $x_i$ in stream:
- If $x_i \in S$ increment the counter for $x_i$.
- If $x_i \notin S$ add $x_i$ to $S$ if space is available, else decrement all counters in $S$.
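The counter-maintenance steps above can be written out directly (this scheme is known in the literature as the Misra-Gries or Frequent algorithm; the function name below is illustrative). A dictionary plays the role of the set $S$ of $k$ counters, and a counter that drops to zero frees its slot:

```python
def frequent_items(stream, k):
    """Approximate frequencies of the top-k items in one pass,
    using at most k counters of memory."""
    counters = {}  # the set S: item -> counter
    for x in stream:
        if x in counters:
            counters[x] += 1          # x_i in S: increment its counter
        elif len(counters) < k:
            counters[x] = 1           # space available: add x_i to S
        else:
            # no space: decrement every counter in S,
            # evicting any counter that reaches zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters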
How well does this algorithm perform?
The frequency estimate $n_j$ satisfies $f_j - n/k \leq n_j \leq f_j$, where $f_j$ is the real frequency, $n_j$ is the estimated frequency, and $n$ is the length of the stream.
Proof: it is obvious that $n_j \leq f_j$, since the counter for $j$ is only incremented when $j$ arrives, and may additionally be decremented. The difference between the real and estimated frequencies is caused by one of two scenarios: a) item $j$ arrives while $j \notin S$ and no space is available, so every counter in $S$ gets decremented and this occurrence of $j$ goes uncounted; or b) the counter for $j$ gets decremented due to the arrival of another element $i$ that is not contained in $S$. Either way, the triggering arrival decrements all $k$ counters at once; since at most $n$ increments enter the counters in total, such events can occur at most $n/k$ times. So we have $n_j \geq f_j - n/k$.
- Count-Min Sketch for the count tracking problem (check out the paper)
The Count-Min Sketch is a (probabilistic) data structure consisting of a fixed array of counters, of width $w$ and depth $d$. Each row of counters is associated with a different hash function, which maps items uniformly onto the range $\{1, 2, \dots, w\}$. It is important that each hash function is different; otherwise, there is no benefit from the repetition.
The Count-Min Sketch data structure has two methods:
- Update($i$, $c$) operations update the data structure in a
straightforward way. In each row, the corresponding hash function is
applied to $i$ to determine a corresponding counter. Then the update $c$
is added on to that counter. Similarly, the item is mapped to different
locations in each of the other rows. For example, if there are four
rows of counters, we will evaluate four different hash functions on an
item $i$, and update four counters accordingly.
- Estimate($i$): For Estimate($i$) operation, the process is
quite similar. For each row, the corresponding hash function is applied
to $i$ to look up one of the counters. Across all rows, the estimate is
found as the minimum of all the probed counters.
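The two methods can be sketched as a small Python class. This is a minimal sketch under one simplifying assumption not in the paper: each row's hash function is Python's built-in `hash` salted with a per-row random value, standing in for the pairwise-independent hash family a real implementation would use. The class and method names are illustrative.

```python
import random

class CountMinSketch:
    def __init__(self, w, d, seed=0):
        self.w, self.d = w, d
        rng = random.Random(seed)
        # one salt per row makes each row's hash function different
        self.salts = [rng.randrange(2**31) for _ in range(d)]
        self.counts = [[0] * w for _ in range(d)]  # d rows of w counters

    def _bucket(self, row, item):
        # row's hash function: maps item onto {0, ..., w-1}
        return hash((self.salts[row], item)) % self.w

    def update(self, item, c=1):
        # in each row, hash the item to a counter and add c to it
        for r in range(self.d):
            self.counts[r][self._bucket(r, item)] += c

    def estimate(self, item):
        # probe one counter per row; collisions only inflate counters,
        # so the minimum over the rows is the tightest estimate
        return min(self.counts[r][self._bucket(r, item)]
                   for r in range(self.d))
```

Note that `estimate` never underestimates a count: every update to the item lands in each probed counter, and collisions with other items can only add to it.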
For example, suppose we want an error of at most 0.1% (of the sum of all frequencies), with 99.9% certainty. Then we need to set $\frac{2}{w} = \frac{1}{1000}$ (i.e., $w = 2000$), and $(\frac{1}{2})^d = 0.001$, i.e., $d = \lceil \log 0.001 / \log 0.5 \rceil = 10$. Using 32-bit counters, the space required by the array of counters is $w \times d \times 4$ bytes $= 80$KB. An error on the scale of 0.1% of the sum of all frequencies sounds too bad for rare items, but it is acceptable for frequent items. We can set different $w$ and $d$ to control the error rate and the certainty, respectively, for an application.
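The parameter choice can be double-checked with a few lines of Python, assuming the standard Count-Min guarantee used above: error at most $(2/w) \cdot n$ with probability at least $1 - (1/2)^d$. The helper name is illustrative.

```python
import math

def cms_parameters(rel_error, failure_prob):
    # width from the error bound 2/w <= rel_error
    w = math.ceil(2 / rel_error)
    # depth from the certainty bound (1/2)^d <= failure_prob
    d = math.ceil(math.log(failure_prob, 0.5))
    return w, d

w, d = cms_parameters(0.001, 0.001)
print(w, d, w * d * 4)  # 2000 10 80000 -> 80KB of 32-bit counters
```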