Search problem
Definition
Assume we have a collection L of n records of the form $(k_1, I_1), (k_2, I_2), ..., (k_n, I_n)$, where $I_j$ is information assiciated with key $k_j$ from record $j$ for $1\leq j \leq n$. Given a particular key value $K$, the search problem is to locate a record $(k_j, I_j)$ in L such that $k_j = K$ (if one exists). Searching is a systematic method for locating the record (or records) with key value $k_j = K$.
Exact-match query vs range query
- An exact-match query is a search for the record(s) whose key value matches a specified key value. E.g., search for "Mother Bear" restaurant; find the record with Key value of 35.
- An range query is a search for all records whose key value falls witin a specified range of key values. E.g., search for restaurants that are within 3 miles.
Search algorithms
Sequential and list methods
- Search unsorted arrays
- Sequential search on an unsorted list---the simplest form of search.
- Search sorted arrays
- Jump search on a sorted list
- Binary search on a sorted list---if we know nothing about the distribution of key values.
- Dictionary search (or interpolation search)---if we know the distribution of key values. A dictionary search algorithm
uses some knowledge about the expected distribution of key value to computer where to look next. (E.g., when you look for a word starting with "S", you don't start at the middle of the dictionary). A variation on dictionary search is known as Quadratic Binary Search (QBS) (Shaffer p305-306)
- Self-organizing lists --- order records by expected frequency of access. In many search applications, real access patterns follow a rule of thumb called the 80/20 rule, an example of a Zpif distribution. Te 80/20 rule says that 80% of the record accesses are to 20% of the records. There are three traditional heuristics for managing self-organizing lists:
- count: whenever a record is accessed, it might move toward the front of the list if its number of accesses becomes greater than a record preceding it.
- move-to-front (MTF): bring a record to the front of the list when it is found.
Consider a linear list of items (e.g., singly-linked list). To access the item in the $i_{th}$ position requires time $i$. Also, any two
contiguous items can be swapped in constant time (not including the time to access them in the list). The goal is to allow access to a sequence of $n$ items in a minimal amount of time, starting from an intial list configuration. If the sequence of accesses is known in advance, one can design an optimal algorithm for swapping items to rearrange the list according to how oftem items are accessed, and when.
The MTF is a heuristic that takes advantage of the fact that if item $i$ is accessed at time $t$, it is likely to be accessed again soon after time $t$ (i.e., locality of reference). MTF works by moving item $i$ to the front of the list (via successive swaps from $i$ all the way down to first position) when it is accessed. Amortized analysis can be used to show that MTF always performs within a factor of 4 of any algorithm--even an optimal algorithm that knows the access sequence in advance, and even without making assumptions about locality of reference. (see more)
- transpose: swap any record found with the record immediately preceding it in the list.
Tree indexing methods
- We have seen BST, and heaps
- 2-3 trees & B-trees (highly specialized for specific applications---B-trees are good for implementation of databases) (for bored only!)