Hashing
Search algorithms that use hashing work by direct access based on key value. Hashing involves two components: a hash function, which transforms the search key into an array index, and a collision-resolution process, which deals with the cases where multiple keys are mapped to the same index. For applications where access involves only exact-match queries, hashing is usually the search method of choice because it is extremely efficient when implemented correctly.
Applications
(basically, applications that need search and don't need ordering)
- searchable dictionaries
- text search in word processors, etc.
- spellchecking
- web search engines (how Google works)
- databases
- internet routers
- file merge and sync
Direct access table
From the search problem definition: given a collection L of n records of the form $(k_1, I_1), (k_2, I_2), ..., (k_n, I_n)$, where $I_j$ is the information associated with key $k_j$ from record $j$ for $1\leq j \leq n$, and a particular key value $K$, the search problem is to locate a record $(k_j, I_j)$ in L such that $k_j = K$ (if one exists). Searching is a systematic method for locating the record (or records) with key value $k_j = K$.
Simplest approach: an array of items, where the key $k_j$ is the index into the array and the information $I_j$ is the value stored at that index.
Main advantages:
- there are no key collisions
- it's easy to make it fast (every access is a single array lookup)
Main drawbacks:
- uses a lot of memory if the set of possible keys is large (e.g. integer encoding of English language words?!?)
- keys must be integers
Hashing addresses both drawbacks:
- Avoid occupying memory for unused entries in the array (this is the actual 'hashing').
- Map all key values to non-negative integers (this would be 'pre-hashing'). A naive implementation works when all keys are finite and discrete (e.g. integer encoding of English language words). In Java, this is implemented by hashCode()
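The direct-access approach can be sketched as follows (a minimal illustration with hypothetical names; it assumes keys are small non-negative integers, which is exactly the restriction hashing removes):

```java
// Direct-access table sketch: the key itself is the array index.
// The array must cover the whole key range, which is the memory
// drawback noted above when the key set is large.
public class DirectAccessTable {
    private final String[] table; // info I_j stored at index k_j

    public DirectAccessTable(int maxKey) {
        table = new String[maxKey + 1]; // one slot per possible key
    }

    public void insert(int key, String info) { table[key] = info; }
    public String search(int key)            { return table[key]; } // null if absent
    public void delete(int key)              { table[key] = null; }

    public static void main(String[] args) {
        DirectAccessTable t = new DirectAccessTable(100);
        t.insert(42, "record for key 42");
        System.out.println(t.search(42)); // prints "record for key 42"
        System.out.println(t.search(7));  // prints "null"
    }
}
```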
Hash functions
- The function that maps key values to positions is called a hash function (h).
- Typically there are many more values in the key range than there are slots in the hash table.
- Given a hash function h and two keys $k_1$ and $k_2$, if $h(k_1)=h(k_2)=\beta$, we say that $k_1$ and $k_2$ have a collision at slot $\beta$ under hash function h.
- Perfect hashing is a system in which records are hashed such that there are no collisions (e.g., indexing k-mers when k is small).
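The k-mer example can be made concrete: encoding each DNA base in 2 bits maps every distinct k-mer to a distinct index in $[0, 4^k)$, so there are no collisions by construction (the helper below is illustrative, not from the text):

```java
// A perfect hash for DNA k-mers: each base gets 2 bits, so distinct
// k-mers of length k map to distinct indices in [0, 4^k).
public class KmerIndex {
    static int kmerIndex(String kmer) {
        int idx = 0;
        for (int i = 0; i < kmer.length(); i++) {
            int code = "ACGT".indexOf(kmer.charAt(i)); // A=0, C=1, G=2, T=3
            idx = (idx << 2) | code;                   // shift in 2 bits per base
        }
        return idx;
    }

    public static void main(String[] args) {
        System.out.println(kmerIndex("AAA")); // 0
        System.out.println(kmerIndex("ACG")); // 0*16 + 1*4 + 2 = 6
    }
}
```

This only stays practical for small k, since the table needs $4^k$ slots.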
- An ideal hash function stores the actual records in the collection such that each slot in the hash table has equal probability of being filled; but clustering of records happens (many records hash to only a few of the slots). For example, h(K) = 1 is a legal hash function, but it is not very useful---everything will be clustered into one slot!
- Common hash functions
- Positive integers: modular hashing
```java
// A function used to hash integers to a table of M slots
int h(int x, int M) {
    return x % M;
}
```
- Floating-point numbers: modular hashing on the binary representation of the key
- Strings
```java
int h(String s, int M) {
    // mask off the sign bit (to turn the 32-bit integer into a 31-bit nonnegative integer);
    // s.hashCode() % M alone won't work: it gives -48 for String "polygenelubricants" when M = 100
    return (s.hashCode() & 0x7fffffff) % M;
}
```
- Folding approach & mid-square method (Shaffer 9.4.1). The mid-square method squares the key value, and then takes the middle r bits of the result, giving a value in the range 0 to $2^r - 1$. This works well because most or all bits of the key value contribute to the result. E.g., key value 4567 is hashed to slot #57 in a table of size 100 (i.e., in the range 0 to 99) (see Shaffer Fig. 9.2).
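The 4567-to-57 example uses the decimal analogue of mid-square (middle digits rather than middle bits). A sketch reproducing it, with illustrative names:

```java
// Decimal mid-square sketch: square the key, keep the middle two digits
// as the slot number in a table of size 100.
public class MidSquare {
    static int midSquare(int key) {
        long sq = (long) key * key;       // 4567^2 = 20857489
        String digits = Long.toString(sq);
        int mid = digits.length() / 2;
        return Integer.parseInt(digits.substring(mid - 1, mid + 1)); // middle two digits
    }

    public static void main(String[] args) {
        System.out.println(midSquare(4567)); // 57, matching the example above
    }
}
```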
The folding approach for strings sums the ASCII values of the letters in a string, then applies the modulus operator to get the slot number. If the hash table size is small, this method does a good job of distributing strings evenly among the hash table slots. However, if the table is large (e.g., of size 1000), the distribution of the key values in the table will be uneven.
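A sketch of folding, with a comment on why large tables distribute poorly:

```java
// Folding sketch: sum the character (ASCII) values, then take mod M.
public class Fold {
    static int fold(String s, int M) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++)
            sum += s.charAt(i);
        return sum % M;
    }

    public static void main(String[] args) {
        // Lowercase letters are 97..122, so a 10-character word sums to at
        // most about 1220: fine for M = 100, but with M = 1000 the slots
        // above ~1220 can never be reached, giving the uneven distribution
        // described above.
        System.out.println(fold("hashing", 100));
    }
}
```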
Exercise: Insert the keys { 7, 18, 2, 3, 14, 25, 1, 11, 12, 1332 }, using open hashing (chaining), in a hash table with $m=11$. Draw the table (with linked lists where necessary) that you obtain by using the modular hash function h(k) = k mod 11.
Collision resolution
- While the goal of a hash function is to minimize collisions, some collisions are unavoidable in practice.
- Hashing implementations must include some form of collision resolution policy.
- Two classes of collision resolution techniques:
  - Open hashing (also called separate chaining)---collisions are stored outside the table. This approach builds, for each of the slots in the table, a linked list of the key-value pairs whose keys hash to that slot (index).
  - Closed hashing (also called open addressing)---collisions result in storing one of the records at another slot in the table.
Open hashing
- The simplest form of open hashing defines each slot in the hash table to be the head of a linked list. All records that hash to a particular slot are placed on that slot’s linked list.
- Records within a slot’s list can be ordered in several ways: by insertion order, by key value order, or by frequency-of-access order.
- The average cost for hashing should be Θ(1); however, if clustering of records exists, then the cost to access a record can be much higher because many elements on the linked list must be searched.
- The worst case is Θ(n) ... but we can expect constant time in most cases: under simple uniform hashing (or universal hashing), the expected cost is Θ(1+α), where α = n/m is the load factor (n records in m slots).
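The open-hashing scheme above can be sketched as a minimal separate-chaining table (integer keys and modular hashing are simplifying assumptions):

```java
import java.util.LinkedList;

// Minimal separate-chaining table: each slot heads a linked list of the
// keys that hash to it; records are kept in insertion order.
public class ChainedHashTable {
    private final LinkedList<Integer>[] slots;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int m) {
        slots = new LinkedList[m];
        for (int i = 0; i < m; i++)
            slots[i] = new LinkedList<>();
    }

    private int h(int k) { return Math.floorMod(k, slots.length); } // h(k) = k mod m

    public void insert(int k)    { slots[h(k)].addFirst(k); }
    public boolean search(int k) { return slots[h(k)].contains(k); } // walks one chain

    public static void main(String[] args) {
        ChainedHashTable t = new ChainedHashTable(11);
        for (int k : new int[] {7, 18, 2, 3, 14, 25}) t.insert(k);
        System.out.println(t.search(25)); // true
        System.out.println(t.search(99)); // false
    }
}
```

With m = 11, the keys 3, 14, and 25 all hash to slot 3 and land on the same chain, as in the exercise above.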
Closed hashing
- Closed hashing stores all records directly in the hash table.
- A collision resolution policy must be built in to determine which slot to use when a collision is detected.
- A probe sequence is a sequence of slots that can potentially hold a record: the first slot is the home position; the second slot is used if the first one is occupied, and so on.
- The probe sequence can be generated by some probe function (p).
- The same policy must be followed during search as during insertion (insertion & search need to use the same probe sequence).
- Some common closed hashing:
- Bucket hashing --- overflow goes to an overflow bucket
- Linear probing --- when there is a collision (when we
hash to a table index that is already occupied with a key different from
the search key), we just check the next entry in the table (by
incrementing the index). $p(K, i) = i$, which returns an offset from the
original home position (e.g., $p(K, 1)$ returns the first position in the probe sequence after the home slot for key $K$).
- Retrieve (no deletions prior to retrieval) --- follow the same probe sequence that insert used until one of the three conditions is met: 1) the query item is found; 2) an empty slot is found; or 3) the entire table is visited.
- Deletion --- find the item using the retrieve operation, then empty the slot
- Retrieve after deletion --- a slot can be occupied (currently in use), empty (has never been used), or deleted (once was occupied but is now available). So the retrieve operation can continue probing when it encounters a location in the deleted state; Similarly, elements can be inserted into either empty or deleted slots.
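The insert/retrieve/delete rules above can be sketched with a tombstone marker for the deleted state (integer keys and the sentinel value are assumptions of this sketch):

```java
// Linear probing with the three slot states described above:
// EMPTY (null, never used), a stored key (in use), or DELETED (tombstone).
public class LinearProbeTable {
    private static final Integer DELETED = Integer.MIN_VALUE; // sentinel; that key value is unusable
    private final Integer[] slots;

    public LinearProbeTable(int m) { slots = new Integer[m]; }

    private int h(int k) { return Math.floorMod(k, slots.length); }

    public void insert(int k) { // assumes the table is not full
        int i = h(k);
        while (slots[i] != null && !slots[i].equals(DELETED))
            i = (i + 1) % slots.length; // probe sequence p(K, i) = i
        slots[i] = k;                   // reuse EMPTY or DELETED slots
    }

    public boolean search(int k) {
        int i = h(k);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[i] == null) return false;  // EMPTY stops the search
            if (slots[i].equals(k)) return true; // found
            i = (i + 1) % slots.length;          // occupied or DELETED: keep probing
        }
        return false; // entire table visited
    }

    public void delete(int k) {
        int i = h(k);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[i] == null) return;
            if (slots[i].equals(k)) { slots[i] = DELETED; return; }
            i = (i + 1) % slots.length;
        }
    }
}
```

After inserting 3, 14, and 25 (all home slot 3 when m = 11) and deleting 14, a search for 25 still succeeds because probing continues through the tombstone instead of stopping at it.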
- The tendency of linear probing to cluster items together is known as primary clustering.
- Improved linear probing---use linear probing, but skip slots by a constant c other than 1. $p(K, i)=ci$.
- Pseudo-random probing: $p(K, i) = Perm[i - 1]$ (again, the same probe sequence needs to be applied for insertion & searching)
- Quadratic probing: $p(K, i) = c_1i^2 + c_2i + c_3$ for some choice of constants $c_1$, $c_2$, and $c_3$.
The right combination of probe function and table size will visit many slots in the table. E.g., if the hash table size is a power of two and the probe function is $p(K, i) = (i^2 + i)/2$, then every slot in the table will be visited by the probe function.
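This claim is easy to check directly (a hypothetical helper, not from the text):

```java
// Verifies that with table size m a power of two, the probe offsets
// p(i) = (i*i + i)/2 for i = 0..m-1 hit every slot (mod m).
public class QuadraticCoverage {
    static boolean coversAllSlots(int m) {
        boolean[] seen = new boolean[m];
        for (long i = 0; i < m; i++)
            seen[(int) (((i * i + i) / 2) % m)] = true;
        for (boolean b : seen)
            if (!b) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(coversAllSlots(16)); // true (power of two)
        System.out.println(coversAllSlots(6));  // false, for comparison
    }
}
```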
Hashtable and Hashmap in Java
- Both Hashtable & Hashmap implement Map interface (MapDemo.java)
Read this article for differences between these two classes (Hashtable is thread-safe; HashMap is not)
- Hashtable in Java
How to create a Hashtable?
```java
// create a new hashtable with specified initial capacity & specified load factor
Hashtable(int initialCapacity, float loadFactor)

// create a new hashtable with specified initial capacity & default load factor (0.75)
Hashtable(int initialCapacity)

// create a new hashtable with default initial capacity (11) & default load factor (0.75)
Hashtable()
```
An instance of Hashtable has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. Note that the hash table is open: in the case of a "hash collision", a single bucket stores multiple entries, which must be searched sequentially. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased.
Common Hashtable operations (HashtableDemo.java):
- How to put object into Hashtable?
- How to find size of Hashtable?
- How to check if Hashtable is empty?
- How to retrieve object from Hashtable?
- How to check if Hashtable contains a particular value?
- How to check if Hashtable contains a particular key?
- How to traverse (enumerate) Hashtable?
- How to get all values from hashtable?
- How to get all keys from hashtable?
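A small demo exercising the operations listed above (a stand-in sketch; the course's HashtableDemo.java may differ):

```java
import java.util.Enumeration;
import java.util.Hashtable;

public class HashtableOps {
    public static void main(String[] args) {
        Hashtable<String, Integer> ht = new Hashtable<>();
        ht.put("apple", 1);                           // put object into Hashtable
        ht.put("banana", 2);
        System.out.println(ht.size());                // size: 2
        System.out.println(ht.isEmpty());             // false
        System.out.println(ht.get("apple"));          // retrieve: 1
        System.out.println(ht.containsValue(2));      // contains a particular value: true
        System.out.println(ht.containsKey("banana")); // contains a particular key: true
        for (Enumeration<String> e = ht.keys(); e.hasMoreElements(); )
            System.out.println(e.nextElement());      // traverse (enumerate) keys
        System.out.println(ht.values());              // all values
        System.out.println(ht.keySet());              // all keys
    }
}
```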
Class Object & hashCode
- Class Object is the root of the class hierarchy. Every class has
Object as a superclass. All objects, including arrays, implement the
methods of this class.
- boolean equals(Object obj) //indicates whether obj is "equal to" this one
- int hashCode() //return a hash code value for this object---an integer (NOT the index)
```java
// hashCode for Strings
int hashCode(String s) {
    int hash = 0;
    int R = 31; // 31 is prime; also a Mersenne prime (a prime of the form 2^n - 1: 3, 7, 31, 127, ...)
    for (int i = 0; i < s.length(); i++)
        hash = R * hash + s.charAt(i);
    return hash;
}
```
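The loop above is Horner's rule for $s[0] \cdot 31^{n-1} + s[1] \cdot 31^{n-2} + \dots + s[n-1]$, which is the same formula the JDK documents for String.hashCode(), so this sketch reproduces Java's values, including the negative result for "polygenelubricants" mentioned earlier:

```java
public class StringHash {
    static int stringHash(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++)
            hash = 31 * hash + s.charAt(i); // Horner's rule
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(stringHash("abc") == "abc".hashCode()); // true
        // The raw hash can be negative, which is why the sign bit is masked
        // off before taking % M in the String hashing function earlier.
        System.out.println(stringHash("polygenelubricants") % 100); // -48
    }
}
```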
Google SparseHash
- Google sparsehash contains several hash-map implementations: sparse_hash_map uses very little space overhead, 1-2 bits per entry; dense_hash_map is very fast, particularly on lookup. (sparse_hash_set and dense_hash_set are the set versions of these routines.) On the other hand, these classes have requirements that may not make them appropriate for all applications.
- All these implementations use a hashtable with internal quadratic probing. This method is space-efficient -- there is no pointer overhead -- and time-efficient for good hash functions.
- See this article for an explanation and performance comparison
- SparseHash on github