Hashing
Search algorithms that use hashing work by direct access based on key value. Hashing involves two components: a hash function, which transforms the search key into an array index, and a collision-resolution process, which deals with the cases where multiple keys are mapped to the same index. For applications where access involves only exact-match queries, hashing is usually the search method of choice because it is extremely efficient when implemented correctly.
Applications
(basically, applications that need search and don't need ordering)
- searchable dictionaries
- text search in word processors, etc.
- spellchecking
- web search engines (how Google works)
- databases
- internet routers
- file merge and sync
Direct access table
From the search problem definition: given a collection L of n records of the form $(k_1, I_1), (k_2, I_2), ..., (k_n, I_n)$, where $I_j$ is the information associated with key $k_j$ from record $j$ for $1\leq j \leq n$, and a particular key value $K$, the search problem is to locate a record $(k_j, I_j)$ in L such that $k_j = K$ (if one exists). Searching is a systematic method for locating the record (or records) with key value $k_j = K$.
Simplest approach: an array of items, where the key $k_j$ is the index into the array and the information $I_j$ is the value stored at that index.
Main advantages:
- there are no key collisions
- it's easy to make it fast (every access is a single array lookup)
Main drawbacks:
- uses a lot of memory if the set of possible keys is large (e.g. integer encoding of English language words?!?)
- keys must be integers
Hashing addresses both drawbacks:
- Avoid occupying memory for unused entries in the array (this is the actual 'hashing').
- Map all key values to non-negative integers (this would be 'pre-hashing'). A naive implementation works when all keys are finite and discrete (e.g. integer encoding of English language words). In Java, this is implemented by hashCode()
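The direct-access approach can be sketched as follows (a minimal illustration with hypothetical names; it assumes keys are small non-negative integers, which is exactly the restriction hashing removes):

```java
// Direct-access table sketch: the key itself is the array index.
// The array must cover the whole key range, which is the memory
// drawback noted above when the key set is large.
public class DirectAccessTable {
    private final String[] table; // info I_j stored at index k_j

    public DirectAccessTable(int maxKey) {
        table = new String[maxKey + 1]; // one slot per possible key
    }

    public void insert(int key, String info) { table[key] = info; }
    public String search(int key)            { return table[key]; } // null if absent
    public void delete(int key)              { table[key] = null; }

    public static void main(String[] args) {
        DirectAccessTable t = new DirectAccessTable(100);
        t.insert(42, "record for key 42");
        System.out.println(t.search(42)); // prints "record for key 42"
        System.out.println(t.search(7));  // prints "null"
    }
}
```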
Hash functions
- The function that maps key values to positions is called a hash function (h).
- Typically there are many more values in the key range than there are slots in the hash table.
- Given a hash function h and two keys $k_1$ and $k_2$, if $h(k_1)=h(k_2)=\beta$, we say that $k_1$ and $k_2$ have a collision at slot $\beta$ under hash function h.
- Perfect hashing is a system in which records are hashed such that there are no collisions (e.g., indexing k-mers when k is small).
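The k-mer example can be made concrete: encoding each DNA base in 2 bits maps every distinct k-mer to a distinct index in $[0, 4^k)$, so there are no collisions by construction (the helper below is illustrative, not from the text):

```java
// A perfect hash for DNA k-mers: each base gets 2 bits, so distinct
// k-mers of length k map to distinct indices in [0, 4^k).
public class KmerIndex {
    static int kmerIndex(String kmer) {
        int idx = 0;
        for (int i = 0; i < kmer.length(); i++) {
            int code = "ACGT".indexOf(kmer.charAt(i)); // A=0, C=1, G=2, T=3
            idx = (idx << 2) | code;                   // shift in 2 bits per base
        }
        return idx;
    }

    public static void main(String[] args) {
        System.out.println(kmerIndex("AAA")); // 0
        System.out.println(kmerIndex("ACG")); // 0*16 + 1*4 + 2 = 6
    }
}
```

This only stays practical for small k, since the table needs $4^k$ slots.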
- An ideal hash function stores the actual records in the collection such that each slot in the hash table has equal probability of being filled; but clustering of records happens (many records hash to only a few of the slots). For example, h(K) = 1 is a legal hash function, but it is not very useful---everything will be clustered into one slot!
- Common hash functions
- Positive integers: modular hashing
```java
// A function used to hash integers to a table of M slots
int h(int x, int M) {
    return x % M;
}
```
- Floating-point numbers: modular hashing on the binary representation of the key
- Strings
```java
int h(String s, int M) {
    // mask off the sign bit (to turn the 32-bit integer into a 31-bit nonnegative integer);
    // s.hashCode() % M alone won't work: it gives -48 for String "polygenelubricants" when M = 100
    return (s.hashCode() & 0x7fffffff) % M;
}
```
- Folding approach & mid-square method (Shaffer 9.4.1). The mid-square method squares the key value, and then takes the middle r bits of the result, giving a value in the range 0 to $2^r - 1$. This works well because most or all bits of the key value contribute to the result. E.g., key value 4567 is hashed to slot #57 in a table of size 100 (i.e., in the range 0 to 99) (see Shaffer Fig. 9.2).
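The 4567-to-57 example uses the decimal analogue of mid-square (middle digits rather than middle bits). A sketch reproducing it, with illustrative names:

```java
// Decimal mid-square sketch: square the key, keep the middle two digits
// as the slot number in a table of size 100.
public class MidSquare {
    static int midSquare(int key) {
        long sq = (long) key * key;       // 4567^2 = 20857489
        String digits = Long.toString(sq);
        int mid = digits.length() / 2;
        return Integer.parseInt(digits.substring(mid - 1, mid + 1)); // middle two digits
    }

    public static void main(String[] args) {
        System.out.println(midSquare(4567)); // 57, matching the example above
    }
}
```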
The folding approach for strings sums the ASCII values of the letters in a string, then applies the modulus operator to get the slot number. If the hash table size is small, this method does a good job of distributing strings evenly among the hash table slots. However, if the table is large (e.g., of size 1000), the distribution of the key values in the table will be uneven.
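A sketch of folding, with a comment on why large tables distribute poorly:

```java
// Folding sketch: sum the character (ASCII) values, then take mod M.
public class Fold {
    static int fold(String s, int M) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++)
            sum += s.charAt(i);
        return sum % M;
    }

    public static void main(String[] args) {
        // Lowercase letters are 97..122, so a 10-character word sums to at
        // most about 1220: fine for M = 100, but with M = 1000 the slots
        // above ~1220 can never be reached, giving the uneven distribution
        // described above.
        System.out.println(fold("hashing", 100));
    }
}
```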
Exercise: Insert the keys { 7, 18, 2, 3, 14, 25, 1, 11, 12, 1332 }, using open hashing (chaining), in a hash table with $m=11$. Draw the table (with linked lists where necessary) that you obtain by using the modular hash function h(k) = k mod 11.
Collision resolution
- While the goal of a hash function is to minimize collisions, some collisions are unavoidable in practice.
- Hashing implementations must include some form of collision resolution policy.
- Two classes of collision resolution techniques:
  - Open hashing (also called separate chaining)---collisions are stored outside the table. This approach builds, for each of the slots in the table, a linked list of the key-value pairs whose keys hash to that slot (index).
  - Closed hashing (also called open addressing)---collisions result in storing one of the records at another slot in the table.
Open hashing
- The simplest form of open hashing defines each slot in the hash table to be the head of a linked list. All records that hash to a particular slot are placed on that slot’s linked list.
- Records within a slot’s list can be ordered in several ways: by insertion order, by key value order, or by frequency-of-access order.
- The average cost for hashing should be Θ(1); however, if clustering of records exists, then the cost to access a record can be much higher because many elements on the linked list must be searched.
- The worst case is Θ(n) ... but we can expect constant time in most cases: under simple uniform hashing (or universal hashing), the expected cost is Θ(1+α), where α = n/m is the load factor (n records in m slots).
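The open-hashing scheme above can be sketched as a minimal separate-chaining table (integer keys and modular hashing are simplifying assumptions):

```java
import java.util.LinkedList;

// Minimal separate-chaining table: each slot heads a linked list of the
// keys that hash to it; records are kept in insertion order.
public class ChainedHashTable {
    private final LinkedList<Integer>[] slots;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int m) {
        slots = new LinkedList[m];
        for (int i = 0; i < m; i++)
            slots[i] = new LinkedList<>();
    }

    private int h(int k) { return Math.floorMod(k, slots.length); } // h(k) = k mod m

    public void insert(int k)    { slots[h(k)].addFirst(k); }
    public boolean search(int k) { return slots[h(k)].contains(k); } // walks one chain

    public static void main(String[] args) {
        ChainedHashTable t = new ChainedHashTable(11);
        for (int k : new int[] {7, 18, 2, 3, 14, 25}) t.insert(k);
        System.out.println(t.search(25)); // true
        System.out.println(t.search(99)); // false
    }
}
```

With m = 11, the keys 3, 14, and 25 all hash to slot 3 and land on the same chain, as in the exercise above.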
Closed hashing
- Closed hashing stores all records directly in the hash table.
- A collision resolution policy must be built in to determine which slot to use when a collision is detected.
- A probe sequence is a sequence of slots that can potentially hold a record: the first slot is the home position; the second slot is used if the first one is occupied, and so on.
- The probe sequence can be generated by some probe function (p).
- The same policy must be followed during search as during insertion (insertion & search need to use the same probe sequence).
- Some common closed hashing:
- Bucket hashing --- overflow goes to an overflow bucket
- Linear probing --- when there is a collision (when we
hash to a table index that is already occupied with a key different from
the search key), we just check the next entry in the table (by
incrementing the index). $p(K, i) = i$, which returns an offset from the
original home position (e.g., $p(K, 1)$ returns the first position in the probe sequence after the home slot for key $K$).
- Retrieve (no deletions prior to retrieval) --- follow the same probe sequence that insert used until one of the three conditions is met: 1) the query item is found; 2) an empty slot is found; or 3) the entire table is visited.
- Deletion --- find the item using the retrieve operation, then empty the slot
- Retrieve after deletion --- a slot can be occupied (currently in use), empty (has never been used), or deleted (once was occupied but is now available). So the retrieve operation can continue probing when it encounters a location in the deleted state; Similarly, elements can be inserted into either empty or deleted slots.
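The insert/retrieve/delete rules above can be sketched with a tombstone marker for the deleted state (integer keys and the sentinel value are assumptions of this sketch):

```java
// Linear probing with the three slot states described above:
// EMPTY (null, never used), a stored key (in use), or DELETED (tombstone).
public class LinearProbeTable {
    private static final Integer DELETED = Integer.MIN_VALUE; // sentinel; that key value is unusable
    private final Integer[] slots;

    public LinearProbeTable(int m) { slots = new Integer[m]; }

    private int h(int k) { return Math.floorMod(k, slots.length); }

    public void insert(int k) { // assumes the table is not full
        int i = h(k);
        while (slots[i] != null && !slots[i].equals(DELETED))
            i = (i + 1) % slots.length; // probe sequence p(K, i) = i
        slots[i] = k;                   // reuse EMPTY or DELETED slots
    }

    public boolean search(int k) {
        int i = h(k);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[i] == null) return false;  // EMPTY stops the search
            if (slots[i].equals(k)) return true; // found
            i = (i + 1) % slots.length;          // occupied or DELETED: keep probing
        }
        return false; // entire table visited
    }

    public void delete(int k) {
        int i = h(k);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[i] == null) return;
            if (slots[i].equals(k)) { slots[i] = DELETED; return; }
            i = (i + 1) % slots.length;
        }
    }
}
```

After inserting 3, 14, and 25 (all home slot 3 when m = 11) and deleting 14, a search for 25 still succeeds because probing continues through the tombstone instead of stopping at it.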
- The tendency of linear probing to cluster items together is known as primary clustering.
- Improved linear probing---use linear probing, but skip slots by a constant c other than 1. $p(K, i)=ci$.
- Pseudo-random probing: $p(K, i) = Perm[i - 1]$ (again, the same probe sequence needs to be applied for insertion & searching)
- Quadratic probing: $p(K, i) = c_1i^2 + c_2i + c_3$ for some choice of constants $c_1$, $c_2$, and $c_3$.
The right combination of probe function and table size will visit many slots in the table. E.g., if the hash table size is a power of two and the probe function is $p(K, i) = (i^2 + i)/2$, then every slot in the table will be visited by the probe function.
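This claim is easy to check directly (a hypothetical helper, not from the text):

```java
// Verifies that with table size m a power of two, the probe offsets
// p(i) = (i*i + i)/2 for i = 0..m-1 hit every slot (mod m).
public class QuadraticCoverage {
    static boolean coversAllSlots(int m) {
        boolean[] seen = new boolean[m];
        for (long i = 0; i < m; i++)
            seen[(int) (((i * i + i) / 2) % m)] = true;
        for (boolean b : seen)
            if (!b) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(coversAllSlots(16)); // true (power of two)
        System.out.println(coversAllSlots(6));  // false, for comparison
    }
}
```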
Hashtable and Hashmap in Java
- Both Hashtable & Hashmap implement Map interface (MapDemo.java)
Read this article for differences between these two classes (Hashtable is thread-safe; HashMap is not)
- Hashtable in Java
How to create a Hashtable?
```java
// create a new hashtable with specified initial capacity & specified load factor
Hashtable(int initialCapacity, float loadFactor)

// create a new hashtable with specified initial capacity & default load factor (0.75)
Hashtable(int initialCapacity)

// create a new hashtable with default initial capacity (11) & default load factor (0.75)
Hashtable()
```
An instance of Hashtable has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. Note that the hash table is open: in the case of a "hash collision", a single bucket stores multiple entries, which must be searched sequentially. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased.
Common Hashtable operations (HashtableDemo.java):
- How to put object into Hashtable?
- How to find size of Hashtable?
- How to check if Hashtable is empty?
- How to retrieve object from Hashtable?
- How to check if Hashtable contains a particular value?
- How to check if Hashtable contains a particular key?
- How to traverse (enumerate) Hashtable?
- How to get all values from hashtable?
- How to get all keys from hashtable?
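A small demo exercising the operations listed above (a stand-in sketch; the course's HashtableDemo.java may differ):

```java
import java.util.Enumeration;
import java.util.Hashtable;

public class HashtableOps {
    public static void main(String[] args) {
        Hashtable<String, Integer> ht = new Hashtable<>();
        ht.put("apple", 1);                           // put object into Hashtable
        ht.put("banana", 2);
        System.out.println(ht.size());                // size: 2
        System.out.println(ht.isEmpty());             // false
        System.out.println(ht.get("apple"));          // retrieve: 1
        System.out.println(ht.containsValue(2));      // contains a particular value: true
        System.out.println(ht.containsKey("banana")); // contains a particular key: true
        for (Enumeration<String> e = ht.keys(); e.hasMoreElements(); )
            System.out.println(e.nextElement());      // traverse (enumerate) keys
        System.out.println(ht.values());              // all values
        System.out.println(ht.keySet());              // all keys
    }
}
```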
Class Object & hashCode
- Class Object is the root of the class hierarchy. Every class has
Object as a superclass. All objects, including arrays, implement the
methods of this class.
- boolean equals(Object obj) //indicates whether obj is "equal to" this one
- int hashCode() //return a hash code value for this object---an integer (NOT the index)
```java
// hashCode for Strings
int hashCode(String s) {
    int hash = 0;
    int R = 31; // 31 is prime; also a Mersenne prime (a prime of the form 2^n - 1: 3, 7, 31, 127, ...)
    for (int i = 0; i < s.length(); i++)
        hash = R * hash + s.charAt(i);
    return hash;
}
```
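The loop above is Horner's rule for $s[0] \cdot 31^{n-1} + s[1] \cdot 31^{n-2} + \dots + s[n-1]$, which is the same formula the JDK documents for String.hashCode(), so this sketch reproduces Java's values, including the negative result for "polygenelubricants" mentioned earlier:

```java
public class StringHash {
    static int stringHash(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++)
            hash = 31 * hash + s.charAt(i); // Horner's rule
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(stringHash("abc") == "abc".hashCode()); // true
        // The raw hash can be negative, which is why the sign bit is masked
        // off before taking % M in the String hashing function earlier.
        System.out.println(stringHash("polygenelubricants") % 100); // -48
    }
}
```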
Google SparseHash
- Google sparsehash contains several hash-map implementations: sparse_hash_map uses very little space overhead, 1-2 bits per entry; dense_hash_map is very fast, particularly on lookup. (sparse_hash_set and dense_hash_set are the set versions of these routines.) On the other hand, these classes have requirements that may not make them appropriate for all applications.
- All these implementations use a hashtable with internal quadratic probing. This method is space-efficient -- there is no pointer overhead -- and time-efficient for good hash functions.
- See this article for an explanation and performance comparison
- SparseHash on github