More trees
Balanced binary search trees (AVL tree) | Non-binary trees (B-tree and variants) | Spatial trees
Review of BST
BSTs support operations such as insert, delete, find min./max., and find successor and predecessor nodes in $O(h)$ time, where $h$ is the height of the tree. The problem with BSTs: $h$ is $\Theta(\log n)$ when the BST is balanced, but $h$ can grow as large as $n$ if the BST is very unbalanced.
Balanced binary search tree
AVL tree (named after its inventors: G.M. Adel’son-Vel’skii and E.M. Landis)
- Definition: An AVL tree is a Binary Search Tree (BST) with an additional property:
the subtrees of every node differ in height by at most one (every subtree is also an AVL tree).
An AVL tree is a height balanced BST.
- Rotation: A rotation is a local operation in a search tree that preserves in-order traversal key ordering.
- How do AVL trees work?
Adding or removing a leaf from an AVL tree may make nodes violate the AVL balance condition, but each violation of AVL balance can be restored by one or two simple local changes called rotations. Assume the insertion of a new node turns a previously balanced AVL tree into an unbalanced one: the subtree heights at any node of the new tree can then differ by at most 2. Let $S$ be the bottommost unbalanced node; there are only four cases:
- Case 1: the new node is in the left child of the left child of S (left-left case).
- Case 2: the new node is in the right child of the left child of S (left-right case).
- Case 3: the new node is in the left child of the right child of S (right-left case).
- Case 4: the new node is in the right child of the right child of S (right-right case).
```java
/* Pseudocode for the right rotation that fixes Case 1. */
/* Rotates node S to the right, making its left child into its parent. */
Node rotateRight(Node S) {       // S: the bottommost unbalanced node
    Node X = S.getLeft();        // X: the left child of S
    S.setLeft(X.getRight());     // X's right subtree becomes S's left subtree
    X.setRight(S);               // S becomes X's right child (X is now the parent of S)
    return X;                    // return the new root of this subtree
}
```
Case 4 is symmetrical to Case 1: can be re-balanced using a single rotation (rotateLeft)
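The remaining cases can be sketched the same way. Below is a minimal, hypothetical Java sketch (the `Node` class with direct `left`/`right` fields is an assumption, not the notes' exact API): `rotateLeft` is the mirror of `rotateRight` and fixes Case 4, and Case 2 (left-right) is fixed by a double rotation: first rotate S's left child to the left, then rotate S to the right. Case 3 is the mirror of Case 2.

```java
// Hypothetical Node class for illustration (not the notes' exact API).
class Node {
    int key;
    Node left, right;
    Node(int key) { this.key = key; }
}

class Rotations {
    // Case 4 (right-right): single left rotation, the mirror of rotateRight.
    static Node rotateLeft(Node s) {
        Node x = s.right;        // x: the right child of s
        s.right = x.left;        // x's left subtree becomes s's right subtree
        x.left = s;              // s becomes x's left child
        return x;                // x is the new root of this subtree
    }

    // Case 1 (left-left): single right rotation.
    static Node rotateRight(Node s) {
        Node x = s.left;
        s.left = x.right;
        x.right = s;
        return x;
    }

    // Case 2 (left-right): rotate s's left child left, then rotate s right.
    static Node rotateLeftRight(Node s) {
        s.left = rotateLeft(s.left);
        return rotateRight(s);
    }
}
```

For example, inserting 20 under the chain 30 → 10 (left) → 20 (right) creates a left-right imbalance at 30; `rotateLeftRight` makes 20 the subtree root with 10 and 30 as children, preserving the in-order key ordering.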
- Another visualization of the four different cases
(Image credit: https://www.hackerrank.com/challenges/self-balancing-tree/problem)
- Time complexity for insertion:
The code walks down the tree from the root to find where the new leaf goes and adds it ($\Theta(\log n)$). Then, on the way back up, it looks for AVL violations and fixes them with rotations. Since a constant ($\Theta(1)$) amount of work is done at each node along the path, the total amount of work is proportional to the length of the path, which is $\Theta(\log n)$.
- Visualizing operations in an AVL tree (from Data Structure Visualizations)
- Examples:
Red-black tree (for the bored!)
A red-black tree is a binary search tree with one extra attribute for each node: the color, which is either red or black. A red-black tree must satisfy the following red-black properties:
- Every node is either red or black.
- Every leaf (NULL) is black.
- If a node is red, then both its children are black.
- Every simple path from a node to a descendant leaf contains the same number of black nodes.
Lemma: A red-black tree with $n$ internal nodes has height at most $2\log_2(n+1)$.
This demonstrates why the red-black tree is a good search tree: it can always be searched in $O(\log n)$ time.
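A standard proof sketch for the lemma (the argument below is a common textbook derivation, not taken from these notes): let $\mathrm{bh}(x)$ be the number of black nodes on any path from $x$ (exclusive) down to a leaf. By induction, the subtree rooted at $x$ contains at least $2^{\mathrm{bh}(x)} - 1$ internal nodes. Since both children of a red node are black, at least half of the nodes on any root-to-leaf path are black, so $\mathrm{bh}(\text{root}) \ge h/2$. Therefore

```latex
n \;\ge\; 2^{\mathrm{bh}(\mathrm{root})} - 1 \;\ge\; 2^{h/2} - 1
\quad\Longrightarrow\quad
h \;\le\; 2\log_2(n+1).
```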
Insertion or removal of an element may violate the properties of a red-black tree. Restoring the red-black properties requires recoloring and some rotations, which can be done efficiently.
The Splay tree (for the bored!)
- A BST that reimplements the BST insert, delete, and search methods to improve performance. Unlike the AVL tree, the splay tree is not guaranteed to be height balanced; what is guaranteed is that the total cost of an entire series of accesses will be cheap. The splay tree shares similar ideas with the move-to-front list. When a node is accessed, it is moved to the root of the BST, a process called splaying; when a node is deleted, its parent becomes the root of the BST. As in the AVL tree, a splay of a node consists of a series of rotations.
- Visualizing operations in a Splay tree (from Data Structure Visualizations)
Randomized BST & Treap (tree + heap)
Ref 1: http://jeffe.cs.illinois.edu/teaching/algorithms/notes/10-treaps.pdf
Non-binary trees
2-3 trees, B-trees and B+-trees (slides)
- An efficient disk-based data structure that stores key/value pairs for data that does not fit in memory.
- 2-3 trees:
A node contains one or two keys; every internal node has either two children (if it contains one key) or three children (if it contains two keys); and all leaves are at the same level in the tree. 2-3 trees are always height balanced.
- B-trees:
B-trees and some variants of B-trees are the standard file organization for applications requiring insertion, deletion, and key range searches. They are used to implement most modern file systems; a B-tree is used in the HDF5 file format.
- defining properties of a B-tree of order $m$:
- the root is either a leaf or has at least two children
- each internal node, except for the root, has between $\lceil m/2 \rceil$ and $m$ children, for example:
for m = 3, each internal node has at most 3 children, and at least $\lceil 3/2 \rceil$ = 2 children
- all leaves are at the same level in the tree: the tree is height balanced.
- advantages of B-trees:
- B-trees are always height balanced with all leaf nodes at the same level
- for data structures in persistent storage, e.g. hard disks:
update and search operations affect only a few disk blocks.
The fewer the number of disk blocks affected, the less disk I/O is required.
- B-trees keep related records on the same disk block, which helps to minimize disk I/O on searches due to locality of reference
- B-trees guarantee that every node in the tree will be full to at least a certain minimum percentage. This improves space efficiency while reducing the typical number of disk fetches necessary during a search or update operation.
- a 2-3 tree is a B-tree of order three.
- insert algorithm for m-order B-trees: (see slides)
- given new element x, search for x in B-tree and find leaf node where x needs to be inserted
- try inserting element x in the appropriate leaf, in correct order:
- before inserting, test the size of the leaf: if size < m-1, insert [done].
Otherwise, continue:
- if there are already m-1 elements in the leaf:
- insertion of new element x causes overflow: m elements in leaf total
- split leaf in two new leaf nodes + median element in the middle
- add median element to parent node, and assign the two new leaf nodes to parent node
- if adding median element to the parent node causes overflow,
continue with split until no overflow - or until the root itself splits!
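The split step above can be sketched in Java. This is a hedged, minimal illustration of only the overflow step for a leaf (the class and method names are hypothetical, not from the slides): given the sorted run of keys after the offending insert, produce the two new leaves and the median key to push up to the parent.

```java
import java.util.Arrays;

// Illustrative sketch: splitting an overfull B-tree leaf around its median.
// "keys" holds the sorted keys AFTER inserting x (one more than the leaf's
// capacity). The class/method names are assumptions for this sketch.
class LeafSplit {
    // Returns { leftKeys, { median }, rightKeys }.
    static int[][] split(int[] keys) {
        int mid = keys.length / 2;                          // index of the median key
        int[] left = Arrays.copyOfRange(keys, 0, mid);      // keys before the median
        int[] right = Arrays.copyOfRange(keys, mid + 1, keys.length); // keys after it
        return new int[][] { left, { keys[mid] }, right };  // median goes to the parent
    }
}
```

For example, with an overfull run {10, 20, 30, 40, 50}, the leaf splits into {10, 20} and {40, 50}, and the median 30 is added to the parent; if the parent overflows in turn, the same split repeats upward.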
- B+-trees and B*-trees are variants of B-trees.
- The B+-tree stores records only at the leaf nodes (internal nodes store key values as placeholders to guide the search);
(The B+-tree is essentially a mechanism for managing a sorted array-based list, where the list is broken into chunks. How big should the chunks be? For disk-based applications, since the data are on disk, it seems reasonable to store a chunk that is the size of a disk sector, or a small multiple of the disk sector size; a sector typically stores 4096 bytes.)
- The B*-tree is identical to the B+-tree, except for the rules used to split and merge nodes: instead of splitting a node in half when it overflows, the B*-tree gives some records to its neighboring sibling, if possible. If the sibling is also full, then these two nodes split into three.
- Time complexity: the asymptotic cost of search, insertion, and deletion of records from B-trees, B+-trees, and B*-trees is $\Theta(\log n)$, where $n$ is the total number of records in the tree. The base of the log is the (average) branching factor of the tree. Typical database applications use high branching factors (100 or more), so in practice the B-tree and its variants are shallow.
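A quick back-of-the-envelope check of why high branching factors make the tree shallow (the numbers below are illustrative assumptions, not from the notes): a tree with branching factor $b$ holding $n$ records has height about $\log_b n$.

```java
// Height of a b-ary tree over n records is roughly log_b(n).
// Example numbers (n = 10^9, b = 100 vs. b = 2) are illustrative only.
class BTreeHeight {
    static double height(double n, double b) {
        return Math.log(n) / Math.log(b);   // change of base: log_b(n)
    }
}
```

With a billion records, a branching factor of 100 gives a height of about 4.5, while a binary tree needs a height of about 30: four or five disk reads versus thirty.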
- Visualizing operations in a B tree & a B+ tree (from Data Structure Visualizations)
Locality of reference
- Locality of reference, or the principle of locality: accesses of data (on disk or in memory) are clustered in time (temporal locality) and space (spatial locality). Temporal locality refers to the reuse of specific data and/or resources within a relatively small time duration. Spatial locality says that if a particular memory location is referenced at a particular time, then it is likely that nearby memory locations will be referenced in the near future.
- Arrays are implemented as a sequence of consecutive memory locations. Therefore, if the indices used in successive array operations have locality, the corresponding memory addresses will also exhibit locality.
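A classic way to see this in code (a sketch, assuming a 2D array where Java stores each row contiguously): summing a matrix row by row touches consecutive addresses and exhibits spatial locality, while summing it column by column jumps between rows with a large stride. Both loops compute the same sum; on large arrays the row-major version is typically much faster because of cache behavior.

```java
// Spatial locality demo: same result, different access patterns.
class Locality {
    static long sumRowMajor(int[][] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[i].length; j++)
                s += a[i][j];           // consecutive addresses: good locality
        return s;
    }

    static long sumColMajor(int[][] a) {
        long s = 0;
        for (int j = 0; j < a[0].length; j++)
            for (int i = 0; i < a.length; i++)
                s += a[i][j];           // strided accesses: poor locality
        return s;
    }
}
```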
- B-trees were originally invented for storing data structures on disk, where locality is crucial for an application's performance.
- B-trees may also be useful for in-memory data structures, because these days main memory is almost as slow relative to the processor as disk drives were to main memory when B-trees were first introduced (today a memory access can be hundreds of times more costly than an arithmetic operation).
- Caches improve performance when memory accesses exhibit locality: accesses are clustered in time and space, so that reads from memory tend to request the same locations repeatedly, or even memory locations near previous requests. Caches are designed to work well with computations that exhibit locality; such computations have a high cache hit ratio. (read more about how caches work & cache-conscious data structures)
- Accessing time: CPU cache (nanoseconds; L1, L2, ...), DRAM (nanoseconds; slower than the slowest CPU cache), disk (milliseconds)
- Some slides: Locality and Caching; Impact of caches on programming
Spatial data structure: the K-D tree
- The k-d tree is a modification of the BST that allows for efficient processing of multidimensional keys. The k-d tree differs from the BST in that each level of the k-d tree makes branching decisions based on a particular search key associated with that level, called the discriminator.
- Searching a k-d tree for the record with a specified xy-coordinate is like searching a BST, except that each level of the k-d tree is associated with a particular discriminator (x or y).
Consider searching the above k-d tree for a record located at P=(69, 50).
Starting from the root (A), P does not match A, so the search needs to continue.
As the discriminator is x at this level, we compare P's x (69) to this node's x (40) and we should branch to the right subtree (all locations with x value greater than or equal to 40 are in the right subtree).
Then we compare P's y to C's y to decide which direction to go at level 2.
The search process continues until it finds a match or reaches a null pointer (no match).
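The exact-match search can be sketched as follows. This is a minimal 2-d version (the `KdNode` class and the sample points are assumptions for illustration, not the tree from the figure): the discriminator alternates with depth, x at even levels and y at odd levels, and ties branch right, matching the "greater than or equal" rule above.

```java
// Hypothetical 2-d tree node for illustration.
class KdNode {
    int[] p;                    // the point (x, y)
    KdNode left, right;
    KdNode(int x, int y) { p = new int[] { x, y }; }
}

class KdTree {
    // Exact-match search: depth % 2 selects the discriminator (0 -> x, 1 -> y).
    static KdNode search(KdNode node, int[] q, int depth) {
        if (node == null) return null;                     // no match
        if (node.p[0] == q[0] && node.p[1] == q[1]) return node;
        int d = depth % 2;                                 // discriminator at this level
        if (q[d] < node.p[d])
            return search(node.left, q, depth + 1);        // strictly smaller goes left
        else
            return search(node.right, q, depth + 1);       // greater or equal goes right
    }
}
```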
- Range search
Example: given a query location (66, 85), return all landmarks that are within a distance of 4.
How is it different from an exact-match search? At some nodes, the search may need to go into both branches (left and right).
- Read more about why k-d trees don't work with data of high dimensionality: the curse of dimensionality (rule of thumb: to use k-d trees, the number of points must be $\gg 2^d$).
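A hedged sketch of the range search (point values and names are illustrative, not the landmarks from the example): a subtree is visited whenever the query circle reaches its side of the splitting line, which is exactly when both branches may need to be explored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical 2-d tree node, as in the exact-match sketch.
class RangeNode {
    int[] p;
    RangeNode left, right;
    RangeNode(int x, int y) { p = new int[] { x, y }; }
}

class KdRange {
    // Collect all points within distance r of query q.
    static void search(RangeNode n, int[] q, double r, int depth, List<int[]> out) {
        if (n == null) return;
        double dx = n.p[0] - q[0], dy = n.p[1] - q[1];
        if (dx * dx + dy * dy <= r * r) out.add(n.p);   // point inside the circle
        int d = depth % 2;                              // discriminator at this level
        if (q[d] - r < n.p[d])                          // circle reaches the left side
            search(n.left, q, r, depth + 1, out);
        if (q[d] + r >= n.p[d])                         // circle reaches the right side
            search(n.right, q, r, depth + 1, out);
    }
}
```

Note the two `if`s are not exclusive: when the splitting line is closer to the query point than r, both subtrees are searched, which is why a range search can cost more than an exact-match search.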
Spatial data structure: the PR quadtree (for the bored)
In the PR quadtree (Point-Region Quadtree), each node either has exactly four children or is a leaf (i.e., the PR quadtree is a full 4-ary tree).
Another use of quadtrees is not for storing Point-Region data,
but for algorithms that use progressive refinements of heavy computation by working out details only where necessary: