Sorting algorithms & applications
Applications
Sorting is one of the most frequently performed computing tasks, e.g.,
- Sort the records in a database so the collection can be searched efficiently.
- Sort the records by zip code so that we can print and mail them more cheaply.
- Sorting as an intrinsic part of an algorithm to solve some other problem, e.g., finding a minimum-cost spanning tree.
- ...
Sorting algorithms
Most sorting algorithms rely on comparisons of key values, except for some special algorithms such as Binsort, bucket sort & radix sort, which we'll see below.
Mergesort
A divide and conquer algorithm. Mergesort splits the list in half, sorts the halves, and then merges the sorted halves together. How does the merge work? The merge step compares the front elements of the two sublists and continually appends the smaller one to the output list until no input elements remain. Mergesort lends itself well to sorting a singly linked list because merging does not require random access to the list elements. Thus, Mergesort is the method of choice when the input is in the form of a linked list.
Cost of Mergesort? At each of the $\log n$ levels of recursion, $\Theta(n)$ work is done, for a total cost of $\Theta(n \log n)$. This cost is unaffected by the relative order of the values being sorted, so this analysis holds for the best, average, and worst cases.
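As a concrete illustration, here is a minimal array-based Mergesort sketch in Java (an illustrative version, not the course's reference code; the linked-list variant follows the same merge logic but splices nodes instead of copying):

    static void mergesort(int[] A, int[] temp, int left, int right) {
        if (left >= right) return;                  // a sublist of 0 or 1 elements is sorted
        int mid = (left + right) / 2;
        mergesort(A, temp, left, mid);              // sort the left half
        mergesort(A, temp, mid + 1, right);         // sort the right half
        for (int x = left; x <= right; x++)         // copy the two sorted halves
            temp[x] = A[x];
        int i = left, j = mid + 1;                  // front of each sorted sublist
        for (int k = left; k <= right; k++) {       // merge: repeatedly take the smaller front element
            if (i > mid)                 A[k] = temp[j++];  // left sublist exhausted
            else if (j > right)          A[k] = temp[i++];  // right sublist exhausted
            else if (temp[i] <= temp[j]) A[k] = temp[i++];
            else                         A[k] = temp[j++];
        }
    }
    // usage: mergesort(A, new int[A.length], 0, A.length - 1);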
Quicksort
- When properly implemented, quicksort is the fastest known general-purpose in-memory sorting algorithm in the average case.
- Time complexity
- Average case: $\Theta(n \log n)$
$T(n) = cn + \frac{1}{n}\sum_{k=0}^{n-1}[T(k) + T(n-1-k)], T(0) = T(1) = c$
At each partition step, the pivot is equally likely to end up at any position in the (sorted) array. In other words, the pivot is equally likely to break the array into partitions of sizes 0 and n-1, or 1 and n-2, and so on. The closed-form solution to this recurrence relation is $\Theta(n \log n)$ (Shaffer, page 260).
- Worst case: $\Theta(n^2)$. When does the worst case occur? Only when each pivot yields a bad partitioning of the array. If the pivot value is selected at random, or taken from the middle position, this is extremely unlikely to happen.
- Pseudocode
qsort(A, i, j) {
    if (j <= i) return                  //0 or 1 element: already sorted
    int pivot = (i+j)/2                 //middle position
    swap(A[pivot], A[j])                //swap the pivot value to the end of the array
    int k = partition(A, i-1, j, A[j])  //partition: place values in the correct partition
                                        //(values in each partition are not necessarily sorted)
    swap(A[k], A[j])                    //place the pivot value back in its correct position
    qsort(A, i, k-1)                    //further sort the left partition
    qsort(A, k+1, j)                    //further sort the right partition
}
- Partition is done cleverly using a left index and a right index, moving the indices inwards from the ends of the subarray and swapping values as necessary until the two indices meet. A demonstration of the partition algorithm is shown below (a Java sketch of this step follows the trace),
Initial         72  6 57 88 60 42 83 73 48 85   (pivot index: 4, pivot value: 60)
Swap to end     72  6 57 88 85 42 83 73 48 60   (pivot value swapped to the end of the array)
                 l                       r      (move the left & right indices inwards;
                                                 pause when A[l] > pivot and A[r] < pivot)
Swap 72 & 48    48  6 57 88 85 42 83 73 72 60
                          l     r               (keep moving the indices inwards, then pause)
Swap 88 & 42    48  6 57 42 85 88 83 73 72 60
                            lr                  (the left & right indices meet at k = 4)
Place pivot     48  6 57 42 60 88 83 73 72 85   (swap A[k] & A[j]: pivot back in its correct position)
The result: all values end up in the correct partition. All values to the left of position k are less than the pivot value (60), and all values to the right of position k are greater than or equal to the pivot value.
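A compact way to code the partition step just traced (a sketch following Shaffer's version; swap(A, p, q) exchanges A[p] and A[q], and the call is partition(A, i-1, j, A[j]) after the pivot has been moved to the end):

    static int partition(int[] A, int l, int r, int pivot) {
        do {                                    //move the bounds inwards until they meet
            while (A[++l] < pivot);             //move l right, past values < pivot
            while (r != 0 && A[--r] > pivot);   //move r left, past values > pivot
            swap(A, l, r);                      //swap the two out-of-place values
        } while (l < r);                        //stop when the indices cross
        swap(A, l, r);                          //reverse the final, wasted swap
        return l;                               //first position in the right partition (= k)
    }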
- See another quicksort pseudocode in AIZU (in which the last item in each subarray is chosen as the pivot element)
- Qsort.java
- More notes on quicksort
Heapsort
- Heapsort is based on the heap data structure. Advantages of using a heap: the tree is balanced (heaps are complete binary trees); the array representation is space efficient; and building the heap is relatively cheap.
- The algorithm works by repeatedly removing the maximum value (the element at position 0) and then restoring the heap property, until the heap is empty (see the code, and the sketch below).
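A minimal heapsort sketch in Java, assuming a max-heap stored directly in the array (siftDown is a helper that restores the heap property below position i):

    static void heapsort(int[] A) {
        int n = A.length;
        for (int i = n / 2 - 1; i >= 0; i--)   //build the heap: sift down every internal node
            siftDown(A, i, n);
        for (int end = n - 1; end > 0; end--) {
            swap(A, 0, end);                   //move the current maximum to the end
            siftDown(A, 0, end);               //restore the heap property on the rest
        }
    }

    static void siftDown(int[] A, int i, int n) {
        while (2 * i + 1 < n) {                //while node i has at least one child
            int child = 2 * i + 1;
            if (child + 1 < n && A[child + 1] > A[child]) child++;  //pick the larger child
            if (A[i] >= A[child]) return;      //heap property already holds
            swap(A, i, child);
            i = child;
        }
    }

    static void swap(int[] A, int i, int j) { int t = A[i]; A[i] = A[j]; A[j] = t; }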
- Heap Sort: visualization (external link).
- Time complexity of heapsort: building the heap takes $\Theta(n)$, and the $n$ remove-max operations take $\Theta(n \log n)$; so the cost is $\Theta(n \log n)$ in the best, average, and worst cases.
- It is not as fast as quicksort in the average case (by a constant factor).
- If we wish to find the $k$ largest (or smallest) elements in an array using the heapsort algorithm, what's the time complexity?
$\Theta(n^2)$ sorting algorithms
- Insertion sort: insertion sort iterates through a list of records; each record in turn is inserted at the correct position within a sorted list composed of the records already processed (a Java sketch follows).
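A minimal Java sketch of insertion sort (records shown as ints for simplicity):

    for (int i = 1; i < A.length; i++) {   //A[0..i-1] is already sorted
        int record = A[i];                 //next record to insert
        int j = i - 1;
        while (j >= 0 && A[j] > record) {  //shift larger sorted values one slot right
            A[j + 1] = A[j];
            j--;
        }
        A[j + 1] = record;                 //insert at the correct position
    }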
- Bubble sort repeatedly steps through the list to be sorted, compares each pair of adjacent items, and swaps them if they are in the wrong order. The pass through the list is repeated until no swaps are needed.
//bubble-up version: smaller values bubble toward the front of the array
//A is the array of records to be sorted
for (i = 0; i < A.length-1; i++)
    swapped = false
    for (j = A.length-1; j > i; j--)
        if (A[j] < A[j-1])
            swap(A[j], A[j-1])
            swapped = true
    if (!swapped) break   //no swaps in this pass: the array is already sorted
//sink-down version: larger values sink toward the end of the array
for (i = 0; i < A.length-1; i++)
    swapped = false
    for (j = 1; j < A.length-i; j++)   //start at 1 so A[j-1] stays in bounds
        if (A[j] < A[j-1])
            swap(A[j], A[j-1])
            swapped = true
    if (!swapped) break
- The selection sort algorithm sorts a list by repeatedly finding the minimum element (considering ascending order) in the unsorted part and putting it at the beginning of that part. The algorithm maintains two sublists: one sorted and one unsorted. In every iteration of selection sort, the minimum element of the unsorted sublist is picked and moved to the end of the sorted sublist (by swapping).
for (i = 0; i < A.length-1; i++)
    lowest = i                        //index of the lowest value seen so far
    for (j = A.length-1; j > i; j--)
        if (A[j] < A[lowest])
            lowest = j
    swap(A[i], A[lowest])             //move the minimum to position i
"Linear" time sorting algorithms
- Counting Sort
Counting Sort: visualization (external link).
Counting Sort: pseudocode and working example (external link).
- Sorting a permutation of the numbers 0 through $n - 1$ can be done by Binsort in $\Theta(n)$ time. Binsort assigns records to bins according to their key values (no comparisons are involved).
- Binsorting any collection of records whose key values fall in the range from 0 to MaxKeyValue-1 takes $\Theta(n + MaxKeyValue)$ time (Binsort must look at each of the bins, to see whether it contains a record, to finish the sorting task); a sketch follows. If MaxKeyValue is much greater than $n$, this results in a poor sorting algorithm (e.g., when $MaxKeyValue = n^2$).
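A Binsort sketch in Java, assuming integer keys in the range 0 to MaxKeyValue-1 (for a permutation of 0 through n-1, the single statement B[A[i]] = A[i] inside one loop already suffices):

    import java.util.ArrayList;
    import java.util.List;

    static int[] binsort(int[] A, int maxKeyValue) {
        List<List<Integer>> bins = new ArrayList<>();
        for (int b = 0; b < maxKeyValue; b++)      //set up one bin per key value
            bins.add(new ArrayList<>());
        for (int key : A)                          //Theta(n): assign by key, no comparisons
            bins.get(key).add(key);
        int[] out = new int[A.length];
        int i = 0;
        for (List<Integer> bin : bins)             //Theta(n + MaxKeyValue): visit every bin
            for (int key : bin)
                out[i++] = key;
        return out;
    }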
- A further generalization of Binsort yields a bucket sort, in which each bin is associated not with just one key, but with a range of key values. A bucket sort assigns records to bins and then relies on some other sorting technique to sort the records within each bin. E.g., to distribute a collection of records with keys in the range 0 to 99 into 10 bins, we can assign each record to a bin by taking its key value / 10 (so each bin holds a range of ten consecutive key values, and concatenating the sorted bins yields the sorted list).
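A bucket sort sketch under the same assumptions (keys 0 to 99, 10 bins; any other sorting technique can be used within a bin, and Collections.sort is borrowed here for brevity):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    static int[] bucketSort(int[] A) {                 //assumes keys in 0..99
        List<List<Integer>> bins = new ArrayList<>();
        for (int b = 0; b < 10; b++)
            bins.add(new ArrayList<>());
        for (int key : A)
            bins.get(key / 10).add(key);               //bin b holds the range 10b..10b+9
        int[] out = new int[A.length];
        int i = 0;
        for (List<Integer> bin : bins) {
            Collections.sort(bin);                     //sort the records within each bin
            for (int key : bin)
                out[i++] = key;                        //concatenating bins yields sorted order
        }
        return out;
    }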
- Bucket Sort: visualization (external link).
- In Radix Sort, the bin computations are based on the radix or the base of the key values.
Radix Sort requires $k$ passes (one per digit) over the list of $n$ $k$-digit numbers in base $r$, with $\Theta(n + r)$ work done at each pass. Thus the total work is $\Theta(nk + rk)$.
Is it reasonable to treat $k$ as a constant? Assuming all keys are unique, at least $\Omega(\log n)$ digits (within a constant factor) are required to distinguish between the $n$ distinct keys, so $k$ is in $\Omega(\log n)$. This yields an asymptotic complexity of $\Omega(n \log n)$ for Radix Sort to process $n$ distinct key values; a sketch follows the link below.
- Radix Sort: visualization (external link).
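A least-significant-digit Radix Sort sketch in Java for base r = 10 (each pass is a stable counting-style Binsort on one digit, so each pass costs $\Theta(n + r)$):

    static void radixSort(int[] A, int k) {            //k = number of digits in the keys
        final int r = 10;                              //the radix (base) of the key values
        int[] out = new int[A.length];
        int[] count = new int[r];
        for (int pass = 0, div = 1; pass < k; pass++, div *= r) {
            java.util.Arrays.fill(count, 0);
            for (int key : A)                          //count the keys falling in each bin
                count[(key / div) % r]++;
            for (int b = 1; b < r; b++)                //prefix sums: end position of each bin
                count[b] += count[b - 1];
            for (int i = A.length - 1; i >= 0; i--)    //place keys stably, right to left
                out[--count[(A[i] / div) % r]] = A[i];
            System.arraycopy(out, 0, A, 0, A.length);  //copy back for the next pass
        }
    }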
Comparison of sorting algorithms
Asymptotic time complexity

             | Selection/insertion/bubble | Quicksort                | Mergesort          | Heapsort           | Radix sort
Best case    | $\Theta(n^2)$              | $\Theta(n \log n)$       | $\Theta(n \log n)$ | $\Theta(n \log n)$ | $\Theta(nk + rk)$
Average case | $\Theta(n^2)$              | $\Theta(n \log n)$       | $\Theta(n \log n)$ | $\Theta(n \log n)$ | $\Theta(nk + rk)$
Worst case   | $\Theta(n^2)$              | $\Theta(n^2)$ (unlikely) | $\Theta(n \log n)$ | $\Theta(n \log n)$ | $\Theta(nk + rk)$

(Note: insertion sort, and bubble sort with the swapped flag, actually finish in $\Theta(n)$ on an already-sorted input; the $\Theta(n^2)$ best case above refers to the basic versions.)
Lower bounds for comparison algorithms
- No algorithm can be more efficient than the problem's I/O time. From this we see that the sorting problem cannot be solved by any algorithm in less than $\Omega(n)$ time, because it takes at least $n$ steps to read and write the $n$ values to be sorted. In other words, any sorting algorithm must at least look at every input value to recognize whether the input is in sorted order.
- No sorting algorithm based on key comparisons can possibly be faster than $\Omega(n \log n)$ in the worst case.
- Based on our current knowledge of sorting algorithms and the size of the input, we know that the problem of sorting is bounded by $\Omega(n)$ and $O(n \log n)$.
Side note: why do we care about comparison operations in sorting algorithms? We would like to evaluate the time complexity of sorting algorithms even when we count only the comparison operations (<, <=, >, >=, ==) as basic operations. If we consider only comparisons, we can visualize such an algorithm as a "decision tree" of all possible comparison outcomes.
- For example, for a binary search of a value x within a sorted array A containing n elements, we could visualize all the possible choices as a decision tree representing the algorithm (picture the tree for n = 3). When we consider only comparisons, one run of the algorithm is equivalent to traversing the tree from the root to a leaf. In general, the number of possible answers is >= n, so the tree has at least n leaves. A binary tree with at least n leaves has height >= log n, so we can take log n as a lower bound (can't be less than): the worst-case running time of the binary search algorithm is $\Omega(\log n)$.
- What about the worst-case running time of a sorting algorithm? (We could visualize the sorting algorithm in the same way, but it would be a much bigger tree!) In general, the number of possible answers for sorting is >= n!, i.e., the number of possible permutations of n elements, which is the number of leaves in the decision tree of any sorting algorithm that uses comparisons. The height of this decision tree is therefore >= log(n!), so we can take log(n!) as a lower bound (can't be less than) on the number of comparisons needed to sort n elements, i.e., the lower bound is $\Omega(n \log n)$.
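A quick way to see that $\log(n!)$ really is $\Theta(n \log n)$, without Stirling's formula: the last $n/2$ factors of $n!$ are each at least $n/2$, so

$$\log_2(n!) = \sum_{i=1}^{n} \log_2 i \;\ge\; \sum_{i=n/2}^{n} \log_2 i \;\ge\; \frac{n}{2}\log_2\frac{n}{2} = \Omega(n \log n),$$

and since each factor is at most $n$, $\log_2(n!) \le n \log_2 n = O(n \log n)$. Hence any comparison-based sort needs $\Omega(n \log n)$ comparisons in the worst case.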