# Bucketsort

Bucketsort

## General information

Algorithmic problem: Sorting Sequences of Strings

Type of algorithm: loop

Auxiliary data:

1. An ordered sequence $S'$ of strings, which will eventually hold the overall result of the algorithm.
2. The buckets, that is, an array $B$ whose index range includes the ID range of $\Sigma$ (e.g. 0...127 suffices for all alphabets covered by ASCII) and whose components are ordered sequences of strings.
3. Let $N$ denote the maximum length of an input string.
4. An array $A$ with index range $[1,\dots,N]$ and multisets of strings as components.

## Abstract view

Invariant: After $i \geq 0$ iterations:

1. For $j \in \{1,\dots,N-i\}$, $A[j]$ contains all input strings of length $j$.
2. $S'$ contains all other input strings, that is, the ones with length at least $N - i + 1$. The sequence $S'$ is sorted according to the following definition of comparison: $str1 \prec str2$ (resp. $str1 \preceq str2$) means that the substring of $str1$ starting at position $N - i + 1$ is lexicographically smaller (resp., smaller or equal) than the substring of $str2$ starting at position $N - i + 1$.
3. All buckets are empty.

Variant: $i$ increases by $1$.

Break condition: $N$ iterations are completed.

## Induction basis

Abstract view: The auxiliary data must be initialized.

Implementation:

1. $S' := \emptyset$.
2. For each $c \in \Sigma$, set $B[c] := \emptyset$.
3. For each $i \in \{1,\dots,N\}$, set $A[i] := \emptyset$.
4. Move each string $str$ in $S$ to $A[l(str)]$ where $l(str)$ denotes the length of $str$.

Proof: Obvious.

## Induction step

Abstract view:

1. Move each string $str$ in $A[N-i+1]$ to $B[str[N-i+1]]$.
2. Afterwards: Append each string $str$ in $S'$ at the tail of $B[str[N - i + 1]]$. It is essential that the strings are considered in the order in which they are stored in $S'$. (Now $S'$ is empty.)
3. For each $c \in \Sigma$ in ascending order: append $B[c]$ at the tail of $S'$ and make $B[c]$ empty.

Implementation: Obvious.

Correctness: We have to show that the relative order of each pair of strings, $str1$ and $str2$, in the new sequence $S'$ is correct. For that, we distinguish four cases:

1. If $str1[N-i+1] \neq str2[N-i+1]$, $str1$ and $str2$ are placed in different buckets in Step 3 of the $i$-th iteration, so they are correctly ordered according to the character at position $N - i + 1$.
2. If $str1[N-i+1] = str2[N-i+1]$ and both strings have length $N - i + 1$, their relative order is irrelevant at this stage, so nothing is to show.
3. Next, consider the case that $str1[N-i+1] = str2[N-i+1]$, one string (say, $str1$) has length $N - i + 1$ and the other one ($str2$) has a length strictly greater than $N - i + 1$. Since the strings of length $N - i + 1$ are placed in their buckets prior to the longer strings (cf. Steps 1 and 2), $str2$ appears after $str1$, which is correct.
4. Finally, consider the case that $str1[N-i+1] = str2[N-i+1]$, and the lengths of both strings are larger than $N - i + 1$. Then the induction hypothesis guarantees the correct relative order due to the specific order in which the strings are considered in Step 2.

## Complexity

Statement: Let $M$ denote the total sum of all input string lengths. Then the asymptotic complexity is in $\Theta(M)$ in the best and the worst case.

Proof: Obviously, the preprocessing takes $O(M)$ time. In the main loop, each character of each string is read exactly once. Obviously, no operation is applied more often than the reading of single characters.