Bucketsort
General information
Algorithmic problem: Sorting sequences of strings
Type of algorithm: loop
Axiliary data:
- An ordered sequence [math]\displaystyle{ S' }[/math] of strings, which will eventually hold the overall result of the algorithm.
- The buckets, that is, an array [math]\displaystyle{ B }[/math] whose index range contains the ID range of [math]\displaystyle{ \Sigma }[/math] (e.g. 26 for the alphabet) and whose components are ordered sequences of strings.
- Let [math]\displaystyle{ N }[/math] denote the maximum length of an input string.
- An array [math]\displaystyle{ A }[/math] with index range [math]\displaystyle{ [1,\dots,N] }[/math] and multisets of strings as components.
Abstract view
Invariant: After [math]\displaystyle{ i \geq 0 }[/math] iterations:
- For [math]\displaystyle{ j \in \{1,\dots,N-i\} }[/math], [math]\displaystyle{ A[j] }[/math] contains all input strings of length [math]\displaystyle{ j }[/math].
- [math]\displaystyle{ S' }[/math] contains all other input strings, that is, the ones with length at least [math]\displaystyle{ N - i + 1 }[/math]. The sequence [math]\displaystyle{ S' }[/math] is sorted according to the following definition of comparison: [math]\displaystyle{ str1 \lt str2 }[/math] (resp. [math]\displaystyle{ str1 \leq str2 }[/math]) means that the substring of [math]\displaystyle{ str1 }[/math] starting at position [math]\displaystyle{ N - i + 1 }[/math] is lexicographically smaller (resp., smaller or equal) than the substring of [math]\displaystyle{ str2 }[/math] starting at position [math]\displaystyle{ N - i + 1 }[/math].
- All buckets are empty.
Variant: [math]\displaystyle{ i }[/math] increases by [math]\displaystyle{ 1 }[/math].
Break condition: [math]\displaystyle{ i = N }[/math].
Induction basis
Abstract view: The auxiliary data must be initialized.
Implementation:
- [math]\displaystyle{ S' := \emptyset }[/math]
- For each [math]\displaystyle{ c \in \Sigma }[/math], set [math]\displaystyle{ B[c] := \emptyset }[/math]
- For each [math]\displaystyle{ i \in \{1,\dots,N\} }[/math], set [math]\displaystyle{ A[i] := \emptyset }[/math]
- Move each string [math]\displaystyle{ str }[/math] in [math]\displaystyle{ S }[/math] to [math]\displaystyle{ A[l(str)] }[/math] where [math]\displaystyle{ l(str) }[/math] denotes the length of [math]\displaystyle{ str }[/math].
Proof: Obvious.
Induction step
Abstract view:
- Move each string [math]\displaystyle{ str }[/math] in [math]\displaystyle{ A[N-i+1] }[/math] to [math]\displaystyle{ B[str[N-i+1]] }[/math]
- Afterwards: Append each string [math]\displaystyle{ str }[/math] in [math]\displaystyle{ S' }[/math] at the tail of [math]\displaystyle{ B[str[N - i + 1]] }[/math]. It is essential that the strings are considered in the order in which they are stored in [math]\displaystyle{ S' }[/math]. (Now [math]\displaystyle{ S' }[/math] is empty.)
- For each [math]\displaystyle{ c \in \Sigma }[/math] in ascending order: append [math]\displaystyle{ B[c] }[/math] at the tail of [math]\displaystyle{ S' }[/math] and make [math]\displaystyle{ B[c] }[/math] empty.
Implementation: Obvious
Correctness: We have to show that the relative order of each pair of strings, [math]\displaystyle{ str1 }[/math] and [math]\displaystyle{ str2 }[/math], in the new sequence [math]\displaystyle{ S' }[/math] are correct. For that, we distinguish between four cases:
- If [math]\displaystyle{ str1[N-i+1] \neq str2[N-i+1] }[/math], [math]\displaystyle{ str1 }[/math] and [math]\displaystyle{ str2 }[/math] are placed in different buckets in Step 3 of the [math]\displaystyle{ i }[/math]-th iteration, so they are correctly ordered according to the character at position [math]\displaystyle{ N - i + 1 }[/math].
- If [math]\displaystyle{ str1[N-i+1] = str2[N-i+1] }[/math] and both strings have length [math]\displaystyle{ N - i + 1 }[/math], their relative order is irrelevant, so nothing is to show.
- Next, consider the case that [math]\displaystyle{ str1[N-i+1] = str2[N-i+1] }[/math], one string (say, [math]\displaystyle{ str1 }[/math]) has length [math]\displaystyle{ N - i + 1 }[/math] and the other one ([math]\displaystyle{ str2 }[/math]) has a length strictly greater than [math]\displaystyle{ N - i + 1 }[/math]. Since the strings of length [math]\displaystyle{ N - i + 1 }[/math] are placed in their buckets prior to the longer strings (cf. Steps 1 and 2), [math]\displaystyle{ str2 }[/math] appears after [math]\displaystyle{ str1 }[/math], which is correct.
- Finally, consider the case that [math]\displaystyle{ str1[N-i+1] = str2[N-i+1] }[/math], and the lengths of both strings are larger than [math]\displaystyle{ N - i + 1 }[/math]. Then the induction hypothesis guarantees the correct relative order due to the specific order in which the strings are considered in Step 2.
Complexity
Statement: Let [math]\displaystyle{ M }[/math] denote the total sum of all input string lengths. Then the asymptotic complexity is in [math]\displaystyle{ \Theta(M) }[/math] in the best and the worst case.
Proof: Obviously, the preprocessing takes [math]\displaystyle{ O(M) }[/math] time. In the main loop, each character of each string is read exactly once. Obviously, no operation is applied more often than the reading of single characters.