Bucketsort

From Algowiki
Jump to: navigation, search
Bucketsort

General information

Algorithmic problem: Sorting Sequences of Strings

Type of algorithm: loop

Auxiliary data:

  1. An ordered sequence [math]S'[/math] of strings, which will eventually hold the overall result of the algorithm.
  2. The buckets, that is, an array [math]B[/math] whose index range includes the ID range of [math]\Sigma[/math] (e.g. 0...127 suffices for all alphabets covered by ASCII) and whose components are ordered sequences of strings.
  3. Let [math]N[/math] denote the maximum length of an input string.
  4. An array [math]A[/math] with index range [math][1,\dots,N][/math] and multisets of strings as components.

Abstract view

Invariant: After [math]i \geq 0[/math] iterations:

  1. For [math]j \in \{1,\dots,N-i\}[/math], [math]A[j][/math] contains all input strings of length [math]j[/math].
  2. [math]S'[/math] contains all other input strings, that is, the ones with length at least [math]N - i + 1[/math]. The sequence [math]S'[/math] is sorted according to the following definition of comparison: [math]str1 \prec str2[/math] (resp. [math]str1 \preceq str2[/math]) means that the substring of [math]str1[/math] starting at position [math]N - i + 1[/math] is lexicographically smaller (resp., smaller or equal) than the substring of [math]str2[/math] starting at position [math]N - i + 1[/math].
  3. All buckets are empty.

Variant: [math]i[/math] increases by [math]1[/math].

Break condition: [math]N[/math] iterations are completed.

Induction basis

Abstract view: The auxiliary data must be initialized.

Implementation:

  1. [math]S' := \emptyset[/math].
  2. For each [math]c \in \Sigma[/math], set [math]B[c] := \emptyset[/math].
  3. For each [math]i \in \{1,\dots,N\}[/math], set [math]A[i] := \emptyset[/math].
  4. Move each string [math]str[/math] in [math]S[/math] to [math]A[l(str)][/math] where [math]l(str)[/math] denotes the length of [math]str[/math].

Proof: Obvious.

Induction step

Abstract view:

  1. Move each string [math]str[/math] in [math]A[N-i+1][/math] to [math]B[str[N-i+1]][/math].
  2. Afterwards: Append each string [math]str[/math] in [math]S'[/math] at the tail of [math]B[str[N - i + 1]][/math]. It is essential that the strings are considered in the order in which they are stored in [math]S'[/math]. (Now [math]S'[/math] is empty.)
  3. For each [math]c \in \Sigma[/math] in ascending order: append [math]B[c][/math] at the tail of [math]S'[/math] and make [math]B[c][/math] empty.

Implementation: Obvious.

Correctness: We have to show that the relative order of each pair of strings, [math]str1[/math] and [math]str2[/math], in the new sequence [math]S'[/math] is correct. For that, we distinguish four cases:

  1. If [math]str1[N-i+1] \neq str2[N-i+1][/math], [math]str1[/math] and [math]str2[/math] are placed in different buckets in Step 3 of the [math]i[/math]-th iteration, so they are correctly ordered according to the character at position [math]N - i + 1[/math].
  2. If [math]str1[N-i+1] = str2[N-i+1][/math] and both strings have length [math]N - i + 1[/math], their relative order is irrelevant at this stage, so nothing is to show.
  3. Next, consider the case that [math]str1[N-i+1] = str2[N-i+1][/math], one string (say, [math]str1[/math]) has length [math]N - i + 1[/math] and the other one ([math]str2[/math]) has a length strictly greater than [math]N - i + 1[/math]. Since the strings of length [math]N - i + 1[/math] are placed in their buckets prior to the longer strings (cf. Steps 1 and 2), [math]str2[/math] appears after [math]str1[/math], which is correct.
  4. Finally, consider the case that [math]str1[N-i+1] = str2[N-i+1][/math], and the lengths of both strings are larger than [math]N - i + 1[/math]. Then the induction hypothesis guarantees the correct relative order due to the specific order in which the strings are considered in Step 2.

Complexity

Statement: Let [math]M[/math] denote the total sum of all input string lengths. Then the asymptotic complexity is in [math]\Theta(M)[/math] in the best and the worst case.

Proof: Obviously, the preprocessing takes [math]O(M)[/math] time. In the main loop, each character of each string is read exactly once. Obviously, no operation is applied more often than the reading of single characters.