# Bucketsort

## General information

**Algorithmic problem:** Sorting Sequences of Strings

**Type of algorithm:** loop

**Auxiliary data:**

- An ordered sequence of strings, which will eventually hold the overall result of the algorithm.
- The
**buckets**, that is, an array whose index range includes the ID range of (e.g. 0...127 suffices for all alphabets covered by ASCII) and whose components are ordered sequences of strings. - Let denote the maximum length of an input string.
- An array with index range and multisets of strings as components.

## Abstract view

**Invariant:** After iterations:

- For , contains all input strings of length .
- contains all other input strings, that is, the ones with length at least . The sequence is sorted according to the following definition of comparison: (resp. ) means that the substring of starting at position is lexicographically smaller (resp., smaller or equal) than the substring of starting at position .
- All buckets are empty.

**Variant:** increases by .

**Break condition:** iterations are completed.

## Induction basis

**Abstract view:** The auxiliary data must be initialized.

**Implementation:**

- .
- For each , set .
- For each , set .
- Move each string in to where denotes the length of .

**Proof:** Obvious.

## Induction step

**Abstract view:**

- Move each string in to .
- Afterwards: Append each string in at the tail of . It is essential that the strings are considered in the order in which they are stored in . (Now is empty.)
- For each in ascending order: append at the tail of and make empty.

**Implementation:** Obvious.

**Correctness:** We have to show that the relative order of each pair of strings, and , in the new sequence is correct. For that, we distinguish four cases:

- If , and are placed in different buckets in Step 3 of the -th iteration, so they are correctly ordered according to the character at position .
- If and both strings have length , their relative order is irrelevant at this stage, so nothing is to show.
- Next, consider the case that , one string (say, ) has length and the other one () has a length strictly greater than . Since the strings of length are placed in their buckets prior to the longer strings (cf. Steps 1 and 2), appears after , which is correct.
- Finally, consider the case that , and the lengths of both strings are larger than . Then the induction hypothesis guarantees the correct relative order due to the specific order in which the strings are considered in Step 2.

## Complexity

**Statement:** Let denote the total sum of all input string lengths. Then the asymptotic complexity is in in the best and the worst case.

**Proof:** Obviously, the preprocessing takes time.
In the main loop, each character of each string is read exactly once. Obviously, no operation is applied more often than the reading of single characters.