Bucketsort: Difference between revisions

From Algowiki
Jump to navigation Jump to search
No edit summary
 
(13 intermediate revisions by 4 users not shown)
Line 1: Line 1:
[[Category:Videos]]
[[Category:Sorting Algorithms]]
[[Category:Sorting Algorithms]]
<div class="plainlinks" style="float:right;margin:0 0 5px 5px; border:1px solid #AAAAAA; width:auto; padding:1em; margin: 0px 0px 1em 1em;">
{{#ev:youtube|https://www.youtube.com/watch?v=-POIDU_ew98|500|right|Bucketsort|frame}}
<div style="font-size: 1.8em;font-weight:bold;text-align: center;margin:0.2em 0 1em 0">Bucket Sort</div>
 
<div style="font-size: 1.2em; margin:.5em 0 1em 0; text-align:center">whatever</div>
 
<div style="font-size: 1.2em; margin:.5em 0 .5em 0;text-align:center">[[File:olw_logo1.png|20px]][https://openlearnware.tu-darmstadt.de/#!/resource/bucket-sort-1941 Openlearnware]</div>
</div>
 
== General information ==
== General information ==


'''Algorithmic problem:''' [[Sorting sequences of strings]]
'''Algorithmic problem:''' [[Sorting Sequences of Strings]]


'''Type of algorithm:''' loop
'''Type of algorithm:''' loop


'''Axiliary data:'''
'''Auxiliary data:'''
# An ordered sequence <math>S'</math> of strings, which will eventually hold the overall result of the algorithm.
# An ordered sequence <math>S'</math> of strings, which will eventually hold the overall result of the algorithm.
# The '''buckets''', that is, an array <math>B</math> whose index range contains the ID range of <math>\Sigma</math> (e.g. 26 for the alphabet) and whose components are [[Sets and sequences|ordered sequences]] of strings.
# The '''buckets''', that is, an array <math>B</math> whose index range includes the ID range of <math>\Sigma</math> (e.g. 0...127 suffices for all alphabets covered by ASCII) and whose components are [[Sets and sequences|ordered sequences]] of strings.
# Let <math>N</math> denote the maximum length of an input string.
# Let <math>N</math> denote the maximum length of an input string.
# An array <math>A</math> with index range <math>[1,\dots,N]</math> and [[Sets and sequences|multisets]] of strings as components.
# An array <math>A</math> with index range <math>[1,\dots,N]</math> and [[Sets and sequences|multisets]] of strings as components.
Line 24: Line 18:
'''Invariant:''' After <math>i \geq 0</math> iterations:
'''Invariant:''' After <math>i \geq 0</math> iterations:
# For <math>j \in \{1,\dots,N-i\}</math>, <math>A[j]</math> contains all input strings of length <math>j</math>.
# For <math>j \in \{1,\dots,N-i\}</math>, <math>A[j]</math> contains all input strings of length <math>j</math>.
# <math>S'</math> contains all other input strings, that is, the ones with length at least <math>N - i + 1</math>. The sequence <math>S'</math> is sorted according to the following definition of comparison: <math>str1 < str2</math> (resp. <math>str1 \leq str2</math>) means that the substring of <math>str1</math> starting at position <math>N - i + 1</math> is lexicographically smaller (resp., smaller or equal) than the substring of <math>str2</math> starting at position <math>N - i + 1</math>.
# <math>S'</math> contains all other input strings, that is, the ones with length at least <math>N - i + 1</math>. The sequence <math>S'</math> is sorted according to the following definition of comparison: <math>str1 \prec str2</math> (resp. <math>str1 \preceq str2</math>) means that the substring of <math>str1</math> starting at position <math>N - i + 1</math> is lexicographically smaller (resp., smaller or equal) than the substring of <math>str2</math> starting at position <math>N - i + 1</math>.
# All buckets are empty.
# All buckets are empty.


'''Variant:''' <math>i</math> increases by <math>1</math>
'''Variant:''' <math>i</math> increases by <math>1</math>.


'''Break condition: <math>i = N</math>.
'''Break condition:''' <math>N</math> iterations are completed.


== Induction basis ==
== Induction basis ==


'''Abstract view:''' The auxiliary data must be initialized.
'''Abstract view:''' The auxiliary data must be initialized.


'''Implementation:'''
'''Implementation:'''
# <math>S' := \emptyset</math>
# <math>S' := \emptyset</math>.
# For each <math>c \in \Sigma</math>, set <math>B[c] := \emptyset</math>
# For each <math>c \in \Sigma</math>, set <math>B[c] := \emptyset</math>.
# For each <math>i \in \{1,\dots,N\}, set <math>A[i] := \emptyset</math>
# For each <math>i \in \{1,\dots,N\}</math>, set <math>A[i] := \emptyset</math>.
# Move each string <math>str</math> in <math>S</math> to <math>A[l(str)]</math> where <math>l(str)</math> denotes the length of <math>str</math>.
# Move each string <math>str</math> in <math>S</math> to <math>A[l(str)]</math> where <math>l(str)</math> denotes the length of <math>str</math>.


Line 45: Line 38:


== Induction step ==
== Induction step ==
'''Abstract view:'''
# Move each string <math>str</math> in <math>A[N-i+1]</math> to <math>B[str[N-i+1]]</math>.
# Afterwards: Append each string <math>str</math> in <math>S'</math> at the tail of <math>B[str[N - i + 1]]</math>. It is essential that the strings are considered in the order in which they are stored in <math>S'</math>. (Now <math>S'</math> is empty.)
# For each <math>c \in \Sigma</math> in ascending order: append <math>B[c]</math> at the tail of <math>S'</math> and make <math>B[c]</math> empty.
'''Implementation:''' Obvious.
'''Correctness:''' We have to show that the relative order of each pair of strings, <math>str1</math> and <math>str2</math>, in the new sequence <math>S'</math> is correct. For that, we distinguish four cases:
# If <math>str1[N-i+1] \neq str2[N-i+1]</math>, <math>str1</math> and <math>str2</math> are placed in different buckets in Step 3 of the <math>i</math>-th iteration, so they are correctly ordered according to the character at position <math>N - i + 1</math>.
# If <math>str1[N-i+1] = str2[N-i+1]</math> and both strings have length <math>N - i + 1</math>, their relative order is irrelevant at this stage, so nothing is to show.
# Next, consider the case that <math>str1[N-i+1] = str2[N-i+1]</math>, one string (say, <math>str1</math>) has length <math>N - i + 1</math> and the other one (<math>str2</math>) has a length strictly greater than <math>N - i + 1</math>. Since the strings of length <math>N - i + 1</math> are placed in their buckets prior to the longer strings (cf. Steps 1 and 2), <math>str2</math> appears after <math>str1</math>, which is correct.
# Finally, consider the case that <math>str1[N-i+1] = str2[N-i+1]</math>, and the lengths of both strings are larger than <math>N - i + 1</math>. Then the induction hypothesis guarantees the correct relative order due to the specific order in which the strings are considered in Step 2.


== Complexity ==
== Complexity ==
'''Statement:''' Let <math>M</math> denote the total sum of all input string lengths. Then the asymptotic complexity is in <math>\Theta(M)</math> in the best and the worst case.
'''Statement:''' Let <math>M</math> denote the total sum of all input string lengths. Then the asymptotic complexity is in <math>\Theta(M)</math> in the best and the worst case.


'''Proof:''' Obviously, the preprocessing takes <mathO(M)</math> time.
'''Proof:''' Obviously, the preprocessing takes <math>O(M)</math> time.
In the main loop, each character of each string is read exactly once. Obviously, no operation is applied more often than the reading of single characters.
In the main loop, each character of each string is read exactly once. Obviously, no operation is applied more often than the reading of single characters.

Latest revision as of 22:51, 19 June 2015

Bucketsort

General information

Algorithmic problem: Sorting Sequences of Strings

Type of algorithm: loop

Auxiliary data:

  1. An ordered sequence [math]\displaystyle{ S' }[/math] of strings, which will eventually hold the overall result of the algorithm.
  2. The buckets, that is, an array [math]\displaystyle{ B }[/math] whose index range includes the ID range of [math]\displaystyle{ \Sigma }[/math] (e.g. 0...127 suffices for all alphabets covered by ASCII) and whose components are ordered sequences of strings.
  3. Let [math]\displaystyle{ N }[/math] denote the maximum length of an input string.
  4. An array [math]\displaystyle{ A }[/math] with index range [math]\displaystyle{ [1,\dots,N] }[/math] and multisets of strings as components.

Abstract view

Invariant: After [math]\displaystyle{ i \geq 0 }[/math] iterations:

  1. For [math]\displaystyle{ j \in \{1,\dots,N-i\} }[/math], [math]\displaystyle{ A[j] }[/math] contains all input strings of length [math]\displaystyle{ j }[/math].
  2. [math]\displaystyle{ S' }[/math] contains all other input strings, that is, the ones with length at least [math]\displaystyle{ N - i + 1 }[/math]. The sequence [math]\displaystyle{ S' }[/math] is sorted according to the following definition of comparison: [math]\displaystyle{ str1 \prec str2 }[/math] (resp. [math]\displaystyle{ str1 \preceq str2 }[/math]) means that the substring of [math]\displaystyle{ str1 }[/math] starting at position [math]\displaystyle{ N - i + 1 }[/math] is lexicographically smaller (resp., smaller or equal) than the substring of [math]\displaystyle{ str2 }[/math] starting at position [math]\displaystyle{ N - i + 1 }[/math].
  3. All buckets are empty.

Variant: [math]\displaystyle{ i }[/math] increases by [math]\displaystyle{ 1 }[/math].

Break condition: [math]\displaystyle{ N }[/math] iterations are completed.

Induction basis

Abstract view: The auxiliary data must be initialized.

Implementation:

  1. [math]\displaystyle{ S' := \emptyset }[/math].
  2. For each [math]\displaystyle{ c \in \Sigma }[/math], set [math]\displaystyle{ B[c] := \emptyset }[/math].
  3. For each [math]\displaystyle{ i \in \{1,\dots,N\} }[/math], set [math]\displaystyle{ A[i] := \emptyset }[/math].
  4. Move each string [math]\displaystyle{ str }[/math] in [math]\displaystyle{ S }[/math] to [math]\displaystyle{ A[l(str)] }[/math] where [math]\displaystyle{ l(str) }[/math] denotes the length of [math]\displaystyle{ str }[/math].

Proof: Obvious.

Induction step

Abstract view:

  1. Move each string [math]\displaystyle{ str }[/math] in [math]\displaystyle{ A[N-i+1] }[/math] to [math]\displaystyle{ B[str[N-i+1]] }[/math].
  2. Afterwards: Append each string [math]\displaystyle{ str }[/math] in [math]\displaystyle{ S' }[/math] at the tail of [math]\displaystyle{ B[str[N - i + 1]] }[/math]. It is essential that the strings are considered in the order in which they are stored in [math]\displaystyle{ S' }[/math]. (Now [math]\displaystyle{ S' }[/math] is empty.)
  3. For each [math]\displaystyle{ c \in \Sigma }[/math] in ascending order: append [math]\displaystyle{ B[c] }[/math] at the tail of [math]\displaystyle{ S' }[/math] and make [math]\displaystyle{ B[c] }[/math] empty.

Implementation: Obvious.

Correctness: We have to show that the relative order of each pair of strings, [math]\displaystyle{ str1 }[/math] and [math]\displaystyle{ str2 }[/math], in the new sequence [math]\displaystyle{ S' }[/math] is correct. For that, we distinguish four cases:

  1. If [math]\displaystyle{ str1[N-i+1] \neq str2[N-i+1] }[/math], [math]\displaystyle{ str1 }[/math] and [math]\displaystyle{ str2 }[/math] are placed in different buckets in Step 3 of the [math]\displaystyle{ i }[/math]-th iteration, so they are correctly ordered according to the character at position [math]\displaystyle{ N - i + 1 }[/math].
  2. If [math]\displaystyle{ str1[N-i+1] = str2[N-i+1] }[/math] and both strings have length [math]\displaystyle{ N - i + 1 }[/math], their relative order is irrelevant at this stage, so nothing is to show.
  3. Next, consider the case that [math]\displaystyle{ str1[N-i+1] = str2[N-i+1] }[/math], one string (say, [math]\displaystyle{ str1 }[/math]) has length [math]\displaystyle{ N - i + 1 }[/math] and the other one ([math]\displaystyle{ str2 }[/math]) has a length strictly greater than [math]\displaystyle{ N - i + 1 }[/math]. Since the strings of length [math]\displaystyle{ N - i + 1 }[/math] are placed in their buckets prior to the longer strings (cf. Steps 1 and 2), [math]\displaystyle{ str2 }[/math] appears after [math]\displaystyle{ str1 }[/math], which is correct.
  4. Finally, consider the case that [math]\displaystyle{ str1[N-i+1] = str2[N-i+1] }[/math], and the lengths of both strings are larger than [math]\displaystyle{ N - i + 1 }[/math]. Then the induction hypothesis guarantees the correct relative order due to the specific order in which the strings are considered in Step 2.

Complexity

Statement: Let [math]\displaystyle{ M }[/math] denote the total sum of all input string lengths. Then the asymptotic complexity is in [math]\displaystyle{ \Theta(M) }[/math] in the best and the worst case.

Proof: Obviously, the preprocessing takes [math]\displaystyle{ O(M) }[/math] time. In the main loop, each character of each string is read exactly once. Obviously, no operation is applied more often than the reading of single characters.