Simple string matching algorithm: Difference between revisions

From Algowiki
Latest revision as of 10:13, 21 September 2015

Simple string matching

Algorithmic problem: One-dimensional string matching

Abstract view

Type of algorithm: loop

Auxiliary data: An ordered sequence [math]\displaystyle{ I }[/math] of natural numbers, which holds the start indexes of the current candidates.

Invariant: After [math]\displaystyle{ i\ge 0 }[/math] iterations:

  1. In ascending order, [math]\displaystyle{ R }[/math] contains exactly the start indexes of all occurrences of [math]\displaystyle{ T }[/math] in [math]\displaystyle{ S }[/math] that lie completely in the prefix [math]\displaystyle{ (S[1],...,S[i]) }[/math] of [math]\displaystyle{ S }[/math]. In other words, with [math]\displaystyle{ m }[/math] denoting the length of [math]\displaystyle{ T }[/math], [math]\displaystyle{ R }[/math] contains exactly the start indexes of occurrences in the range [math]\displaystyle{ 1,...,i-m+1 }[/math].
  2. In ascending order, [math]\displaystyle{ I }[/math] contains exactly the start indexes of all "candidates," that is, all start indexes [math]\displaystyle{ j }[/math] of potential occurrences of [math]\displaystyle{ T }[/math] in [math]\displaystyle{ S }[/math] that lie only partially in the prefix [math]\displaystyle{ (S[1],...,S[i]) }[/math] of [math]\displaystyle{ S }[/math]. In other words, [math]\displaystyle{ I }[/math] contains exactly the indexes [math]\displaystyle{ j\in\{i-m+2,...,i\} }[/math] with [math]\displaystyle{ (S[j],...,S[i])=(T[1],...,T[i-j+1]) }[/math].
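
One way to read the two invariants concretely is as a brute-force specification of what [math]\displaystyle{ R }[/math] and [math]\displaystyle{ I }[/math] must contain after [math]\displaystyle{ i }[/math] iterations. The following Python sketch is illustrative only. The function name `expected_state` is hypothetical, Python strings are 0-based whereas the text uses 1-based indexes, and the sketch assumes that a candidate is a start index [math]\displaystyle{ j }[/math] whose characters read so far match the corresponding prefix of [math]\displaystyle{ T }[/math].

```python
def expected_state(S: str, T: str, i: int):
    """Contents of R and I after i iterations, read off the invariants.

    All returned indexes are 1-based, as in the text.
    """
    m = len(T)
    # Invariant 1: occurrences that lie completely in S[1..i],
    # i.e. start indexes j in 1..i-m+1 with S[j..j+m-1] == T.
    R = [j for j in range(1, i - m + 2) if S[j - 1:j - 1 + m] == T]
    # Invariant 2 (assumed reading): start indexes j in i-m+2..i
    # with S[j..i] == T[1..i-j+1], i.e. a proper prefix of T matches.
    I = [j for j in range(max(1, i - m + 2), i + 1)
         if S[j - 1:i] == T[:i - j + 1]]
    return R, I
```

For example, with S = "abababa", T = "aba" and i = 4, this yields R = [1] and I = [3]: the occurrence starting at 1 is complete, and the start index 3 has matched "ab" so far.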

Variant: [math]\displaystyle{ i }[/math] increases by [math]\displaystyle{ 1 }[/math].

Break condition: [math]\displaystyle{ n }[/math] iterations completed, where [math]\displaystyle{ n }[/math] denotes the length of [math]\displaystyle{ S }[/math].

Induction basis

Abstract view: [math]\displaystyle{ R }[/math] and [math]\displaystyle{ I }[/math] have to be empty.

Implementation: obvious.

Proof: Nothing to show.

Induction step

Abstract view:

  1. For each start index [math]\displaystyle{ j }[/math] in [math]\displaystyle{ I }[/math], we decide whether this is still a candidate, that is, whether [math]\displaystyle{ S[i]=T[i-j+1] }[/math]. If not, we remove this start index from [math]\displaystyle{ I }[/math].
  2. Afterwards:
    1. If [math]\displaystyle{ S[i]=T[1] }[/math], [math]\displaystyle{ i }[/math] is a new candidate and is thus appended at the tail of [math]\displaystyle{ I }[/math].
    2. If [math]\displaystyle{ I\neq\emptyset }[/math] and [math]\displaystyle{ i-m+1 }[/math] is the head element of [math]\displaystyle{ I }[/math], this start index is removed from [math]\displaystyle{ I }[/math] and appended at the tail of [math]\displaystyle{ R }[/math].

Implementation: Obvious.
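
For readers who prefer code, the induction basis and the induction step can be sketched in Python as follows. This is a minimal sketch and not part of the original article. The function name `simple_string_matching` is hypothetical, a `collections.deque` stands in for the ordered sequence [math]\displaystyle{ I }[/math], and returning an empty result for an empty pattern is an assumption made here. Indexes in the result are 1-based, as in the text.

```python
from collections import deque


def simple_string_matching(S: str, T: str) -> list:
    """Return the 1-based start indexes of all occurrences of T in S."""
    n, m = len(S), len(T)
    if m == 0:
        return []  # assumption: an empty pattern yields no matches
    R = []         # induction basis: R is empty
    I = deque()    # induction basis: I is empty
    for i in range(1, n + 1):  # the i-th iteration reads S[i]
        # Step 1: keep candidate j only if S[i] == T[i-j+1]
        # (S[i-1] and T[i-j] in 0-based Python indexing).
        I = deque(j for j in I if S[i - 1] == T[i - j])
        # Step 2.1: i is a new candidate if S[i] == T[1].
        if S[i - 1] == T[0]:
            I.append(i)
        # Step 2.2: the head i-m+1, if present, is a complete occurrence.
        if I and I[0] == i - m + 1:
            R.append(I.popleft())
    return R
```

Calling `simple_string_matching("abababa", "aba")` returns `[1, 3, 5]`, so overlapping occurrences are reported as well.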

Correctness:

  1. Due to the second invariant, the induction hypothesis implies that all elements of [math]\displaystyle{ I }[/math] immediately before the [math]\displaystyle{ i }[/math]-th iteration must be from the set [math]\displaystyle{ \{i-m+1,...,i-1\} }[/math]. Therefore, if [math]\displaystyle{ i-m+1 }[/math] is in [math]\displaystyle{ I }[/math], it is the smallest element. Since the elements of [math]\displaystyle{ I }[/math] are in ascending order, [math]\displaystyle{ i-m+1 }[/math] is then at the head of [math]\displaystyle{ I }[/math].
  2. A start index [math]\displaystyle{ j }[/math] in [math]\displaystyle{ I }[/math] is to be dropped if, and only if, it turns out not to be a promising candidate anymore. As the induction hypothesis implies [math]\displaystyle{ (S[j],...,S[i-1])=(T[1],...,T[i-j]) }[/math], this is the case if, and only if, [math]\displaystyle{ S[i]\neq T[i-j+1] }[/math].
  3. Clearly, it would be incorrect to transfer any element of [math]\displaystyle{ I }[/math] except [math]\displaystyle{ i-m+1 }[/math] to [math]\displaystyle{ R }[/math]. Thus, we are right to focus on [math]\displaystyle{ i-m+1 }[/math] in Step 2.2. Now, for [math]\displaystyle{ m\ge 2 }[/math], [math]\displaystyle{ i-m+1 }[/math] is in [math]\displaystyle{ I }[/math] if, and only if, (1) it was present immediately before the [math]\displaystyle{ i }[/math]-th iteration and (2) it has survived the first step of the [math]\displaystyle{ i }[/math]-th iteration. In case (1), the induction hypothesis implies [math]\displaystyle{ (S[i-m+1],...,S[i-1])=(T[1],...,T[m-1]) }[/math], so it is correct to transfer [math]\displaystyle{ i-m+1 }[/math] to [math]\displaystyle{ R }[/math] if, and only if, [math]\displaystyle{ S[i]=T[m] }[/math] holds as well, which is exactly the condition tested in Step 1. For [math]\displaystyle{ m=1 }[/math], the index [math]\displaystyle{ i-m+1=i }[/math] enters [math]\displaystyle{ I }[/math] in Step 2.1 and is transferred immediately.
  4. Clearly, it would also be incorrect to add any element to [math]\displaystyle{ I }[/math] except [math]\displaystyle{ i }[/math]. So we are right to focus on [math]\displaystyle{ i }[/math] in Step 2.1. By induction hypothesis, all elements of [math]\displaystyle{ I }[/math] are less than [math]\displaystyle{ i }[/math], so we are also right to append [math]\displaystyle{ i }[/math] at the tail of [math]\displaystyle{ I }[/math] if [math]\displaystyle{ S[i]=T[1] }[/math].
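
The correctness argument above can also be checked empirically. The Python sketch below, which is an illustration under stated assumptions rather than a proof, runs the loop with assertions for the range claim of item 1 and for the first invariant, and compares the final result against a naive search. The names `naive_occurrences`, `run_with_checks` and `random_check` are hypothetical, a plain list stands in for [math]\displaystyle{ I }[/math], and [math]\displaystyle{ T }[/math] is assumed to be nonempty.

```python
import random


def naive_occurrences(S: str, T: str) -> list:
    """1-based start indexes of all occurrences of T in S, by direct comparison."""
    m = len(T)
    return [j for j in range(1, len(S) - m + 2) if S[j - 1:j - 1 + m] == T]


def run_with_checks(S: str, T: str) -> list:
    """Run the loop and assert the claims of the correctness argument."""
    n, m = len(S), len(T)  # assumes m >= 1
    R, I = [], []
    for i in range(1, n + 1):
        # Item 1: before iteration i, I lies in {i-m+1,...,i-1} and is ascending.
        assert all(i - m + 1 <= j <= i - 1 for j in I)
        assert I == sorted(I)
        I = [j for j in I if S[i - 1] == T[i - j]]  # Step 1
        if S[i - 1] == T[0]:                        # Step 2.1
            I.append(i)
        if I and I[0] == i - m + 1:                 # Step 2.2
            R.append(I.pop(0))
        # Invariant 1: R holds exactly the occurrences inside S[1..i].
        assert R == naive_occurrences(S[:i], T)
    return R


def random_check(trials: int = 200, seed: int = 0) -> bool:
    """Compare against the naive search on random strings over {a, b}."""
    rng = random.Random(seed)
    for _ in range(trials):
        S = "".join(rng.choice("ab") for _ in range(rng.randint(0, 12)))
        T = "".join(rng.choice("ab") for _ in range(rng.randint(1, 4)))
        if run_with_checks(S, T) != naive_occurrences(S, T):
            return False
    return True
```

A small alphabet is used on purpose, because it produces many overlapping candidates and therefore exercises Steps 1 and 2.2 frequently.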

Complexity

Statement: Let [math]\displaystyle{ r\in \mathbb{N} }[/math] denote the maximal number of candidates to be considered simultaneously. Then the worst-case run time is in [math]\displaystyle{ \mathcal{O}(n\cdot r)\subseteq\mathcal{O}(n\cdot m) }[/math].

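Proof: The first step of an iteration examines each candidate once and therefore takes [math]\displaystyle{ \mathcal{O}(r) }[/math] time, and the other steps take constant time. The loop has [math]\displaystyle{ n }[/math] iterations. Moreover, [math]\displaystyle{ r\leq m }[/math], because during the [math]\displaystyle{ i }[/math]-th iteration all candidates lie in the set [math]\displaystyle{ \{i-m+1,...,i\} }[/math] of [math]\displaystyle{ m }[/math] consecutive start indexes.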
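
To make the bound tangible, the following Python sketch counts the character comparisons performed in the first step of all iterations and records the largest number of simultaneous candidates. It is only an illustration: the function name `count_candidate_checks` is hypothetical, a plain list stands in for [math]\displaystyle{ I }[/math], and [math]\displaystyle{ T }[/math] is assumed to be nonempty.

```python
def count_candidate_checks(S: str, T: str):
    """Return (comparisons made in Step 1, peak number of candidates)."""
    n, m = len(S), len(T)  # assumes m >= 1
    R, I = [], []
    checks = 0
    peak = 0
    for i in range(1, n + 1):
        checks += len(I)                            # one comparison per candidate
        I = [j for j in I if S[i - 1] == T[i - j]]  # Step 1
        if S[i - 1] == T[0]:                        # Step 2.1
            I.append(i)
        peak = max(peak, len(I))                    # candidates held at once
        if I and I[0] == i - m + 1:                 # Step 2.2
            R.append(I.pop(0))
    return checks, peak
```

For S consisting of ten copies of "a" and T = "aaaa", every start index remains a candidate for as long as possible, and the function returns (24, 4), so the peak reaches [math]\displaystyle{ m }[/math]. For S = "abcdefghij" and T = "abcd", there is never more than one candidate, and the result is (3, 1).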