Simple string matching algorithm: Difference between revisions

From Algowiki
Jump to navigation Jump to search
No edit summary
 
(11 intermediate revisions by 3 users not shown)
Line 1: Line 1:
__NOTOC__
[[Category:Videos]]
[[Category:Algorithm]]
[[Category:Algorithm]]
[[Category:Main Algorithm]]
[[Category:Main Algorithm]]
{{#ev:youtube|https://www.youtube.com/watch?v=5p4fZGRaYuo|500|right|Simple string matching|frame}}
'''Algorithmic problem:''' [[One-dimensional string matching]]
'''Algorithmic problem:''' [[One-dimensional string matching]]


'''Prerequisites:'''
==Abstract view==
'''Type of algorithm:''' loop


'''Type of algorithm:''' loop
'''Auxiliary data''' An [[ordered sequence]] <math>I</math> of natural numbers.
 
'''Invariant:''' After <math>i\ge 0</math> iterations:
# In ascending order, <math>R</math> contains exactly the start indexes of all occurrences of <math>T</math> in <math>S</math> that lie completely in the substring <math>(S[1],...,S[i])</math> of <math>S</math>. In other words, <math>R</math> contains the start indexes in the range <math>S[1],...,S[i-m+1]</math>.
# In ascending order, <math>I</math> contains exactly the start indexes of all "candidates," that is, all occurrences of <math>T</math> in <math>S</math> that lie partially in the substring <math>(S[1],...,S[i])</math> of <math>S</math>. In other words, <math>I</math> contains the start indexes in the range <math>S[i-m+2],...,S[i]</math>.


'''Auxiliary data''' A [[FIFO queue]] <math>I</math> of natural numbers, which always contains the start indexes of all occurrences of <math>T</math> in <math>S</math> that have been seen only partially thus far.
'''Variant:''' <math>i</math> increases by <math>1</math>.


==Abstract view==
'''Break condition:''' <math>n</math> iterations completed.


==Induction basis==
==Induction basis==
'''Abstract view:''' <math>R</math> and <math>I</math> have to be empty.
'''Implementation:''' obvious.
'''Proof:''' Nothing to show.


==Induction step==
==Induction step==
'''Abstract view:'''
# For each start index <math>j</math> in <math>I</math>, we decide whether this is still a candidate, that is, whether <math>S[i]=T[i-j+1]</math>. If not, we remove this start index from <math>I</math>.
# Afterwards:
## If <math>S[i]=T[1]</math>, <math>i</math> is a new candidate and is thus appended at the tail of <math>I</math>.
## If <math>I\neq\emptyset</math> and <math>i-m+1</math> is the head element of <math>I</math>, this start index is removed from <math>I</math> and appended at the tail of <math>R</math>.
'''Implementation:''' Obvious.
'''Correctness:'''
# Due to the second invariant, the induction hypothesis implies that all elements of <math>I</math> immediately before the <math>i</math>-th iteration must be from the set <math>\{i-m+1,...,i-1\}</math>. Therefore, if <math>i-m+1</math> is in <math>I</math>, it is the smallest element. Since the elements of <math>I</math> are in ascending order, <math>i-m+1</math> is then at the head of <math>I</math>.
# A start index <math>j</math> in <math>I</math> is to be dropped if, and only if, it turns out not to be a promising candidate anymore. As the induction hypothesis implies <math>(S[j],...,S[i-1])=(T[1],...,T[i-j])</math>, this is the case if, and only if, <math>S[i]\neq T[i-j+1]</math>.
# Clearly, it would be incorrect to transfer any element of <math>I</math> except <math>i-m+1</math> to <math>R</math>. Thus, we are right to focus on <math>i-m+1</math> in Step 2.2. Now, <math>i-m+1</math> is in <math>I</math> if, and only if, (1) it was present immediately before the <math>i</math>-th iteration and (2) it has survived the first step of the <math>i</math>-th iteration. The induction hypothesis implies <math>(S[i-m+1],...,S[i-1])=(T[1],...,T[m-1])</math>, so it is correct to transfer <math>i-m+1</math> to <math>R</math> if, and only if, it is <math>S[i]=T[m]</math> as well.
# Clearly, it would also be incorrect to add any element to <math>I</math> except <math>i</math>. So we are right to focus on <math>i</math> in Step 2.1.By induction hypothesis, all elements of <math>I</math> are less than <math>i</math>, so we are also right to append <math>i</math> at the tail of <math>I</math> in case <math>S[i]=T[1]</math>.


==Complexity==
==Complexity==
'''Statement:''' Let <math>r\in \mathbb{N}</math> denote the maximal number of candidates to be considered simultaneously. Then the worst-case run time is in <math>\mathcal{O}(n\cdot r)\subseteq\mathcal{O}(n\cdot m)</math>.


==Further information==
'''Proof:''' Obviously, the first step of an iteration takes <math>\mathcal{O}(r)</math> time, and the other steps take constant time. The loop has <math>n</math> iterations. Of course, it is <math>r\leq m</math>.

Latest revision as of 10:13, 21 September 2015

Simple string matching

Algorithmic problem: One-dimensional string matching

Abstract view

Type of algorithm: loop

Auxiliary data An ordered sequence I of natural numbers.

Invariant: After i0 iterations:

  1. In ascending order, R contains exactly the start indexes of all occurrences of T in S that lie completely in the substring (S[1],...,S[i]) of S. In other words, R contains the start indexes in the range S[1],...,S[im+1].
  2. In ascending order, I contains exactly the start indexes of all "candidates," that is, all occurrences of T in S that lie partially in the substring (S[1],...,S[i]) of S. In other words, I contains the start indexes in the range S[im+2],...,S[i].

Variant: i increases by 1.

Break condition: n iterations completed.

Induction basis

Abstract view: R and I have to be empty.

Implementation: obvious.

Proof: Nothing to show.

Induction step

Abstract view:

  1. For each start index j in I, we decide whether this is still a candidate, that is, whether S[i]=T[ij+1]. If not, we remove this start index from I.
  2. Afterwards:
    1. If S[i]=T[1], i is a new candidate and is thus appended at the tail of I.
    2. If I and im+1 is the head element of I, this start index is removed from I and appended at the tail of R.

Implementation: Obvious.

Correctness:

  1. Due to the second invariant, the induction hypothesis implies that all elements of I immediately before the i-th iteration must be from the set {im+1,...,i1}. Therefore, if im+1 is in I, it is the smallest element. Since the elements of I are in ascending order, im+1 is then at the head of I.
  2. A start index j in I is to be dropped if, and only if, it turns out not to be a promising candidate anymore. As the induction hypothesis implies (S[j],...,S[i1])=(T[1],...,T[ij]), this is the case if, and only if, S[i]T[ij+1].
  3. Clearly, it would be incorrect to transfer any element of I except im+1 to R. Thus, we are right to focus on im+1 in Step 2.2. Now, im+1 is in I if, and only if, (1) it was present immediately before the i-th iteration and (2) it has survived the first step of the i-th iteration. The induction hypothesis implies (S[im+1],...,S[i1])=(T[1],...,T[m1]), so it is correct to transfer im+1 to R if, and only if, it is S[i]=T[m] as well.
  4. Clearly, it would also be incorrect to add any element to I except i. So we are right to focus on i in Step 2.1.By induction hypothesis, all elements of I are less than i, so we are also right to append i at the tail of I in case S[i]=T[1].

Complexity

Statement: Let rN denote the maximal number of candidates to be considered simultaneously. Then the worst-case run time is in O(nr)O(nm).

Proof: Obviously, the first step of an iteration takes O(r) time, and the other steps take constant time. The loop has n iterations. Of course, it is rm.