Simple string matching algorithm: Difference between revisions

Revision as of 10:24, 1 October 2014

Algorithmic problem: One-dimensional string matching

Prerequisites:

Type of algorithm: loop

Auxiliary data A FIFO queue [math]\displaystyle{ I }[/math] of natural numbers, which always contains the start indexes of all occurrences of [math]\displaystyle{ T }[/math] in [math]\displaystyle{ S }[/math] that have been seen only partially thus far.

Abstract view

Invariant: After [math]\displaystyle{ i\ge 0 }[/math] iterations:

In ascending order, [math]\displaystyle{ R }[/math] contains exactly the start indexes of all occurrences of [math]\displaystyle{ T }[/math] in [math]\displaystyle{ S }[/math] that lie completely in the substring [math]\displaystyle{ (S[1],...,S[i]) }[/math] of [math]\displaystyle{ S }[/math]. In other words, [math]\displaystyle{ R }[/math] contains the start indexes in the range [math]\displaystyle{ S[1],...,S[i-m+1] }[/math], where [math]\displaystyle{ m }[/math] is the length of [math]\displaystyle{ T }[/math].
In ascending order, [math]\displaystyle{ I }[/math] contains exactly the start indexes of all occurrences of [math]\displaystyle{ T }[/math] in [math]\displaystyle{ S }[/math] that lie partially in the substring [math]\displaystyle{ (S[1],...,S[i]) }[/math] of [math]\displaystyle{ S }[/math]. In other words, [math]\displaystyle{ I }[/math] contains the start indexes in the range [math]\displaystyle{ S[i-m+2],...,S[i] }[/math].

Variant: [math]\displaystyle{ i }[/math] increases by [math]\displaystyle{ 1 }[/math].

Break condition: [math]\displaystyle{ i=n }[/math].

Induction basis

Abstract view: [math]\displaystyle{ R }[/math] and [math]\displaystyle{ I }[/math] have to be empty.

Implementation:

[math]\displaystyle{ R:=\emptyset }[/math]
[math]\displaystyle{ I:=\emptyset }[/math]

Proof: Nothing to show.

Induction step

Abstract view:

For each start index in [math]\displaystyle{ I }[/math], we decide whether this is still a candidate after we have seen [math]\displaystyle{ S[i] }[/math]. If not, we remove this start index from [math]\displaystyle{ I }[/math].
Afterwards:
1. If [math]\displaystyle{ S[i]=T[1] }[/math], [math]\displaystyle{ i }[/math] is a new candidate and is thus appended at the tail of [math]\displaystyle{ I }[/math].
2. If start index [math]\displaystyle{ i-m+1 }[/math] is at the head of [math]\displaystyle{ I }[/math], this start index is removed from [math]\displaystyle{ I }[/math] and appended at the tail of [math]\displaystyle{ R }[/math].

Implementation: Obvious.

Correctness:

Due to the second invariant, the induction hypothesis implies that all elements of [math]\displaystyle{ I }[/math] immediately before the [math]\displaystyle{ i }[/math]-th iteration must be from the set [math]\displaystyle{ \{i-m+1,...,i-1\} }[/math]. Therefore, if [math]\displaystyle{ i-m+1 }[/math] is in [math]\displaystyle{ I }[/math], it is the smallest element. Since the elements of [math]\displaystyle{ I }[/math] are in ascending order, [math]\displaystyle{ i-m+1 }[/math] is then at the head of [math]\displaystyle{ I }[/math].
A start index [math]\displaystyle{ j }[/math] in [math]\displaystyle{ I }[/math] is to be dropped if, and only if, it turns out not to be a promising candidate anymore. As the induction hypothesis implies [math]\displaystyle{ (S[j],...,S[i-1])=(T[1],...,T[i-j]) }[/math], this is the case if, and only if, [math]\displaystyle{ S[i]\neq T[i-j+1] }[/math].
Clearly, it would be incorrect to transfer any element of [math]\displaystyle{ I }[/math] except [math]\displaystyle{ i-m+1 }[/math] to [math]\displaystyle{ R }[/math]. Thus, we are right to focus on [math]\displaystyle{ i-m+1 }[/math] in Step 2.2. Now, [math]\displaystyle{ i-m+1 }[/math] is in [math]\displaystyle{ I }[/math] if, and only if, (1) it was present immediately before the [math]\displaystyle{ i }[/math]-th iteration and (2) it has survived the first step of the [math]\displaystyle{ i }[/math]-th iteration. The induction hypothesis implies [math]\displaystyle{ (S[i-m+1],...,S[i-1])=(T[1],...,T[m-1]) }[/math], so it is correct to transfer [math]\displaystyle{ i-m+1 }[/math] to [math]\displaystyle{ R }[/math] if, and only if, it is [math]\displaystyle{ S[i]=T[m] }[/math] as well.
Clearly, it would also be incorrect to add any element to [math]\displaystyle{ I }[/math] except [math]\displaystyle{ i }[/math]. So we are right to focus on [math]\displaystyle{ i }[/math] in Step 2.1. Obviously, all elements of [math]\displaystyle{ I }[/math] are less than [math]\displaystyle{ i }[/math], so we are also right to append [math]\displaystyle{ i }[/math] at the tail of [math]\displaystyle{ I }[/math] in case [math]\displaystyle{ S[i]=T[1] }[/math].

Complexity

Further information

@@ Line 29: / Line 29: @@
 ==Induction step==
+'''Abstract view:'''
+# For each start index in <math>I</math>, we decide whether this is still a candidate after we have seen <math>S[i]</math>. If not, we remove this start index from <math>I</math>.
+# Afterwards:
+## If <math>S[i]=T[1]</math>, <math>i</math> is a new candidate and is thus appended at the tail of <math>I</math>.
+## If start index <math>i-m+1</math> is at the head of <math>I</math>, this start index is removed from <math>I</math> and appended at the tail of <math>R</math>.
+'''Implementation:''' Obvious.
+'''Correctness:'''
+# Due to the second invariant, the induction hypothesis implies that all elements of <math>I</math> immediately before the <math>i</math>-th iteration must be from the set <math>\{i-m+1,...,i-1\}</math>. Therefore, if <math>i-m+1</math> is in <math>I</math>, it is the smallest element. Since the elements of <math>I</math> are in ascending order, <math>i-m+1</math> is then at the head of <math>I</math>.
+# A start index <math>j</math> in <math>I</math> is to be dropped if, and only if, it turns out not to be a promising candidate anymore. As the induction hypothesis implies <math>(S[j],...,S[i-1])=(T[1],...,T[i-j])</math>, this is the case if, and only if, <math>S[i]\neq T[i-j+1]</math>.
+# Clearly, it would be incorrect to transfer any element of <math>I</math> except <math>i-m+1</math> to <math>R</math>. Thus, we are right to focus on <math>i-m+1</math> in Step 2.2. Now, <math>i-m+1</math> is in <math>I</math> if, and only if, (1) it was present immediately before the <math>i</math>-th iteration and (2) it has survived the first step of the <math>i</math>-th iteration. The induction hypothesis implies <math>(S[i-m+1],...,S[i-1])=(T[1],...,T[m-1])</math>, so it is correct to transfer <math>i-m+1</math> to <math>R</math> if, and only if, it is <math>S[i]=T[m]</math> as well.
+# Clearly, it would also be incorrect to add any element to <math>I</math> except <math>i</math>. So we are right to focus on <math>i</math> in Step 2.1. Obviously, all elements of <math>I</math> are less than <math>i</math>, so we are also right to append <math>i</math> at the tail of <math>I</math> in case <math>S[i]=T[1]</math>.
 ==Complexity==
 ==Further information==