String matching based on finite automaton: Difference between revisions

Revision as of 14:00, 1 October 2014

Algorithmic problem: One-dimensional string matching

Prerequisites:

Type of algorithm: loop

Auxiliary data:

A current maximum prefix length [math]\displaystyle{ q\in \{ 0,...,m\} }[/math].
A look-up table [math]\displaystyle{ \Delta }[/math] with rows [math]\displaystyle{ \{ 0,...,m\} }[/math] and one column for each character in the alphabet [math]\displaystyle{ \Sigma }[/math]. Intended semantics: The value [math]\displaystyle{ \Delta [j,c] }[/math] is the length of the longest prefix of [math]\displaystyle{ T }[/math] that is also a suffix of [math]\displaystyle{ (T[1],...,T[j],c) }[/math].

Abstract view

Invariant: After [math]\displaystyle{ i \ge 0 }[/math]:

In ascending order, [math]\displaystyle{ R }[/math] contains the start indexes of all occurrences of [math]\displaystyle{ T }[/math] in [math]\displaystyle{ S }[/math] that lie completely in [math]\displaystyle{ (S[1],...,S[i]) }[/math]; in other words, the start indexes in the range [math]\displaystyle{ (1,...,i-m+1]) }[/math].
The value of [math]\displaystyle{ q }[/math] is the largest value [math]\displaystyle{ k }[/math] such that [math]\displaystyle{ (S[i-k+1],...,S[i])=(T[1],...,T[k]) }[/math].

To understand the rationale of the second invariant, note that [math]\displaystyle{ (S[i-k+1],...,S[i])=(T[1],...,T[k]) }[/math] is equivalent to the informal statement that [math]\displaystyle{ (S[i-k+1],...,S[i]) }[/math] is a candidate for an occurrence of [math]\displaystyle{ T }[/math]. In particular, [math]\displaystyle{ i-q+1 }[/math] is always the "most advanced" candidate (and [math]\displaystyle{ q=0 }[/math] means that, at the moment, there is no candidate at all).

Variant: [math]\displaystyle{ i }[/math] increases by [math]\displaystyle{ 1 }[/math].

Break condition: [math]\displaystyle{ i=n }[/math].

Induction basis

Abstract view:

[math]\displaystyle{ R }[/math] is empty.
The initial value of [math]\displaystyle{ q }[/math] is zero.
The look-up table [math]\displaystyle{ \Delta }[/math] must be built.

Implementation:

Set [math]\displaystyle{ R:=\emptyset }[/math].
Set [math]\displaystyle{ q:=0 }[/math].
For [math]\displaystyle{ c\in \Sigma }[/math]: If [math]\displaystyle{ T[1]=c }[/math], set [math]\displaystyle{ \Delta [0,c]:=1 }[/math]; otherwise, set [math]\displaystyle{ \Delta[0,c]:=0 }[/math].
For [math]\displaystyle{ j\in \{ 1,...,m\} }[/math] and [math]\displaystyle{ c \in \Sigma }[/math]:
1. Set [math]\displaystyle{ k:=\min \{ m,j+1 \} }[/math].
2. While [math]\displaystyle{ k\gt 0 }[/math] and [math]\displaystyle{ (T[1],...,T[k]) \neq (T[j-k+2],...,T[j],c) }[/math], decrease [math]\displaystyle{ k }[/math] by [math]\displaystyle{ 1 }[/math].
3. Set [math]\displaystyle{ \Delta[j,c]:=k }[/math].

Proof: Nothing is to show for [math]\displaystyle{ R }[/math] and [math]\displaystyle{ q }[/math], so consider [math]\displaystyle{ \Delta }[/math]. We have to show that the above-described intended semantics of [math]\displaystyle{ \Delta }[/math] is indeed fulfilled, because these intended semantics of [math]\displaystyle{ \Delta }[/math] will be the basis for the correctness proof of the induction step.

Correctness of Step 3 is also easy to see, so consider Step 4.

According to its intended semantics, [math]\displaystyle{ \Delta [j,c] }[/math] can neither be larger than [math]\displaystyle{ m }[/math] nor larger than [math]\displaystyle{ j+1 }[/math]; in fact, any string longer than [math]\displaystyle{ m }[/math] is not a prefix of [math]\displaystyle{ T }[/math], and any string longer than [math]\displaystyle{ j+1 }[/math] is not a suffix of [math]\displaystyle{ (T[1],...,T[j],c) }[/math]. This observation justifies the initialization of [math]\displaystyle{ k }[/math] in Step 4.1.

Now, the countdown of [math]\displaystyle{ k }[/math] in Step 4.2 will terminate at the moment when, for the first time, [math]\displaystyle{ k }[/math] fulfills [math]\displaystyle{ (T[1],...,T[k])=(T[j-k+2],...,T[j],c) }[/math]. Note that this equality is tantamount to the statement that the prefix [math]\displaystyle{ (T[1],...,T[k]) }[/math] of [math]\displaystyle{ T }[/math] is a suffix of [math]\displaystyle{ (T[j-k+2],...,T[j],c) }[/math]. Due to the countdown, [math]\displaystyle{ k }[/math] is largest, so [math]\displaystyle{ \Delta [j,c] }[/math] conforms to its intended semantics. On the other hand, if the countdown terminates with [math]\displaystyle{ k=0 }[/math], there was no such non-empty substring, so it is [math]\displaystyle{ \Delta [j,c]=0 }[/math], which conforms to the intended semantics in this case as well.

Induction step

Abstract view:

The current maximum prefix length [math]\displaystyle{ q }[/math] is to be updated to reflect the situation after reading [math]\displaystyle{ S[i] }[/math].
If [math]\displaystyle{ q=m }[/math], [math]\displaystyle{ i }[/math] completes an occurrence of [math]\displaystyle{ T }[/math] in [math]\displaystyle{ S }[/math], so the start index of this occurrence is to be appended to [math]\displaystyle{ R }[/math].

Implementation:

Set [math]\displaystyle{ q:=\Delta [q,S[i]] }[/math].
If [math]\displaystyle{ q=m }[/math], append [math]\displaystyle{ i-m+1 }[/math] to [math]\displaystyle{ R }[/math].

Correctness: If the invariant on [math]\displaystyle{ q }[/math] is maintained by Step 1, it is [math]\displaystyle{ q=m }[/math] if, and only if, [math]\displaystyle{ i }[/math] is the last index of an occurrence of [math]\displaystyle{ T }[/math]. This proves correctness of Step 2 and thus maintenance of the first invariant, provided maintenance of the second invariant (the invariant on [math]\displaystyle{ q }[/math]) is proved. So it remains to show that the invariant on [math]\displaystyle{ q }[/math] is indeed maintained.

Recall that the correctness proof of the induction basis has proved that [math]\displaystyle{ \Delta }[/math] fulfills the intended semantics as described above. Therefore, the new value of [math]\displaystyle{ q }[/math] after the [math]\displaystyle{ i }[/math]-th iteration is the largest length [math]\displaystyle{ k }[/math] of a prefix [math]\displaystyle{ (T[1],...,T[k] }[/math] that is identical to [math]\displaystyle{ (T[i-k+2],...,T[i],S[i]) }[/math]. As described above, this is the new "most advanced candidate", so the new value of [math]\displaystyle{ q }[/math] is correct.

Complexity

Further information

@@ Line 46: / Line 46: @@
 ==Induction step==
+'''Abstract view:'''
+# The current maximum prefix length <math>q</math> is to be updated to reflect the situation after reading <math>S[i]</math>.
+# If <math>q=m</math>, <math>i</math> completes an occurrence of <math>T</math> in <math>S</math>, so the start index of this occurrence is to be appended to <math>R</math>.
+'''Implementation:'''
+# Set <math>q:=\Delta [q,S[i]]</math>.
+# If <math>q=m</math>, append <math>i-m+1</math> to <math>R</math>.
+'''Correctness:''' If the invariant on <math>q</math> is maintained by Step 1, it is <math>q=m</math> if, and only if, <math>i</math> is the last index of an occurrence of <math>T</math>. This proves correctness of Step 2 and thus maintenance of the first invariant, provided maintenance of the second invariant (the invariant on <math>q</math>) is proved. So it remains to show that the invariant on <math>q</math> is indeed maintained.
+Recall that the correctness proof of the induction basis has proved that <math>\Delta</math> fulfills the intended semantics as described above. Therefore, the new value of <math>q</math> after the <math>i</math>-th iteration is the largest length <math>k</math> of a prefix <math>(T[1],...,T[k]</math> that is identical to <math>(T[i-k+2],...,T[i],S[i])</math>. As described above, this is the new "most advanced candidate", so the new value of <math>q</math> is correct.
 ==Complexity==
 ==Further information==