String matching based on finite automaton: Difference between revisions
(7 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
__NOTOC__ | __NOTOC__ | ||
[[Category:Checkup]] | |||
[[Category:Algorithm]] | [[Category:Algorithm]] | ||
[[Category:Main Algorithm]] | [[Category:Main Algorithm]] | ||
Line 10: | Line 11: | ||
'''Auxiliary data:''' | '''Auxiliary data:''' | ||
# A current '''maximum prefix length''' <math>q\in \{ 0,...,m\}</math>. | # A current '''maximum prefix length''' <math>q\in \{ 0,...,m\}</math>. | ||
# A look-up table <math>\Delta</math> with rows <math>\{ 0,...,m\}</math> and one column for each character in the alphabet <math>\Sigma</math>. | # A look-up table <math>\Delta</math> with rows <math>\{ 0,...,m\}</math> and one column for each character in the alphabet <math>\Sigma</math>. Required semantics: The value <math>\Delta [j,c]</math> is the length of the longest [[Strings|prefix]] of <math>T</math> that is also a [[Strings|suffix]] of <math>(T[1],...,T[j],c)</math>. In other words, <math>\Delta[j,c]</math> is the next value of <math>q</math> if <math>c</math> is the next character seen in <math>S</math>. | ||
==Abstract view== | ==Abstract view== | ||
'''Invariant:''' After <math>i \ge 0</math>: | '''Invariant:''' After <math>i \ge 0</math> iterations: | ||
# In ascending order, <math>R</math> contains the start indexes of all occurrences of <math>T</math> in <math>S</math> that lie completely in <math>(S[1],...,S[i])</math> | # In ascending order, <math>R</math> contains the start indexes of all occurrences of <math>T</math> in <math>S</math> that lie completely in <math>(S[1],...,S[i])</math>. In other words, <math>R</math> contains the start indexes in the range <math>(1,...,i-m+1])</math>. | ||
# The value of <math>q</math> is the largest value <math>k</math> such that <math>(S[i-k+1],...,S[i])=(T[1],...,T[k])</math>. | # The value of <math>q</math> is the largest value <math>k</math> such that <math>(S[i-k+1],...,S[i])=(T[1],...,T[k])</math>. | ||
To understand the rationale of the second invariant, note that <math>(S[i-k+1],...,S[i])=(T[1],...,T[k])</math> is equivalent to the informal statement that <math>(S[i-k+1],...,S[i])</math> is a candidate for an occurrence of <math>T</math>. In particular, <math>i-q+1</math> is always the "most advanced" candidate (and <math>q=0</math> means that, at the moment, there is no candidate at all). | To understand the rationale of the second invariant, note that <math>(S[i-k+1],...,S[i])=(T[1],...,T[k])</math> is equivalent to the informal statement that <math>(S[i-k+1],...,S[i])</math> is a candidate for an occurrence of <math>T</math>. In particular, <math>i-q+1</math> is always the "most advanced" candidate (and <math>q=0</math> means that, at the moment, there is no candidate at all). | ||
Line 20: | Line 21: | ||
'''Variant:''' <math>i</math> increases by <math>1</math>. | '''Variant:''' <math>i</math> increases by <math>1</math>. | ||
'''Break condition:''' <math> | '''Break condition:''' <math>n</math> iterations completed. | ||
==Induction basis== | ==Induction basis== | ||
Line 26: | Line 27: | ||
# <math>R</math> is empty. | # <math>R</math> is empty. | ||
# The initial value of <math>q</math> is zero. | # The initial value of <math>q</math> is zero. | ||
# The look-up table <math>\Delta</math> must be built. | # The look-up table <math>\Delta</math> must be built according to the required semantics (see above). | ||
'''Implementation:''' | '''Implementation:''' | ||
Line 39: | Line 40: | ||
'''Proof:''' Nothing is to show for <math>R</math> and <math>q</math>, so consider <math>\Delta</math>. We have to show that the above-described intended semantics of <math>\Delta</math> is indeed fulfilled, because these intended semantics of <math>\Delta</math> will be the basis for the correctness proof of the induction step. | '''Proof:''' Nothing is to show for <math>R</math> and <math>q</math>, so consider <math>\Delta</math>. We have to show that the above-described intended semantics of <math>\Delta</math> is indeed fulfilled, because these intended semantics of <math>\Delta</math> will be the basis for the correctness proof of the induction step. | ||
Correctness of Step 3 is | Correctness of Step 3 is easy to see, so consider Step 4. | ||
According to its intended semantics, <math>\Delta [j,c]</math> can neither be larger than <math>m</math> nor larger than <math>j+1</math>; in fact, any string longer than <math>m</math> is not a prefix of <math>T</math>, and any string longer than <math>j+1</math> is not a suffix of <math>(T[1],...,T[j],c)</math>. This observation justifies the initialization of <math>k</math> in Step 4.1. | According to its intended semantics, <math>\Delta [j,c]</math> can neither be larger than <math>m</math> nor larger than <math>j+1</math>; in fact, any string longer than <math>m</math> is not a prefix of <math>T</math>, and any string longer than <math>j+1</math> is not a suffix of <math>(T[1],...,T[j],c)</math>. This observation justifies the initialization of <math>k</math> in Step 4.1. | ||
Now, the countdown of <math>k</math> in Step 4.2 will terminate at the moment when, for the first time, <math>k</math> fulfills <math>(T[1],...,T[k])=(T[j-k+2],...,T[j],c)</math>. Note that this equality is tantamount to the statement that the prefix <math>(T[1],...,T[k])</math> of <math>T</math> is a suffix of <math>(T[j-k+2],...,T[j],c)</math>. Due to the countdown, <math>k</math> is largest, so <math>\Delta [j,c]</math> conforms to its intended semantics. On the other hand, if the countdown terminates with <math>k=0</math>, there was no such non-empty substring, so | Now, the countdown of <math>k</math> in Step 4.2 will terminate at the moment when, for the first time, <math>k</math> fulfills <math>(T[1],...,T[k])=(T[j-k+2],...,T[j],c)</math>. Note that this equality is tantamount to the statement that the prefix <math>(T[1],...,T[k])</math> of <math>T</math> is a suffix of <math>(T[j-k+2],...,T[j],c)</math>. Due to the countdown, <math>k</math> is largest, so <math>\Delta [j,c]</math> conforms to its intended semantics. On the other hand, if the countdown terminates with <math>k=0</math>, there was no such non-empty substring, so Step 4.3 sets <math>\Delta [j,c]=0</math>, which conforms to the intended semantics in this case as well. | ||
==Induction step== | ==Induction step== | ||
Line 59: | Line 60: | ||
==Complexity== | ==Complexity== | ||
'''Statement:''' The worst-case complexity is in <math>\mathcal{O}(m^3 * |\Sigma | + n)</math>. | |||
'''Proof:''' The first summand results from Step 4 of the induction basis. In fact, for each <math>j \in \{ 1,...,m\}</math> and each <math>c \in \Sigma</math>, the value of <math>\Delta [q,c]</math> is to be computed. The countdown to compute <math>\Delta [q,c]</math> takes at most <math>m</math> iterations. For the comparison of the two substrings within one step of the countdown, at most <math>m</math> pairs of characters have to be compared. | |||
The second summand is the asymptotic complexity of the main loop: the number of iterations is at most <math>n</math>, and each iteration takes constant time. | |||
==Further information== | ==Further information== | ||
There is a more sophisticated approach to computing <math>\Delta</math>, which results in <math>\mathcal{O}(m*|\Sigma |)</math> worst-case complexity |
Latest revision as of 14:36, 16 September 2015
Algorithmic problem: One-dimensional string matching
Prerequisites:
Type of algorithm: loop
Auxiliary data:
- A current maximum prefix length [math]\displaystyle{ q\in \{ 0,...,m\} }[/math].
- A look-up table [math]\displaystyle{ \Delta }[/math] with rows [math]\displaystyle{ \{ 0,...,m\} }[/math] and one column for each character in the alphabet [math]\displaystyle{ \Sigma }[/math]. Required semantics: The value [math]\displaystyle{ \Delta [j,c] }[/math] is the length of the longest prefix of [math]\displaystyle{ T }[/math] that is also a suffix of [math]\displaystyle{ (T[1],...,T[j],c) }[/math]. In other words, [math]\displaystyle{ \Delta[j,c] }[/math] is the next value of [math]\displaystyle{ q }[/math] if [math]\displaystyle{ c }[/math] is the next character seen in [math]\displaystyle{ S }[/math].
Abstract view
Invariant: After [math]\displaystyle{ i \ge 0 }[/math] iterations:
- In ascending order, [math]\displaystyle{ R }[/math] contains the start indexes of all occurrences of [math]\displaystyle{ T }[/math] in [math]\displaystyle{ S }[/math] that lie completely in [math]\displaystyle{ (S[1],...,S[i]) }[/math]. In other words, [math]\displaystyle{ R }[/math] contains the start indexes in the range [math]\displaystyle{ (1,...,i-m+1]) }[/math].
- The value of [math]\displaystyle{ q }[/math] is the largest value [math]\displaystyle{ k }[/math] such that [math]\displaystyle{ (S[i-k+1],...,S[i])=(T[1],...,T[k]) }[/math].
To understand the rationale of the second invariant, note that [math]\displaystyle{ (S[i-k+1],...,S[i])=(T[1],...,T[k]) }[/math] is equivalent to the informal statement that [math]\displaystyle{ (S[i-k+1],...,S[i]) }[/math] is a candidate for an occurrence of [math]\displaystyle{ T }[/math]. In particular, [math]\displaystyle{ i-q+1 }[/math] is always the "most advanced" candidate (and [math]\displaystyle{ q=0 }[/math] means that, at the moment, there is no candidate at all).
Variant: [math]\displaystyle{ i }[/math] increases by [math]\displaystyle{ 1 }[/math].
Break condition: [math]\displaystyle{ n }[/math] iterations completed.
Induction basis
Abstract view:
- [math]\displaystyle{ R }[/math] is empty.
- The initial value of [math]\displaystyle{ q }[/math] is zero.
- The look-up table [math]\displaystyle{ \Delta }[/math] must be built according to the required semantics (see above).
Implementation:
- Set [math]\displaystyle{ R:=\emptyset }[/math].
- Set [math]\displaystyle{ q:=0 }[/math].
- For [math]\displaystyle{ c\in \Sigma }[/math]: If [math]\displaystyle{ T[1]=c }[/math], set [math]\displaystyle{ \Delta [0,c]:=1 }[/math]; otherwise, set [math]\displaystyle{ \Delta[0,c]:=0 }[/math].
- For [math]\displaystyle{ j\in \{ 1,...,m\} }[/math] and [math]\displaystyle{ c \in \Sigma }[/math]:
- Set [math]\displaystyle{ k:=\min \{ m,j+1 \} }[/math].
- While [math]\displaystyle{ k\gt 0 }[/math] and [math]\displaystyle{ (T[1],...,T[k]) \neq (T[j-k+2],...,T[j],c) }[/math], decrease [math]\displaystyle{ k }[/math] by [math]\displaystyle{ 1 }[/math].
- Set [math]\displaystyle{ \Delta[j,c]:=k }[/math].
Proof: Nothing is to show for [math]\displaystyle{ R }[/math] and [math]\displaystyle{ q }[/math], so consider [math]\displaystyle{ \Delta }[/math]. We have to show that the above-described intended semantics of [math]\displaystyle{ \Delta }[/math] is indeed fulfilled, because these intended semantics of [math]\displaystyle{ \Delta }[/math] will be the basis for the correctness proof of the induction step.
Correctness of Step 3 is easy to see, so consider Step 4.
According to its intended semantics, [math]\displaystyle{ \Delta [j,c] }[/math] can neither be larger than [math]\displaystyle{ m }[/math] nor larger than [math]\displaystyle{ j+1 }[/math]; in fact, any string longer than [math]\displaystyle{ m }[/math] is not a prefix of [math]\displaystyle{ T }[/math], and any string longer than [math]\displaystyle{ j+1 }[/math] is not a suffix of [math]\displaystyle{ (T[1],...,T[j],c) }[/math]. This observation justifies the initialization of [math]\displaystyle{ k }[/math] in Step 4.1.
Now, the countdown of [math]\displaystyle{ k }[/math] in Step 4.2 will terminate at the moment when, for the first time, [math]\displaystyle{ k }[/math] fulfills [math]\displaystyle{ (T[1],...,T[k])=(T[j-k+2],...,T[j],c) }[/math]. Note that this equality is tantamount to the statement that the prefix [math]\displaystyle{ (T[1],...,T[k]) }[/math] of [math]\displaystyle{ T }[/math] is a suffix of [math]\displaystyle{ (T[j-k+2],...,T[j],c) }[/math]. Due to the countdown, [math]\displaystyle{ k }[/math] is largest, so [math]\displaystyle{ \Delta [j,c] }[/math] conforms to its intended semantics. On the other hand, if the countdown terminates with [math]\displaystyle{ k=0 }[/math], there was no such non-empty substring, so Step 4.3 sets [math]\displaystyle{ \Delta [j,c]=0 }[/math], which conforms to the intended semantics in this case as well.
Induction step
Abstract view:
- The current maximum prefix length [math]\displaystyle{ q }[/math] is to be updated to reflect the situation after reading [math]\displaystyle{ S[i] }[/math].
- If [math]\displaystyle{ q=m }[/math], [math]\displaystyle{ i }[/math] completes an occurrence of [math]\displaystyle{ T }[/math] in [math]\displaystyle{ S }[/math], so the start index of this occurrence is to be appended to [math]\displaystyle{ R }[/math].
Implementation:
- Set [math]\displaystyle{ q:=\Delta [q,S[i]] }[/math].
- If [math]\displaystyle{ q=m }[/math], append [math]\displaystyle{ i-m+1 }[/math] to [math]\displaystyle{ R }[/math].
Correctness: If the invariant on [math]\displaystyle{ q }[/math] is maintained by Step 1, it is [math]\displaystyle{ q=m }[/math] if, and only if, [math]\displaystyle{ i }[/math] is the last index of an occurrence of [math]\displaystyle{ T }[/math]. This proves correctness of Step 2 and thus maintenance of the first invariant, provided maintenance of the second invariant (the invariant on [math]\displaystyle{ q }[/math]) is proved. So it remains to show that the invariant on [math]\displaystyle{ q }[/math] is indeed maintained.
Recall that the correctness proof of the induction basis has proved that [math]\displaystyle{ \Delta }[/math] fulfills the intended semantics as described above. Therefore, the new value of [math]\displaystyle{ q }[/math] after the [math]\displaystyle{ i }[/math]-th iteration is the largest length [math]\displaystyle{ k }[/math] of a prefix [math]\displaystyle{ (T[1],...,T[k] }[/math] that is identical to [math]\displaystyle{ (T[i-k+2],...,T[i],S[i]) }[/math]. As described above, this is the new "most advanced candidate", so the new value of [math]\displaystyle{ q }[/math] is correct.
Complexity
Statement: The worst-case complexity is in [math]\displaystyle{ \mathcal{O}(m^3 * |\Sigma | + n) }[/math].
Proof: The first summand results from Step 4 of the induction basis. In fact, for each [math]\displaystyle{ j \in \{ 1,...,m\} }[/math] and each [math]\displaystyle{ c \in \Sigma }[/math], the value of [math]\displaystyle{ \Delta [q,c] }[/math] is to be computed. The countdown to compute [math]\displaystyle{ \Delta [q,c] }[/math] takes at most [math]\displaystyle{ m }[/math] iterations. For the comparison of the two substrings within one step of the countdown, at most [math]\displaystyle{ m }[/math] pairs of characters have to be compared.
The second summand is the asymptotic complexity of the main loop: the number of iterations is at most [math]\displaystyle{ n }[/math], and each iteration takes constant time.
Further information
There is a more sophisticated approach to computing [math]\displaystyle{ \Delta }[/math], which results in [math]\displaystyle{ \mathcal{O}(m*|\Sigma |) }[/math] worst-case complexity