String matching based on finite automaton

From Algowiki
Jump to navigation Jump to search
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

Algorithmic problem: One-dimensional string matching

Prerequisites:

Type of algorithm: loop

Auxiliary data:

  1. A current maximum prefix length [math]\displaystyle{ q\in \{ 0,...,m\} }[/math].
  2. A look-up table [math]\displaystyle{ \Delta }[/math] with rows [math]\displaystyle{ \{ 0,...,m\} }[/math] and one column for each character in the alphabet [math]\displaystyle{ \Sigma }[/math]. Required semantics: The value [math]\displaystyle{ \Delta [j,c] }[/math] is the length of the longest prefix of [math]\displaystyle{ T }[/math] that is also a suffix of [math]\displaystyle{ (T[1],...,T[j],c) }[/math]. In other words, [math]\displaystyle{ \Delta[j,c] }[/math] is the next value of [math]\displaystyle{ q }[/math] if [math]\displaystyle{ c }[/math] is the next character seen in [math]\displaystyle{ S }[/math].

Abstract view

Invariant: After [math]\displaystyle{ i \ge 0 }[/math] iterations:

  1. In ascending order, [math]\displaystyle{ R }[/math] contains the start indexes of all occurrences of [math]\displaystyle{ T }[/math] in [math]\displaystyle{ S }[/math] that lie completely in [math]\displaystyle{ (S[1],...,S[i]) }[/math]. In other words, [math]\displaystyle{ R }[/math] contains the start indexes in the range [math]\displaystyle{ (1,...,i-m+1]) }[/math].
  2. The value of [math]\displaystyle{ q }[/math] is the largest value [math]\displaystyle{ k }[/math] such that [math]\displaystyle{ (S[i-k+1],...,S[i])=(T[1],...,T[k]) }[/math].

To understand the rationale of the second invariant, note that [math]\displaystyle{ (S[i-k+1],...,S[i])=(T[1],...,T[k]) }[/math] is equivalent to the informal statement that [math]\displaystyle{ (S[i-k+1],...,S[i]) }[/math] is a candidate for an occurrence of [math]\displaystyle{ T }[/math]. In particular, [math]\displaystyle{ i-q+1 }[/math] is always the "most advanced" candidate (and [math]\displaystyle{ q=0 }[/math] means that, at the moment, there is no candidate at all).

Variant: [math]\displaystyle{ i }[/math] increases by [math]\displaystyle{ 1 }[/math].

Break condition: [math]\displaystyle{ n }[/math] iterations completed.

Induction basis

Abstract view:

  1. [math]\displaystyle{ R }[/math] is empty.
  2. The initial value of [math]\displaystyle{ q }[/math] is zero.
  3. The look-up table [math]\displaystyle{ \Delta }[/math] must be built according to the required semantics (see above).

Implementation:

  1. Set [math]\displaystyle{ R:=\emptyset }[/math].
  2. Set [math]\displaystyle{ q:=0 }[/math].
  3. For [math]\displaystyle{ c\in \Sigma }[/math]: If [math]\displaystyle{ T[1]=c }[/math], set [math]\displaystyle{ \Delta [0,c]:=1 }[/math]; otherwise, set [math]\displaystyle{ \Delta[0,c]:=0 }[/math].
  4. For [math]\displaystyle{ j\in \{ 1,...,m\} }[/math] and [math]\displaystyle{ c \in \Sigma }[/math]:
    1. Set [math]\displaystyle{ k:=\min \{ m,j+1 \} }[/math].
    2. While [math]\displaystyle{ k\gt 0 }[/math] and [math]\displaystyle{ (T[1],...,T[k]) \neq (T[j-k+2],...,T[j],c) }[/math], decrease [math]\displaystyle{ k }[/math] by [math]\displaystyle{ 1 }[/math].
    3. Set [math]\displaystyle{ \Delta[j,c]:=k }[/math].

Proof: Nothing is to show for [math]\displaystyle{ R }[/math] and [math]\displaystyle{ q }[/math], so consider [math]\displaystyle{ \Delta }[/math]. We have to show that the above-described intended semantics of [math]\displaystyle{ \Delta }[/math] is indeed fulfilled, because these intended semantics of [math]\displaystyle{ \Delta }[/math] will be the basis for the correctness proof of the induction step.

Correctness of Step 3 is easy to see, so consider Step 4.

According to its intended semantics, [math]\displaystyle{ \Delta [j,c] }[/math] can neither be larger than [math]\displaystyle{ m }[/math] nor larger than [math]\displaystyle{ j+1 }[/math]; in fact, any string longer than [math]\displaystyle{ m }[/math] is not a prefix of [math]\displaystyle{ T }[/math], and any string longer than [math]\displaystyle{ j+1 }[/math] is not a suffix of [math]\displaystyle{ (T[1],...,T[j],c) }[/math]. This observation justifies the initialization of [math]\displaystyle{ k }[/math] in Step 4.1.

Now, the countdown of [math]\displaystyle{ k }[/math] in Step 4.2 will terminate at the moment when, for the first time, [math]\displaystyle{ k }[/math] fulfills [math]\displaystyle{ (T[1],...,T[k])=(T[j-k+2],...,T[j],c) }[/math]. Note that this equality is tantamount to the statement that the prefix [math]\displaystyle{ (T[1],...,T[k]) }[/math] of [math]\displaystyle{ T }[/math] is a suffix of [math]\displaystyle{ (T[j-k+2],...,T[j],c) }[/math]. Due to the countdown, [math]\displaystyle{ k }[/math] is largest, so [math]\displaystyle{ \Delta [j,c] }[/math] conforms to its intended semantics. On the other hand, if the countdown terminates with [math]\displaystyle{ k=0 }[/math], there was no such non-empty substring, so Step 4.3 sets [math]\displaystyle{ \Delta [j,c]=0 }[/math], which conforms to the intended semantics in this case as well.

Induction step

Abstract view:

  1. The current maximum prefix length [math]\displaystyle{ q }[/math] is to be updated to reflect the situation after reading [math]\displaystyle{ S[i] }[/math].
  2. If [math]\displaystyle{ q=m }[/math], [math]\displaystyle{ i }[/math] completes an occurrence of [math]\displaystyle{ T }[/math] in [math]\displaystyle{ S }[/math], so the start index of this occurrence is to be appended to [math]\displaystyle{ R }[/math].

Implementation:

  1. Set [math]\displaystyle{ q:=\Delta [q,S[i]] }[/math].
  2. If [math]\displaystyle{ q=m }[/math], append [math]\displaystyle{ i-m+1 }[/math] to [math]\displaystyle{ R }[/math].

Correctness: If the invariant on [math]\displaystyle{ q }[/math] is maintained by Step 1, it is [math]\displaystyle{ q=m }[/math] if, and only if, [math]\displaystyle{ i }[/math] is the last index of an occurrence of [math]\displaystyle{ T }[/math]. This proves correctness of Step 2 and thus maintenance of the first invariant, provided maintenance of the second invariant (the invariant on [math]\displaystyle{ q }[/math]) is proved. So it remains to show that the invariant on [math]\displaystyle{ q }[/math] is indeed maintained.

Recall that the correctness proof of the induction basis has proved that [math]\displaystyle{ \Delta }[/math] fulfills the intended semantics as described above. Therefore, the new value of [math]\displaystyle{ q }[/math] after the [math]\displaystyle{ i }[/math]-th iteration is the largest length [math]\displaystyle{ k }[/math] of a prefix [math]\displaystyle{ (T[1],...,T[k] }[/math] that is identical to [math]\displaystyle{ (T[i-k+2],...,T[i],S[i]) }[/math]. As described above, this is the new "most advanced candidate", so the new value of [math]\displaystyle{ q }[/math] is correct.

Complexity

Statement: The worst-case complexity is in [math]\displaystyle{ \mathcal{O}(m^3 * |\Sigma | + n) }[/math].

Proof: The first summand results from Step 4 of the induction basis. In fact, for each [math]\displaystyle{ j \in \{ 1,...,m\} }[/math] and each [math]\displaystyle{ c \in \Sigma }[/math], the value of [math]\displaystyle{ \Delta [q,c] }[/math] is to be computed. The countdown to compute [math]\displaystyle{ \Delta [q,c] }[/math] takes at most [math]\displaystyle{ m }[/math] iterations. For the comparison of the two substrings within one step of the countdown, at most [math]\displaystyle{ m }[/math] pairs of characters have to be compared.

The second summand is the asymptotic complexity of the main loop: the number of iterations is at most [math]\displaystyle{ n }[/math], and each iteration takes constant time.

Further information

There is a more sophisticated approach to computing [math]\displaystyle{ \Delta }[/math], which results in [math]\displaystyle{ \mathcal{O}(m*|\Sigma |) }[/math] worst-case complexity