String matching based on finite automaton

From Algowiki
Jump to: navigation, search

Algorithmic problem: One-dimensional string matching

Prerequisites:

Type of algorithm: loop

Auxiliary data:

  1. A current maximum prefix length [math]q\in \{ 0,...,m\}[/math].
  2. A look-up table [math]\Delta[/math] with rows [math]\{ 0,...,m\}[/math] and one column for each character in the alphabet [math]\Sigma[/math]. Required semantics: The value [math]\Delta [j,c][/math] is the length of the longest prefix of [math]T[/math] that is also a suffix of [math](T[1],...,T[j],c)[/math]. In other words, [math]\Delta[j,c][/math] is the next value of [math]q[/math] if [math]c[/math] is the next character seen in [math]S[/math].

Abstract view

Invariant: After [math]i \ge 0[/math] iterations:

  1. In ascending order, [math]R[/math] contains the start indexes of all occurrences of [math]T[/math] in [math]S[/math] that lie completely in [math](S[1],...,S[i])[/math]. In other words, [math]R[/math] contains the start indexes in the range [math](1,...,i-m+1])[/math].
  2. The value of [math]q[/math] is the largest value [math]k[/math] such that [math](S[i-k+1],...,S[i])=(T[1],...,T[k])[/math].

To understand the rationale of the second invariant, note that [math](S[i-k+1],...,S[i])=(T[1],...,T[k])[/math] is equivalent to the informal statement that [math](S[i-k+1],...,S[i])[/math] is a candidate for an occurrence of [math]T[/math]. In particular, [math]i-q+1[/math] is always the "most advanced" candidate (and [math]q=0[/math] means that, at the moment, there is no candidate at all).

Variant: [math]i[/math] increases by [math]1[/math].

Break condition: [math]n[/math] iterations completed.

Induction basis

Abstract view:

  1. [math]R[/math] is empty.
  2. The initial value of [math]q[/math] is zero.
  3. The look-up table [math]\Delta[/math] must be built according to the required semantics (see above).

Implementation:

  1. Set [math]R:=\emptyset[/math].
  2. Set [math]q:=0[/math].
  3. For [math]c\in \Sigma[/math]: If [math]T[1]=c[/math], set [math]\Delta [0,c]:=1[/math]; otherwise, set [math]\Delta[0,c]:=0[/math].
  4. For [math]j\in \{ 1,...,m\}[/math] and [math]c \in \Sigma[/math]:
    1. Set [math]k:=\min \{ m,j+1 \}[/math].
    2. While [math]k\gt 0[/math] and [math](T[1],...,T[k]) \neq (T[j-k+2],...,T[j],c)[/math], decrease [math]k[/math] by [math]1[/math].
    3. Set [math]\Delta[j,c]:=k[/math].

Proof: Nothing is to show for [math]R[/math] and [math]q[/math], so consider [math]\Delta[/math]. We have to show that the above-described intended semantics of [math]\Delta[/math] is indeed fulfilled, because these intended semantics of [math]\Delta[/math] will be the basis for the correctness proof of the induction step.

Correctness of Step 3 is easy to see, so consider Step 4.

According to its intended semantics, [math]\Delta [j,c][/math] can neither be larger than [math]m[/math] nor larger than [math]j+1[/math]; in fact, any string longer than [math]m[/math] is not a prefix of [math]T[/math], and any string longer than [math]j+1[/math] is not a suffix of [math](T[1],...,T[j],c)[/math]. This observation justifies the initialization of [math]k[/math] in Step 4.1.

Now, the countdown of [math]k[/math] in Step 4.2 will terminate at the moment when, for the first time, [math]k[/math] fulfills [math](T[1],...,T[k])=(T[j-k+2],...,T[j],c)[/math]. Note that this equality is tantamount to the statement that the prefix [math](T[1],...,T[k])[/math] of [math]T[/math] is a suffix of [math](T[j-k+2],...,T[j],c)[/math]. Due to the countdown, [math]k[/math] is largest, so [math]\Delta [j,c][/math] conforms to its intended semantics. On the other hand, if the countdown terminates with [math]k=0[/math], there was no such non-empty substring, so Step 4.3 sets [math]\Delta [j,c]=0[/math], which conforms to the intended semantics in this case as well.

Induction step

Abstract view:

  1. The current maximum prefix length [math]q[/math] is to be updated to reflect the situation after reading [math]S[i][/math].
  2. If [math]q=m[/math], [math]i[/math] completes an occurrence of [math]T[/math] in [math]S[/math], so the start index of this occurrence is to be appended to [math]R[/math].

Implementation:

  1. Set [math]q:=\Delta [q,S[i]][/math].
  2. If [math]q=m[/math], append [math]i-m+1[/math] to [math]R[/math].

Correctness: If the invariant on [math]q[/math] is maintained by Step 1, it is [math]q=m[/math] if, and only if, [math]i[/math] is the last index of an occurrence of [math]T[/math]. This proves correctness of Step 2 and thus maintenance of the first invariant, provided maintenance of the second invariant (the invariant on [math]q[/math]) is proved. So it remains to show that the invariant on [math]q[/math] is indeed maintained.

Recall that the correctness proof of the induction basis has proved that [math]\Delta[/math] fulfills the intended semantics as described above. Therefore, the new value of [math]q[/math] after the [math]i[/math]-th iteration is the largest length [math]k[/math] of a prefix [math](T[1],...,T[k][/math] that is identical to [math](T[i-k+2],...,T[i],S[i])[/math]. As described above, this is the new "most advanced candidate", so the new value of [math]q[/math] is correct.

Complexity

Statement: The worst-case complexity is in [math]\mathcal{O}(m^3 * |\Sigma | + n)[/math].

Proof: The first summand results from Step 4 of the induction basis. In fact, for each [math]j \in \{ 1,...,m\}[/math] and each [math]c \in \Sigma[/math], the value of [math]\Delta [q,c][/math] is to be computed. The countdown to compute [math]\Delta [q,c][/math] takes at most [math]m[/math] iterations. For the comparison of the two substrings within one step of the countdown, at most [math]m[/math] pairs of characters have to be compared.

The second summand is the asymptotic complexity of the main loop: the number of iterations is at most [math]n[/math], and each iteration takes constant time.

Further information

There is a more sophisticated approach to computing [math]\Delta[/math], which results in [math]\mathcal{O}(m*|\Sigma |)[/math] worst-case complexity