# String matching based on finite automaton

Algorithmic problem: One-dimensional string matching

Prerequisites:

Type of algorithm: loop

Auxiliary data:

1. A current maximum prefix length $q\in \{ 0,...,m\}$.
2. A look-up table $\Delta$ with rows $\{ 0,...,m\}$ and one column for each character in the alphabet $\Sigma$. Required semantics: The value $\Delta [j,c]$ is the length of the longest prefix of $T$ that is also a suffix of $(T[1],...,T[j],c)$. In other words, $\Delta[j,c]$ is the next value of $q$ if $c$ is the next character seen in $S$.

## Abstract view

Invariant: After $i \ge 0$ iterations:

1. In ascending order, $R$ contains the start indexes of all occurrences of $T$ in $S$ that lie completely in $(S[1],...,S[i])$. In other words, $R$ contains the start indexes in the range $(1,...,i-m+1])$.
2. The value of $q$ is the largest value $k$ such that $(S[i-k+1],...,S[i])=(T[1],...,T[k])$.

To understand the rationale of the second invariant, note that $(S[i-k+1],...,S[i])=(T[1],...,T[k])$ is equivalent to the informal statement that $(S[i-k+1],...,S[i])$ is a candidate for an occurrence of $T$. In particular, $i-q+1$ is always the "most advanced" candidate (and $q=0$ means that, at the moment, there is no candidate at all).

Variant: $i$ increases by $1$.

Break condition: $n$ iterations completed.

## Induction basis

Abstract view:

1. $R$ is empty.
2. The initial value of $q$ is zero.
3. The look-up table $\Delta$ must be built according to the required semantics (see above).

Implementation:

1. Set $R:=\emptyset$.
2. Set $q:=0$.
3. For $c\in \Sigma$: If $T[1]=c$, set $\Delta [0,c]:=1$; otherwise, set $\Delta[0,c]:=0$.
4. For $j\in \{ 1,...,m\}$ and $c \in \Sigma$:
1. Set $k:=\min \{ m,j+1 \}$.
2. While $k\gt 0$ and $(T[1],...,T[k]) \neq (T[j-k+2],...,T[j],c)$, decrease $k$ by $1$.
3. Set $\Delta[j,c]:=k$.

Proof: Nothing is to show for $R$ and $q$, so consider $\Delta$. We have to show that the above-described intended semantics of $\Delta$ is indeed fulfilled, because these intended semantics of $\Delta$ will be the basis for the correctness proof of the induction step.

Correctness of Step 3 is easy to see, so consider Step 4.

According to its intended semantics, $\Delta [j,c]$ can neither be larger than $m$ nor larger than $j+1$; in fact, any string longer than $m$ is not a prefix of $T$, and any string longer than $j+1$ is not a suffix of $(T[1],...,T[j],c)$. This observation justifies the initialization of $k$ in Step 4.1.

Now, the countdown of $k$ in Step 4.2 will terminate at the moment when, for the first time, $k$ fulfills $(T[1],...,T[k])=(T[j-k+2],...,T[j],c)$. Note that this equality is tantamount to the statement that the prefix $(T[1],...,T[k])$ of $T$ is a suffix of $(T[j-k+2],...,T[j],c)$. Due to the countdown, $k$ is largest, so $\Delta [j,c]$ conforms to its intended semantics. On the other hand, if the countdown terminates with $k=0$, there was no such non-empty substring, so Step 4.3 sets $\Delta [j,c]=0$, which conforms to the intended semantics in this case as well.

## Induction step

Abstract view:

1. The current maximum prefix length $q$ is to be updated to reflect the situation after reading $S[i]$.
2. If $q=m$, $i$ completes an occurrence of $T$ in $S$, so the start index of this occurrence is to be appended to $R$.

Implementation:

1. Set $q:=\Delta [q,S[i]]$.
2. If $q=m$, append $i-m+1$ to $R$.

Correctness: If the invariant on $q$ is maintained by Step 1, it is $q=m$ if, and only if, $i$ is the last index of an occurrence of $T$. This proves correctness of Step 2 and thus maintenance of the first invariant, provided maintenance of the second invariant (the invariant on $q$) is proved. So it remains to show that the invariant on $q$ is indeed maintained.

Recall that the correctness proof of the induction basis has proved that $\Delta$ fulfills the intended semantics as described above. Therefore, the new value of $q$ after the $i$-th iteration is the largest length $k$ of a prefix $(T[1],...,T[k]$ that is identical to $(T[i-k+2],...,T[i],S[i])$. As described above, this is the new "most advanced candidate", so the new value of $q$ is correct.

## Complexity

Statement: The worst-case complexity is in $\mathcal{O}(m^3 * |\Sigma | + n)$.

Proof: The first summand results from Step 4 of the induction basis. In fact, for each $j \in \{ 1,...,m\}$ and each $c \in \Sigma$, the value of $\Delta [q,c]$ is to be computed. The countdown to compute $\Delta [q,c]$ takes at most $m$ iterations. For the comparison of the two substrings within one step of the countdown, at most $m$ pairs of characters have to be compared.

The second summand is the asymptotic complexity of the main loop: the number of iterations is at most $n$, and each iteration takes constant time.

## Further information

There is a more sophisticated approach to computing $\Delta$, which results in $\mathcal{O}(m*|\Sigma |)$ worst-case complexity