# Simple string matching algorithm

Simple string matching

Algorithmic problem: One-dimensional string matching

## Abstract view

Type of algorithm: loop

Auxiliary data An ordered sequence $I$ of natural numbers.

Invariant: After $i\ge 0$ iterations:

1. In ascending order, $R$ contains exactly the start indexes of all occurrences of $T$ in $S$ that lie completely in the substring $(S,...,S[i])$ of $S$. In other words, $R$ contains the start indexes in the range $S,...,S[i-m+1]$.
2. In ascending order, $I$ contains exactly the start indexes of all "candidates," that is, all occurrences of $T$ in $S$ that lie partially in the substring $(S,...,S[i])$ of $S$. In other words, $I$ contains the start indexes in the range $S[i-m+2],...,S[i]$.

Variant: $i$ increases by $1$.

Break condition: $n$ iterations completed.

## Induction basis

Abstract view: $R$ and $I$ have to be empty.

Implementation: obvious.

Proof: Nothing to show.

## Induction step

Abstract view:

1. For each start index $j$ in $I$, we decide whether this is still a candidate, that is, whether $S[i]=T[i-j+1]$. If not, we remove this start index from $I$.
2. Afterwards:
1. If $S[i]=T$, $i$ is a new candidate and is thus appended at the tail of $I$.
2. If $I\neq\emptyset$ and $i-m+1$ is the head element of $I$, this start index is removed from $I$ and appended at the tail of $R$.

Implementation: Obvious.

Correctness:

1. Due to the second invariant, the induction hypothesis implies that all elements of $I$ immediately before the $i$-th iteration must be from the set $\{i-m+1,...,i-1\}$. Therefore, if $i-m+1$ is in $I$, it is the smallest element. Since the elements of $I$ are in ascending order, $i-m+1$ is then at the head of $I$.
2. A start index $j$ in $I$ is to be dropped if, and only if, it turns out not to be a promising candidate anymore. As the induction hypothesis implies $(S[j],...,S[i-1])=(T,...,T[i-j])$, this is the case if, and only if, $S[i]\neq T[i-j+1]$.
3. Clearly, it would be incorrect to transfer any element of $I$ except $i-m+1$ to $R$. Thus, we are right to focus on $i-m+1$ in Step 2.2. Now, $i-m+1$ is in $I$ if, and only if, (1) it was present immediately before the $i$-th iteration and (2) it has survived the first step of the $i$-th iteration. The induction hypothesis implies $(S[i-m+1],...,S[i-1])=(T,...,T[m-1])$, so it is correct to transfer $i-m+1$ to $R$ if, and only if, it is $S[i]=T[m]$ as well.
4. Clearly, it would also be incorrect to add any element to $I$ except $i$. So we are right to focus on $i$ in Step 2.1.By induction hypothesis, all elements of $I$ are less than $i$, so we are also right to append $i$ at the tail of $I$ in case $S[i]=T$.

## Complexity

Statement: Let $r\in \mathbb{N}$ denote the maximal number of candidates to be considered simultaneously. Then the worst-case run time is in $\mathcal{O}(n\cdot r)\subseteq\mathcal{O}(n\cdot m)$.

Proof: Obviously, the first step of an iteration takes $\mathcal{O}(r)$ time, and the other steps take constant time. The loop has $n$ iterations. Of course, it is $r\leq m$.