Simple string matching algorithm

From Algowiki
Jump to: navigation, search
Simple string matching

Algorithmic problem: One-dimensional string matching

Abstract view

Type of algorithm: loop

Auxiliary data An ordered sequence [math]I[/math] of natural numbers.

Invariant: After [math]i\ge 0[/math] iterations:

  1. In ascending order, [math]R[/math] contains exactly the start indexes of all occurrences of [math]T[/math] in [math]S[/math] that lie completely in the substring [math](S[1],...,S[i])[/math] of [math]S[/math]. In other words, [math]R[/math] contains the start indexes in the range [math]S[1],...,S[i-m+1][/math].
  2. In ascending order, [math]I[/math] contains exactly the start indexes of all "candidates," that is, all occurrences of [math]T[/math] in [math]S[/math] that lie partially in the substring [math](S[1],...,S[i])[/math] of [math]S[/math]. In other words, [math]I[/math] contains the start indexes in the range [math]S[i-m+2],...,S[i][/math].

Variant: [math]i[/math] increases by [math]1[/math].

Break condition: [math]n[/math] iterations completed.

Induction basis

Abstract view: [math]R[/math] and [math]I[/math] have to be empty.

Implementation: obvious.

Proof: Nothing to show.

Induction step

Abstract view:

  1. For each start index [math]j[/math] in [math]I[/math], we decide whether this is still a candidate, that is, whether [math]S[i]=T[i-j+1][/math]. If not, we remove this start index from [math]I[/math].
  2. Afterwards:
    1. If [math]S[i]=T[1][/math], [math]i[/math] is a new candidate and is thus appended at the tail of [math]I[/math].
    2. If [math]I\neq\emptyset[/math] and [math]i-m+1[/math] is the head element of [math]I[/math], this start index is removed from [math]I[/math] and appended at the tail of [math]R[/math].

Implementation: Obvious.

Correctness:

  1. Due to the second invariant, the induction hypothesis implies that all elements of [math]I[/math] immediately before the [math]i[/math]-th iteration must be from the set [math]\{i-m+1,...,i-1\}[/math]. Therefore, if [math]i-m+1[/math] is in [math]I[/math], it is the smallest element. Since the elements of [math]I[/math] are in ascending order, [math]i-m+1[/math] is then at the head of [math]I[/math].
  2. A start index [math]j[/math] in [math]I[/math] is to be dropped if, and only if, it turns out not to be a promising candidate anymore. As the induction hypothesis implies [math](S[j],...,S[i-1])=(T[1],...,T[i-j])[/math], this is the case if, and only if, [math]S[i]\neq T[i-j+1][/math].
  3. Clearly, it would be incorrect to transfer any element of [math]I[/math] except [math]i-m+1[/math] to [math]R[/math]. Thus, we are right to focus on [math]i-m+1[/math] in Step 2.2. Now, [math]i-m+1[/math] is in [math]I[/math] if, and only if, (1) it was present immediately before the [math]i[/math]-th iteration and (2) it has survived the first step of the [math]i[/math]-th iteration. The induction hypothesis implies [math](S[i-m+1],...,S[i-1])=(T[1],...,T[m-1])[/math], so it is correct to transfer [math]i-m+1[/math] to [math]R[/math] if, and only if, it is [math]S[i]=T[m][/math] as well.
  4. Clearly, it would also be incorrect to add any element to [math]I[/math] except [math]i[/math]. So we are right to focus on [math]i[/math] in Step 2.1.By induction hypothesis, all elements of [math]I[/math] are less than [math]i[/math], so we are also right to append [math]i[/math] at the tail of [math]I[/math] in case [math]S[i]=T[1][/math].

Complexity

Statement: Let [math]r\in \mathbb{N}[/math] denote the maximal number of candidates to be considered simultaneously. Then the worst-case run time is in [math]\mathcal{O}(n\cdot r)\subseteq\mathcal{O}(n\cdot m)[/math].

Proof: Obviously, the first step of an iteration takes [math]\mathcal{O}(r)[/math] time, and the other steps take constant time. The loop has [math]n[/math] iterations. Of course, it is [math]r\leq m[/math].