Simple string matching algorithm

From Algowiki
Jump to: navigation, search
Simple string matching

Algorithmic problem: One-dimensional string matching

Abstract view

Type of algorithm: loop

Auxiliary data An ordered sequence of natural numbers.

Invariant: After iterations:

  1. In ascending order, contains exactly the start indexes of all occurrences of in that lie completely in the substring of . In other words, contains the start indexes in the range .
  2. In ascending order, contains exactly the start indexes of all "candidates," that is, all occurrences of in that lie partially in the substring of . In other words, contains the start indexes in the range .

Variant: increases by .

Break condition: iterations completed.

Induction basis

Abstract view: and have to be empty.

Implementation: obvious.

Proof: Nothing to show.

Induction step

Abstract view:

  1. For each start index in , we decide whether this is still a candidate, that is, whether . If not, we remove this start index from .
  2. Afterwards:
    1. If , is a new candidate and is thus appended at the tail of .
    2. If and is the head element of , this start index is removed from and appended at the tail of .

Implementation: Obvious.

Correctness:

  1. Due to the second invariant, the induction hypothesis implies that all elements of immediately before the -th iteration must be from the set . Therefore, if is in , it is the smallest element. Since the elements of are in ascending order, is then at the head of .
  2. A start index in is to be dropped if, and only if, it turns out not to be a promising candidate anymore. As the induction hypothesis implies , this is the case if, and only if, .
  3. Clearly, it would be incorrect to transfer any element of except to . Thus, we are right to focus on in Step 2.2. Now, is in if, and only if, (1) it was present immediately before the -th iteration and (2) it has survived the first step of the -th iteration. The induction hypothesis implies , so it is correct to transfer to if, and only if, it is as well.
  4. Clearly, it would also be incorrect to add any element to except . So we are right to focus on in Step 2.1.By induction hypothesis, all elements of are less than , so we are also right to append at the tail of in case .

Complexity

Statement: Let denote the maximal number of candidates to be considered simultaneously. Then the worst-case run time is in .

Proof: Obviously, the first step of an iteration takes time, and the other steps take constant time. The loop has iterations. Of course, it is .