Simple string matching algorithm: Difference between revisions

Latest revision as of 10:13, 21 September 2015

Simple string matching

Algorithmic problem: One-dimensional string matching

Abstract view

Type of algorithm: loop

Auxiliary data An ordered sequence $I$ of natural numbers.

Invariant: After $i \geq 0$ iterations:

In ascending order, $R$ contains exactly the start indexes of all occurrences of $T$ in $S$ that lie completely in the substring $(S [1], . . ., S [i])$ of $S$ . In other words, $R$ contains the start indexes in the range $S [1], . . ., S [i - m + 1]$ .
In ascending order, $I$ contains exactly the start indexes of all "candidates," that is, all occurrences of $T$ in $S$ that lie partially in the substring $(S [1], . . ., S [i])$ of $S$ . In other words, $I$ contains the start indexes in the range $S [i - m + 2], . . ., S [i]$ .

Variant: $i$ increases by $1$ .

Break condition: $n$ iterations completed.

Induction basis

Abstract view: $R$ and $I$ have to be empty.

Implementation: obvious.

Proof: Nothing to show.

Induction step

Abstract view:

For each start index $j$ in $I$ , we decide whether this is still a candidate, that is, whether $S [i] = T [i - j + 1]$ . If not, we remove this start index from $I$ .
Afterwards:
1. If $S [i] = T [1]$ , $i$ is a new candidate and is thus appended at the tail of $I$ .
2. If $I \neq \emptyset$ and $i - m + 1$ is the head element of $I$ , this start index is removed from $I$ and appended at the tail of $R$ .

Implementation: Obvious.

Correctness:

Due to the second invariant, the induction hypothesis implies that all elements of $I$ immediately before the $i$ -th iteration must be from the set ${i - m + 1, . . ., i - 1}$ . Therefore, if $i - m + 1$ is in $I$ , it is the smallest element. Since the elements of $I$ are in ascending order, $i - m + 1$ is then at the head of $I$ .
A start index $j$ in $I$ is to be dropped if, and only if, it turns out not to be a promising candidate anymore. As the induction hypothesis implies $(S [j], . . ., S [i - 1]) = (T [1], . . ., T [i - j])$ , this is the case if, and only if, $S [i] \neq T [i - j + 1]$ .
Clearly, it would be incorrect to transfer any element of $I$ except $i - m + 1$ to $R$ . Thus, we are right to focus on $i - m + 1$ in Step 2.2. Now, $i - m + 1$ is in $I$ if, and only if, (1) it was present immediately before the $i$ -th iteration and (2) it has survived the first step of the $i$ -th iteration. The induction hypothesis implies $(S [i - m + 1], . . ., S [i - 1]) = (T [1], . . ., T [m - 1])$ , so it is correct to transfer $i - m + 1$ to $R$ if, and only if, it is $S [i] = T [m]$ as well.
Clearly, it would also be incorrect to add any element to $I$ except $i$ . So we are right to focus on $i$ in Step 2.1.By induction hypothesis, all elements of $I$ are less than $i$ , so we are also right to append $i$ at the tail of $I$ in case $S [i] = T [1]$ .

Complexity

Statement: Let $r \in N$ denote the maximal number of candidates to be considered simultaneously. Then the worst-case run time is in $O (n \cdot r) \subseteq O (n \cdot m)$ .

Proof: Obviously, the first step of an iteration takes $O (r)$ time, and the other steps take constant time. The loop has $n$ iterations. Of course, it is $r \leq m$ .

@@ Line 1: / Line 1: @@
-__NOTOC__
+[[Category:Videos]]
 [[Category:Algorithm]]
 [[Category:Main Algorithm]]
+{{#ev:youtube|https://www.youtube.com/watch?v=5p4fZGRaYuo|500|right|Simple string matching|frame}}
 '''Algorithmic problem:''' [[One-dimensional string matching]]
-'''Prerequisites:'''
+==Abstract view==
+'''Type of algorithm:''' loop
-'''Type of algorithm:''' loop
+'''Auxiliary data''' An [[ordered sequence]] <math>I</math> of natural numbers.
+'''Invariant:''' After <math>i\ge 0</math> iterations:
+# In ascending order, <math>R</math> contains exactly the start indexes of all occurrences of <math>T</math> in <math>S</math> that lie completely in the substring <math>(S[1],...,S[i])</math> of <math>S</math>. In other words, <math>R</math> contains the start indexes in the range <math>S[1],...,S[i-m+1]</math>.
+# In ascending order, <math>I</math> contains exactly the start indexes of all "candidates," that is, all occurrences of <math>T</math> in <math>S</math> that lie partially in the substring <math>(S[1],...,S[i])</math> of <math>S</math>. In other words, <math>I</math> contains the start indexes in the range <math>S[i-m+2],...,S[i]</math>.
-'''Auxiliary data''' A [[FIFO queue]] <math>I</math> of natural numbers, which always contains the start indexes of all occurrences of <math>T</math> in <math>S</math> that have been seen only partially thus far.
+'''Variant:''' <math>i</math> increases by <math>1</math>.
-==Abstract view==
+'''Break condition:''' <math>n</math> iterations completed.
 ==Induction basis==
+'''Abstract view:''' <math>R</math> and <math>I</math> have to be empty.
+'''Implementation:''' obvious.
+'''Proof:''' Nothing to show.
 ==Induction step==
+'''Abstract view:'''
+# For each start index <math>j</math> in <math>I</math>, we decide whether this is still a candidate, that is, whether <math>S[i]=T[i-j+1]</math>. If not, we remove this start index from <math>I</math>.
+# Afterwards:
+## If <math>S[i]=T[1]</math>, <math>i</math> is a new candidate and is thus appended at the tail of <math>I</math>.
+## If <math>I\neq\emptyset</math> and <math>i-m+1</math> is the head element of <math>I</math>, this start index is removed from <math>I</math> and appended at the tail of <math>R</math>.
+'''Implementation:''' Obvious.
+'''Correctness:'''
+# Due to the second invariant, the induction hypothesis implies that all elements of <math>I</math> immediately before the <math>i</math>-th iteration must be from the set <math>\{i-m+1,...,i-1\}</math>. Therefore, if <math>i-m+1</math> is in <math>I</math>, it is the smallest element. Since the elements of <math>I</math> are in ascending order, <math>i-m+1</math> is then at the head of <math>I</math>.
+# A start index <math>j</math> in <math>I</math> is to be dropped if, and only if, it turns out not to be a promising candidate anymore. As the induction hypothesis implies <math>(S[j],...,S[i-1])=(T[1],...,T[i-j])</math>, this is the case if, and only if, <math>S[i]\neq T[i-j+1]</math>.
+# Clearly, it would be incorrect to transfer any element of <math>I</math> except <math>i-m+1</math> to <math>R</math>. Thus, we are right to focus on <math>i-m+1</math> in Step 2.2. Now, <math>i-m+1</math> is in <math>I</math> if, and only if, (1) it was present immediately before the <math>i</math>-th iteration and (2) it has survived the first step of the <math>i</math>-th iteration. The induction hypothesis implies <math>(S[i-m+1],...,S[i-1])=(T[1],...,T[m-1])</math>, so it is correct to transfer <math>i-m+1</math> to <math>R</math> if, and only if, it is <math>S[i]=T[m]</math> as well.
+# Clearly, it would also be incorrect to add any element to <math>I</math> except <math>i</math>. So we are right to focus on <math>i</math> in Step 2.1.By induction hypothesis, all elements of <math>I</math> are less than <math>i</math>, so we are also right to append <math>i</math> at the tail of <math>I</math> in case <math>S[i]=T[1]</math>.
 ==Complexity==
+'''Statement:''' Let <math>r\in \mathbb{N}</math> denote the maximal number of candidates to be considered simultaneously. Then the worst-case run time is in <math>\mathcal{O}(n\cdot r)\subseteq\mathcal{O}(n\cdot m)</math>.
-==Further information==
+'''Proof:''' Obviously, the first step of an iteration takes <math>\mathcal{O}(r)</math> time, and the other steps take constant time. The loop has <math>n</math> iterations. Of course, it is <math>r\leq m</math>.

Simple string matching algorithm: Difference between revisions

Latest revision as of 10:13, 21 September 2015

Contents

Abstract view

Induction basis

Induction step

Complexity

Navigation menu

Simple string matching algorithm: Difference between revisions

Latest revision as of 10:13, 21 September 2015

Abstract view

Induction basis

Induction step

Complexity

Navigation menu

Search