Simple string matching algorithm: Difference between revisions
Jump to navigation
Jump to search
No edit summary |
|||
(11 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
[[Category:Videos]] | |||
[[Category:Algorithm]] | [[Category:Algorithm]] | ||
[[Category:Main Algorithm]] | [[Category:Main Algorithm]] | ||
{{#ev:youtube|https://www.youtube.com/watch?v=5p4fZGRaYuo|500|right|Simple string matching|frame}} | |||
'''Algorithmic problem:''' [[One-dimensional string matching]] | '''Algorithmic problem:''' [[One-dimensional string matching]] | ||
''' | ==Abstract view== | ||
'''Type of algorithm:''' loop | |||
''' | '''Auxiliary data''' An [[ordered sequence]] <math>I</math> of natural numbers. | ||
'''Invariant:''' After <math>i\ge 0</math> iterations: | |||
# In ascending order, <math>R</math> contains exactly the start indexes of all occurrences of <math>T</math> in <math>S</math> that lie completely in the substring <math>(S[1],...,S[i])</math> of <math>S</math>. In other words, <math>R</math> contains the start indexes in the range <math>S[1],...,S[i-m+1]</math>. | |||
# In ascending order, <math>I</math> contains exactly the start indexes of all "candidates," that is, all occurrences of <math>T</math> in <math>S</math> that lie partially in the substring <math>(S[1],...,S[i])</math> of <math>S</math>. In other words, <math>I</math> contains the start indexes in the range <math>S[i-m+2],...,S[i]</math>. | |||
''' | '''Variant:''' <math>i</math> increases by <math>1</math>. | ||
'''Break condition:''' <math>n</math> iterations completed. | |||
==Induction basis== | ==Induction basis== | ||
'''Abstract view:''' <math>R</math> and <math>I</math> have to be empty. | |||
'''Implementation:''' obvious. | |||
'''Proof:''' Nothing to show. | |||
==Induction step== | ==Induction step== | ||
'''Abstract view:''' | |||
# For each start index <math>j</math> in <math>I</math>, we decide whether this is still a candidate, that is, whether <math>S[i]=T[i-j+1]</math>. If not, we remove this start index from <math>I</math>. | |||
# Afterwards: | |||
## If <math>S[i]=T[1]</math>, <math>i</math> is a new candidate and is thus appended at the tail of <math>I</math>. | |||
## If <math>I\neq\emptyset</math> and <math>i-m+1</math> is the head element of <math>I</math>, this start index is removed from <math>I</math> and appended at the tail of <math>R</math>. | |||
'''Implementation:''' Obvious. | |||
'''Correctness:''' | |||
# Due to the second invariant, the induction hypothesis implies that all elements of <math>I</math> immediately before the <math>i</math>-th iteration must be from the set <math>\{i-m+1,...,i-1\}</math>. Therefore, if <math>i-m+1</math> is in <math>I</math>, it is the smallest element. Since the elements of <math>I</math> are in ascending order, <math>i-m+1</math> is then at the head of <math>I</math>. | |||
# A start index <math>j</math> in <math>I</math> is to be dropped if, and only if, it turns out not to be a promising candidate anymore. As the induction hypothesis implies <math>(S[j],...,S[i-1])=(T[1],...,T[i-j])</math>, this is the case if, and only if, <math>S[i]\neq T[i-j+1]</math>. | |||
# Clearly, it would be incorrect to transfer any element of <math>I</math> except <math>i-m+1</math> to <math>R</math>. Thus, we are right to focus on <math>i-m+1</math> in Step 2.2. Now, <math>i-m+1</math> is in <math>I</math> if, and only if, (1) it was present immediately before the <math>i</math>-th iteration and (2) it has survived the first step of the <math>i</math>-th iteration. The induction hypothesis implies <math>(S[i-m+1],...,S[i-1])=(T[1],...,T[m-1])</math>, so it is correct to transfer <math>i-m+1</math> to <math>R</math> if, and only if, it is <math>S[i]=T[m]</math> as well. | |||
# Clearly, it would also be incorrect to add any element to <math>I</math> except <math>i</math>. So we are right to focus on <math>i</math> in Step 2.1.By induction hypothesis, all elements of <math>I</math> are less than <math>i</math>, so we are also right to append <math>i</math> at the tail of <math>I</math> in case <math>S[i]=T[1]</math>. | |||
==Complexity== | ==Complexity== | ||
'''Statement:''' Let <math>r\in \mathbb{N}</math> denote the maximal number of candidates to be considered simultaneously. Then the worst-case run time is in <math>\mathcal{O}(n\cdot r)\subseteq\mathcal{O}(n\cdot m)</math>. | |||
'''Proof:''' Obviously, the first step of an iteration takes <math>\mathcal{O}(r)</math> time, and the other steps take constant time. The loop has <math>n</math> iterations. Of course, it is <math>r\leq m</math>. |
Latest revision as of 10:13, 21 September 2015
Algorithmic problem: One-dimensional string matching
Abstract view
Type of algorithm: loop
Auxiliary data An ordered sequence
Invariant: After
- In ascending order,
contains exactly the start indexes of all occurrences of in that lie completely in the substring of . In other words, contains the start indexes in the range . - In ascending order,
contains exactly the start indexes of all "candidates," that is, all occurrences of in that lie partially in the substring of . In other words, contains the start indexes in the range .
Variant:
Break condition:
Induction basis
Abstract view:
Implementation: obvious.
Proof: Nothing to show.
Induction step
Abstract view:
- For each start index
in , we decide whether this is still a candidate, that is, whether . If not, we remove this start index from . - Afterwards:
- If
, is a new candidate and is thus appended at the tail of . - If
and is the head element of , this start index is removed from and appended at the tail of .
- If
Implementation: Obvious.
Correctness:
- Due to the second invariant, the induction hypothesis implies that all elements of
immediately before the -th iteration must be from the set . Therefore, if is in , it is the smallest element. Since the elements of are in ascending order, is then at the head of . - A start index
in is to be dropped if, and only if, it turns out not to be a promising candidate anymore. As the induction hypothesis implies , this is the case if, and only if, . - Clearly, it would be incorrect to transfer any element of
except to . Thus, we are right to focus on in Step 2.2. Now, is in if, and only if, (1) it was present immediately before the -th iteration and (2) it has survived the first step of the -th iteration. The induction hypothesis implies , so it is correct to transfer to if, and only if, it is as well. - Clearly, it would also be incorrect to add any element to
except . So we are right to focus on in Step 2.1.By induction hypothesis, all elements of are less than , so we are also right to append at the tail of in case .
Complexity
Statement: Let
Proof: Obviously, the first step of an iteration takes