分类: C/C++
2008-06-04 10:27:21
The Backward Nondeterministic Dawg Matching (BNDM) algorithm uses the same search approach as BDM, but the factor is searched using bit-parallelism. Compared to the original BDM algorithm, BNDM is simpler, uses less memory, has more locality of reference, and is easier to extend to more complex patterns ().
The idea is to maintain a set of positions on the reverse pattern that are the beginning positions of the string u read in the text. This set is stored with 0 and 1 as with Shift-And. The number 1, representing an active state at position j of p, means that the factor pj…pj+∣u∣−1 is equal to u. shows this relationship. If the pattern is of size less that w, then the set fits in a computer word D = dm…d1.
We need to update the array D to D′ after reading a new character σ of the text. A state j of D′ is active if it corresponds to the beginning of the string σu in the pattern; that is, if
u began at position j + 1 in the pattern, which means that the (j + 1)-th position in D is active, and
σ is in position j in the pattern.
If we precompute a table B exactly as for Shift-And that associates to each letter of p the set of its positions in p with a bit mask, then we obtain D′ from D by the following formula:
However, there is a problem with the initialization. We would like to mark in the initial table D that each position of D matches the empty string, which means D should be 1m. But in that case, the first shift will give (D << 1) = 1m−10 and we will miss the first factor, which corresponds to the entire word. The simplest solution would be to take D of size m + 1, initialized to 1m+1. However, it reduces to w − 1 the maximum length of the string that can be searched. Instead we split formula (2.2) into two parts.
We first perform the operation D′1 ← D & B[σ] and verify the match, and then we perform the register shift D′ ← D′1 << 1. The initialization is then D = 1m. A string read in the text is a prefix of p if the first position is active, that is, if in D′1 the position dm is active.
The BNDM algorithm is the same as BDM, except that the factor search is done with the bit-parallelism technique. Each time the bit dm is active, the position of the window is stored in the variable last. Pseudo-code for the algorithm is given in .
BNDM (p = p1p2...pm, T = t1t2...tn) 1. Preprocessing 2. For c ∊ Σ Do B[c] ← 0m 3. For j ∊ 1 … m Do B[pj] ← B[pj] | 0j-1 10m-j 4. Searching 5. pos ← 0 6. While pos ≲ n - m Do 7. j ← m, last ← m 8. D ← 1m 9. While D ≠ 0m Do 10. D ← D & B[tpos+j] 11. j ← j - 1 12. If D & 10m-1 ≠ om Then 13. If j > 0 Then last ← j 14. Else report an occurrence at pos + 1 15. End of if 16. D ← D << 1 17. End of while 18. pod ← pos + last 19. End of while
BNDM has the same worst-case complexity O(mn) as BDM, and also the same optimal average complexity O(n log∣Σ∣ m/m).
From an automaton point of view, the bit-parallel factor search is a simulation of a nondeterministic automaton that recognizes all suffixes of the reverse pattern. For example, if we search the pattern "announce", we simulate the automaton shown in . It turns out that the minimal deterministic version of this automaton is the suffix automaton used in the classic BDM. The difference between BNDM and BDM is conceptually the same as that between Shift-Or and KMP. The former simulates a non-deterministic automaton using bit-parallelism, and the latter first obtains a representation of the deterministic automaton.
Example of BNDM using English We search for the string "announce" in the text "CPM_annual_conference_announce".
al_conference_announcelast ← 8
CPM_annu rence_announce last ← 8
CPM_annual_confe nounce last ← 8
The position d8 is active, but j > 0, so we set last ← 6.
CPM_annual_conference_ last ← 8
The position d8 is active and j = 0, so we mark an occurrence.
Example of BNDM using DNA We search for the string ATATA in the sequence AGATACGATATATAC.
CGATATATAC last ← 5
The position d5 is active, but j > 0, so we set last ← 2.
AG ATATATAC last ← 5
AGATACG TAC last ← 5
The position d5 is active, but j > 0, so we set last ← 2.
The position d5 is active and j = 0, so we mark an occurrence.
AGATACGAT C last ← 5
The position d5 is active, but j > 0, so we set last ← − 2.
The position d5 is active and j = 0, so we mark a new occurrence. We then shift to pos + last and pos > n − m, so the search stops.