BNDM algorithm 代码实现-yjm0573-ChinaUnix博客

梦里水乡

首页　| 　博文目录　| 　关于我

yjm0573

博客访问： 210735
博文数量： 35
博客积分： 2691
博客等级：少校
技术积分： 527
用户组：普通用户
注册时间： 2008-01-11 09:42

文章分类

全部博文（35）

shell（5）
未分配的博文（30）

文章存档

2012年（5）

2010年（6）

2009年（2）

2008年（22）

我的朋友

相关博文

BNDM algorithm 代码实现

分类： C/C++

2008-06-04 10:27:21

/*
* A class for string matching use the
* Backward Nondeterministic Dawg Matching algorithm
*
* 2008/2/26
*
* By Yao Jianming
*/

#ifndef _BNDMMATCH_H
#define _BNDMMATCH_H

#ifndef STRMATCH_MAXCHER
#define STRMATCH_MAXCHER 256
#endif

#include

class strmatch {
private:
std::string str;
int mask[STRMATCH_MAXCHER];

void makemasktable();

public:
strmatch(const char* p) : str(p) {makemasktable();}

void* search(const void* haystack, size_t haystack_len) const;
};

#endif

#include "strmatch.h"

void strmatch::makemasktable()
{
memset(mask, 0, STRMATCH_MAXCHER*sizeof(int));

const char* patt = str.c_str();
int len = str.size();

    int s =1;
    for(int i = len - 1; i >= 0; --i) {
        mask[patt[i]] |= s;
        s <<= 1;
    }
}

void* strmatch::search(const void* haystack, size_t haystack_len) const
{
    const char* text = (const char*)haystack;
    const char* patt = str.c_str();
    int len = str.size();
    int k = 0;

    while(k <= haystack_len - len) {
        int i, d = ~0, shift = len;
        for(i = len - 1; i >= 0 && d != 0; --i) {
            d &= mask[text[k+i]];
            d <<= 1;
        }

if(d != 0)
return (void*)(text + k + i + 1);

if(text[k+i+2] == patt[0])
shift = i + 2;

k += shift;
}

return (void*)NULL;
}

#include
#include "strmatch.h"

main()
{
struct timeval start, end;
gettimeofday(&start, 0);

    strmatch str = "hello";
    char text[] = "aaaehellhelloworld";
    char* p = (char*)str.search(text, strlen(text));
    if(p != NULL) printf("%s\n", p);

    gettimeofday(&end, 0);
    long ltime = (end.tv_sec - start.tv_sec) * 1000000 +
                    end.tv_usec - start.tv_usec;
    printf("ltime: %ld\n", ltime);
}

Backward Nondeterministic Dawg Matching algorithm

The Backward Nondeterministic Dawg Matching (BNDM) algorithm uses the same search approach as BDM, but the factor is searched using bit-parallelism. Compared to the original BDM algorithm, BNDM is simpler, uses less memory, has more locality of reference, and is easier to extend to more complex patterns ().

The idea is to maintain a set of positions on the reverse pattern that are the beginning positions of the string u read in the text. This set is stored with 0 and 1 as with Shift-And. The number 1, representing an active state at position j of p, means that the factor p_j…p_{j+∣u∣−1} is equal to u. shows this relationship. If the pattern is of size less that w, then the set fits in a computer word D = d_m…d₁.

Figure 2.15: Bit-parallel factor search. The table D keeps a list of the positions in p where the factor u begins.

We need to update the array D to D′ after reading a new character σ of the text. A state j of D′ is active if it corresponds to the beginning of the string σu in the pattern; that is, if

u began at position j + 1 in the pattern, which means that the (j + 1)-th position in D is active, and
σ is in position j in the pattern.

If we precompute a table B exactly as for Shift-And that associates to each letter of p the set of its positions in p with a bit mask, then we obtain D′ from D by the following formula:

However, there is a problem with the initialization. We would like to mark in the initial table D that each position of D matches the empty string, which means D should be 1^m. But in that case, the first shift will give (D << 1) = 1^m−10 and we will miss the first factor, which corresponds to the entire word. The simplest solution would be to take D of size m + 1, initialized to 1^m+1. However, it reduces to w − 1 the maximum length of the string that can be searched. Instead we split formula (2.2) into two parts.

We first perform the operation D′₁ ← D & B[σ] and verify the match, and then we perform the register shift D′ ← D′₁ << 1. The initialization is then D = 1^m. A string read in the text is a prefix of p if the first position is active, that is, if in D′₁ the position d_m is active.

The BNDM algorithm is the same as BDM, except that the factor search is done with the bit-parallelism technique. Each time the bit d_m is active, the position of the window is stored in the variable last. Pseudo-code for the algorithm is given in .

BNDM (p = p₁p₂...p_m, T = t₁t₂...t_n)
 1.     Preprocessing
 2.         For c ∊ Σ Do B[c] ← 0^m
 3.         For j ∊ 1 … m Do B[p_j] ← B[p_j] | 0^j-1 10^m-j
4.      Searching
5.          pos ← 0
6.          While pos ≲ n - m Do
7.                j ← m, last ← m
8.                D ← 1^m
9.                While D ≠ 0^m Do
10.                     D ← D & B[t_pos+j]
11.                     j ← j - 1
12.                     If D & 10^m-1 ≠ o^m Then
13.                           If j > 0 Then last ← j
14.                           Else report an occurrence at pos + 1
15.                     End of if
16.                     D ← D << 1
17.               End of while
18.               pod ← pos + last
19.         End of while

Figure 2.16: Bit-parallel pseudo-code for BNDM.

BNDM has the same worst-case complexity O(mn) as BDM, and also the same optimal average complexity O(n log_∣Σ∣ m/m).

From an automaton point of view, the bit-parallel factor search is a simulation of a nondeterministic automaton that recognizes all suffixes of the reverse pattern. For example, if we search the pattern "announce", we simulate the automaton shown in . It turns out that the minimal deterministic version of this automaton is the suffix automaton used in the classic BDM. The difference between BNDM and BDM is conceptually the same as that between Shift-Or and KMP. The former simulates a non-deterministic automaton using bit-parallelism, and the latter first obtains a representation of the deterministic automaton.

Figure 2.17: Nondeterministic automaton recognizing all factors of the reverse string of "announce".

Example of BNDM using English We search for the string "announce" in the text "CPM_annual_conference_announce".

al_conference_announcelast ← 8
CPM_annu rence_announce last ← 8
CPM_annual_confe nounce last ← 8

The position d₈ is active, but j > 0, so we set last ← 6.
CPM_annual_conference_ last ← 8

The position d₈ is active and j = 0, so we mark an occurrence.

Example of BNDM using DNA We search for the string ATATA in the sequence AGATACGATATATAC.

CGATATATAC last ← 5

The position d₅ is active, but j > 0, so we set last ← 2.
AG ATATATAC last ← 5
AGATACG TAC last ← 5

The position d₅ is active, but j > 0, so we set last ← 2.

The position d₅ is active and j = 0, so we mark an occurrence.
AGATACGAT C last ← 5

The position d₅ is active, but j > 0, so we set last ← − 2.

The position d₅ is active and j = 0, so we mark a new occurrence. We then shift to pos + last and pos > n − m, so the search stops.

阅读(1830) | 评论(0) | 转发(0) |

上一篇：内存池STL代码摘要

下一篇：快速排序算法

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6