Chinaunix首页 | 论坛 | 博客
  • 博客访问: 205495
  • 博文数量: 35
  • 博客积分: 2691
  • 博客等级: 少校
  • 技术积分: 527
  • 用 户 组: 普通用户
  • 注册时间: 2008-01-11 09:42
文章分类

全部博文(35)

文章存档

2012年(5)

2010年(6)

2009年(2)

2008年(22)

我的朋友

分类: C/C++

2008-06-04 10:27:21

/*
 * A class for string matching use the
 *  Backward Nondeterministic Dawg Matching algorithm
 *
 *  2008/2/26
 *
 *  By Yao Jianming
 */
#ifndef _BNDMMATCH_H
#define _BNDMMATCH_H
#ifndef STRMATCH_MAXCHER
#define STRMATCH_MAXCHER 256
#endif
#include
class strmatch {
private:
    std::string str;
    int mask[STRMATCH_MAXCHER];
    void makemasktable();
public:
    strmatch(const char* p) : str(p) {makemasktable();}
    void* search(const void* haystack, size_t haystack_len) const;
};
#endif
 
 
#include "strmatch.h"
void strmatch::makemasktable()
{
    memset(mask, 0, STRMATCH_MAXCHER*sizeof(int));
    const char* patt = str.c_str();
    int len = str.size();
    int s =1;
    for(int i = len - 1; i >= 0; --i) {
        mask[patt[i]] |= s;
        s <<= 1;
    }
}
void* strmatch::search(const void* haystack, size_t haystack_len) const
{
    const char* text = (const char*)haystack;
    const char* patt = str.c_str();
    int len = str.size();
    int k = 0;
    while(k <= haystack_len - len) {
        int i, d = ~0, shift = len;
        for(i = len - 1; i >= 0 && d != 0; --i) {
            d &= mask[text[k+i]];
            d <<= 1;
        }
        if(d != 0)
            return (void*)(text + k + i + 1);
        if(text[k+i+2] == patt[0])
            shift = i + 2;
        k += shift;
    }
    return (void*)NULL;
}
 
 
#include
#include "strmatch.h"
main()
{
    struct timeval start, end;
    gettimeofday(&start, 0);
    strmatch str = "hello";
    char text[] = "aaaehellhelloworld";
    char* p = (char*)str.search(text, strlen(text));
    if(p != NULL) printf("%s\n", p);
    gettimeofday(&end, 0);
    long ltime = (end.tv_sec - start.tv_sec) * 1000000 +
                    end.tv_usec - start.tv_usec;
    printf("ltime: %ld\n", ltime);
}
 
 
 
Backward Nondeterministic Dawg Matching algorithm

The Backward Nondeterministic Dawg Matching (BNDM) algorithm uses the same search approach as BDM, but the factor is searched using bit-parallelism. Compared to the original BDM algorithm, BNDM is simpler, uses less memory, has more locality of reference, and is easier to extend to more complex patterns ().

The idea is to maintain a set of positions on the reverse pattern that are the beginning positions of the string u read in the text. This set is stored with 0 and 1 as with Shift-And. The number 1, representing an active state at position j of p, means that the factor pjpj+u∣−1 is equal to u. shows this relationship. If the pattern is of size less that w, then the set fits in a computer word D = dmd1.


Figure 2.15: Bit-parallel factor search. The table D keeps a list of the positions in p where the factor u begins.

We need to update the array D to D after reading a new character σ of the text. A state j of D is active if it corresponds to the beginning of the string σu in the pattern; that is, if

  • u began at position j + 1 in the pattern, which means that the (j + 1)-th position in D is active, and

  • σ is in position j in the pattern.

If we precompute a table B exactly as for Shift-And that associates to each letter of p the set of its positions in p with a bit mask, then we obtain D from D by the following formula:

Image from book

However, there is a problem with the initialization. We would like to mark in the initial table D that each position of D matches the empty string, which means D should be 1m. But in that case, the first shift will give (D << 1) = 1m10 and we will miss the first factor, which corresponds to the entire word. The simplest solution would be to take D of size m + 1, initialized to 1m+1. However, it reduces to w 1 the maximum length of the string that can be searched. Instead we split formula (2.2) into two parts.

We first perform the operation D1 D & B[σ] and verify the match, and then we perform the register shift D D1 << 1. The initialization is then D = 1m. A string read in the text is a prefix of p if the first position is active, that is, if in D1 the position dm is active.

The BNDM algorithm is the same as BDM, except that the factor search is done with the bit-parallelism technique. Each time the bit dm is active, the position of the window is stored in the variable last. Pseudo-code for the algorithm is given in .

Image from book
BNDM (p = p1p2...pm, T = t1t2...tn)
 1.     Preprocessing
 2.         For c  Σ Do B[c]  0m
 3.         For j  1  m Do B[pj]  B[pj] | 0j-1 10m-j
4.      Searching
5.          pos  0
6.          While pos  n - m Do
7.                j  m, last  m
8.                D  1m
9.                While D  0m Do
10.                     D  D & B[tpos+j]
11.                     j  j - 1
12.                     If D & 10m-1  om Then
13.                           If j > 0 Then last  j
14.                           Else report an occurrence at pos + 1
15.                     End of if
16.                     D  D << 1
17.               End of while
18.               pod  pos + last
19.         End of while
Image from book

Figure 2.16: Bit-parallel pseudo-code for BNDM.

BNDM has the same worst-case complexity O(mn) as BDM, and also the same optimal average complexity O(n log∣Σ∣ m/m).

From an automaton point of view, the bit-parallel factor search is a simulation of a nondeterministic automaton that recognizes all suffixes of the reverse pattern. For example, if we search the pattern "announce", we simulate the automaton shown in . It turns out that the minimal deterministic version of this automaton is the suffix automaton used in the classic BDM. The difference between BNDM and BDM is conceptually the same as that between Shift-Or and KMP. The former simulates a non-deterministic automaton using bit-parallelism, and the latter first obtains a representation of the deterministic automaton.


Figure 2.17: Nondeterministic automaton recognizing all factors of the reverse string of "announce".

Example of BNDM using English We search for the string "announce" in the text "CPM_annual_conference_announce".

Image from book
  1. Image from book al_conference_announcelast 8

    Image from book
  2. CPM_annu Image from book rence_announce last 8

    Image from book
  3. CPM_annual_confe Image from book nounce last 8

    Image from book

    The position d8 is active, but j > 0, so we set last 6.

  4. CPM_annual_conference_ Image from book last 8

    Image from book

    The position d8 is active and j = 0, so we mark an occurrence.

Example of BNDM using DNA We search for the string ATATA in the sequence AGATACGATATATAC.

Image from book
  1. Image from book CGATATATAC last 5

    Image from book

    The position d5 is active, but j > 0, so we set last 2.

    Image from book
  2. AG Image from book ATATATAC last 5

    Image from book
  3. AGATACG Image from book TAC last 5

    Image from book

    The position d5 is active, but j > 0, so we set last 2.

    Image from book

    The position d5 is active and j = 0, so we mark an occurrence.

  4. AGATACGAT Image from book C last 5

    Image from book

    The position d5 is active, but j > 0, so we set last 2.

    Image from book

    The position d5 is active and j = 0, so we mark a new occurrence. We then shift to pos + last and pos > n m, so the search stops.

阅读(1782) | 评论(0) | 转发(0) |
0

上一篇:内存池STL代码摘要

下一篇:快速排序算法

给主人留下些什么吧!~~