Chinaunix首页 | 论坛 | 博客
  • 博客访问: 209452
  • 博文数量: 35
  • 博客积分: 2691
  • 博客等级: 少校
  • 技术积分: 527
  • 用 户 组: 普通用户
  • 注册时间: 2008-01-11 09:42
文章分类

全部博文(35)

文章存档

2012年(5)

2010年(6)

2009年(2)

2008年(22)

我的朋友

分类: C/C++

2008-03-03 10:44:25

//mpmatch.h
 
/*
 * A class for multi-pattern matching use the
 *  Multiple BNDM algorithm
 *
 *  2007/2/29
 *
 *  By Yao Jianming    
 */
#include
#include
class mpmatch {
public:
    typedef std::pair pattern_type;
    typedef int (*callback_type)(size_t, const void*, void*);
    void add_pattern(size_t id , const std::string& pattern);
    bool prepare_patterns();
    int search(const void* haystack,
                size_t haystack_len,
                void* result,
                callback_type report_func = NULL) const;
private:
    std::vector _patterns;
    int _min_pattern_len;
    int _mask_table[256];
    int _hash_table[256];
    int _hash_n;
    void preprocess();
};
 
 
//mpmatch.c
 
#include "mpmatch.h"
void mpmatch::add_pattern(size_t id, const std::string& pattern)
{
    if(_patterns.size() == 0)
        _min_pattern_len = pattern.size();
    else if(_min_pattern_len > pattern.size())
        _min_pattern_len = pattern.size();
    _patterns.push_back(make_pair(id, pattern));
}
bool mpmatch::prepare_patterns()
{
    if((_min_pattern_len*_patterns.size()) > (sizeof(long)*8)
            || _min_pattern_len < 2
            || _patterns.size() <= 0)
        return false;
    _hash_n = 0;
    preprocess();
    return true;
}
void mpmatch::preprocess()
{
    for(int i=0; i<256; ++i) {
        _mask_table[i] = 0;
    }
    std::vector::const_iterator end = _patterns.end();
    std::vector::const_iterator it = _patterns.begin();
    int s = 1;
    for(; it != end; ++it) {
        const char* raw_ptr = it->second.c_str() + _min_pattern_len - 1;
        const char* raw_ptr_beg = it->second.c_str();
        while(raw_ptr > raw_ptr_beg) {
            _mask_table[*raw_ptr] |= s;
            s <<= 1; --raw_ptr;
        }
        _mask_table[*raw_ptr] |= s;
        _hash_table[_hash_n++] = s;
    }
}
int mpmatch::search(
        const void* haystack,
        size_t haystack_len,
        void* result,
        callback_type report_func) const
{
    const char* text_ptr = (const char*)haystack;
    const char* text_ptr_end = (const char*)haystack + haystack_len;
    while(text_ptr <= text_ptr_end - _min_pattern_len) {
        const char* raw_ptr = text_ptr + _min_pattern_len - 1;
        int d = ~0;
        while(raw_ptr >= text_ptr) {
            d &= _mask_table[*raw_ptr];
            if(d == 0 || raw_ptr == text_ptr)
                break;
            d <<= 1; --raw_ptr;
        }
        if(d == 0) {
            text_ptr += raw_ptr - text_ptr + 1;
        } else {
            for(int i = 0; i < _hash_n; ++i) {
                if((d & _hash_table[i]) != 0) {
                    const char* patt_ptr = _patterns[i].second.c_str();
                    int patt_len = _patterns[i].second.size();
                    const char* p = text_ptr;
                    while(--patt_len != 0 && *patt_ptr == *p) {
                        ++patt_ptr; ++p;
                    }
                    if(patt_len == 0) {
                        int nRtn = (*report_func)(_patterns[i].first,
                                text_ptr,
                                result);
                        if(nRtn != 0) return nRtn;
                    }
                }
            }
            text_ptr += 1;
        }
    }
    return 0;
}
 
//test.c
#include "mpmatch.h"
int matchkey(size_t id, const void* ptr, void* result)
{
    printf("%d\n", id);
    return 0;
}
int main()
{
    mpmatch example;
    example.add_pattern(1, "hello");
    example.add_pattern(2, "world");
    bool flag = example.prepare_patterns();
    if(flag == false) return -1;
    char buf[1024] = "aaahello worlddfdfdhello";
    example.search(buf, strlen(buf), NULL, matchkey);
    return 0;
}
 
 
 Multiple BNDM algorithm

The use of bit-parallelism to search a set of strings P = {p1,,pr} is efficient for sets such that r × min fits in a few computer words [NR00]. For simplicity we assume below that r × min w.

To perform longer shifts, we keep only the prefixes of size min of the patterns. If we match a prefix, we directly verify the entire string against the text.

The prefixes are packed together as in and the search is similar to BNDM (), with the search performed for all the prefixes at the same time. The only difference is that we need to clear some bits after a shift. The mask CL in does that. It prevents the bits used to search for pi from interfering with those used for pi+1. The variable last is still used, but in this case it represents the position where a prefix of one of the strings begins. Pseudo-code of the whole algorithm is shown in .


Figure 3.18: Multiple BNDM algorithm. The total r × min has to fit in w. The notation prefmin(Pi) denotes the prefix of size min of pi.
Image from book
Multiple BNDM (p = p1p2  pm, T = t1t2  tn)
1.      Preprocessing
2.          For c  σ Do B[c]  0|P|
3.            0
4.          For k  1  r Do
5.                 + min
6.              For j  1  min Do B[pkj)  B[pkj] | 0|p|-+j-1 10-j
7.          End of for
8.          CL  (1min - 1 0)r
9.          DF  (10min - 1)r
10.    Searching
11.         pos  0
12.         While pos  n - m Do
13.                j  min, last  min
14.                D = 1|P|
15.                While D  0|P| Do
16.                      D  D & B[tpos + j]
17.                      j  j - 1
18.                      If D & DF  0|P| Then
19.                      If j > 0 Then last  j
20.                      Else        /* at least one prefix matches */
21.                          Check which prefixes of length min match
                                 pi needs to be checked if
                                        D & 0|P| - min × i 10min - 1 0min × (i - 1)  0|P|
22.                          Verify the corresponding string(s) against the test
23.                          Report the occurrence(s) at pos + 1
24.                      End of if
25.                  End of if
26.                  D  (D << 1) & CL       /* Shifting and cleaning */
27.             End of while
28.             pos  pos + last
29.         End of while
Image from book

Figure 3.19: Bit-parallel code for the Multiple BNDM algorithm.

Example using English We search the text "CPM_annual_conference_announce" for the set of strings P = {announce, annual, annually}.

Image from book
  1. Image from book nual_conference_announce
    last 6.
    D 111111111111111111
    Image from book
    D & DF 0P and j > 0, then last 4
    Image from book

  2. CPM_ Image from book _conference_announce
    last 6.
    D 111111111111111
    Image from book

    D & DF 0P and j = 0, so we check the patterns "annual" and "annually" against the text and mark the occurrence of "annual".

  3. CPM_annual Image from bookrence_announce
    last 6.
    D 111111111111111111
    Image from book

  4. CPM_annual_confeImage from bookannounce
    last 6.
    D 111111111111111111
    Image from book

  5. CPM_annual_conference_ Image from book ce
    last 6.
    D 111111111111111111
    Image from book

    D & DF 0P and j = 0, so we check the string "announce" against the text and mark its occurrence.

    The next shift of the search window gives pos > nmin and the search stops.

Example using DNA We search for the set of strings P = {ATATATA, TATAT, ACGATAT} in the text AGATACGATATATAC.

Image from book
DF = 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0
CL = 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0
  1. Image from book CGATATATAC
    last 5.
    D 111111111111111111
    Image from book
    D & DF 0P and j > 0, then last 4
    Image from book
    D & DF 0P and j > 0, then last 3
    Image from book
    D & DF 0P and j > 0, then last 2
    Image from book

  2. AG Image from book ATATATAC
    last 5.
    D 111111111111111111
    Image from book
    D & DF 0P and j > 0, then last 2
    Image from book

  3. AGAT Image from book ATATAC
    last 5.
    D 111111111111111111
    Image from book
    D & DF 0P and j > 0, then last 4
    Image from book
    D & DF 0P and j > 0, then last 3
    Image from book
    D & DF 0P and j = 0, so we check the pattern ACGATAT against the text and mark its occurrence.

  4. AGATACG Image from book TAC
    last 5.
    D 111111111111111111
    Image from book
    D & DF 0P and j > 0, then last 4
    Image from book
    D & DF 0P and j > 0, then last 3
    Image from book
    D & DF 0P and j > 0, then last 2
    Image from book
    D & DF 0P and j > 0, then last 1
    Image from book
    D & DF 0P and j = 0, so we check the string ATATATA against the text and mark its occurrence.

  5. AGATACGA Image from book AC
    last 5.
    D 111111111111111111
    Image from book
    D & DF 0P and j > 0, then last 4
    Image from book
    D & DF 0P and j > 0, then last 3
    Image from book
    D & DF 0P and j > 0, then last 2
    Image from book
    D & DF 0P and j > 0, then last 1
    Image from book
    D & DF 0P and j = 0, so we check the string TATAT against the text and mark its occurrence.

  6. AGATACGAT Image from book C
    last 5.
    D 111111111111111111
    Image from book
    D & DF 0P and j > 0, then last 4
    Image from book
    D & DF 0P and j > 0, then last 3
    Image from book
    D & DF 0P and j > 0, then last 2
    Image from book
    D & DF 0P and j > 0, then last 1
    Image from book
    D & DF 0P and j = 0, so we check the string ATATATA against the text, but we fail to recognize an occurrence.

  7. AGATACGATA Image from book last 5.
    D 111111111111111111
    Image from book
    D & DF 0P and j > 0, then last 3
    Image from book

阅读(1418) | 评论(0) | 转发(0) |
0

上一篇:没有了

下一篇:Extending Shift-And的实现

给主人留下些什么吧!~~