C++字符串匹配算法-lqk188-ChinaUnix博客

大狗狗的技术博客

首页　| 　博文目录　| 　关于我

lqk188

博客访问： 45841
博文数量： 15
博客积分： 0
博客等级：民兵
技术积分： 176
用户组：普通用户
注册时间： 2014-01-31 01:57

个人简介

资深码农

文章分类

全部博文（15）

python（3）
C++算法（11）
未分配的博文（1）

文章存档

2016年（3）

2014年（12）

我的朋友

相关博文

C++字符串匹配算法

分类： C/C++

2014-03-14 15:22:53

      提到字符串匹配算法，大家很快会想到著名的KMP算法，或者BM算法，他们都是前缀匹配和后缀匹配的经典算法，KMP算法的发明者之一就是大名鼎鼎的D.E.Knuth（不认识的请自行百度）。不过这里介绍另一个算法，其中利用的散列的技术，也算顺便复习一下散列的知识，与大家分享。
       先定义问题，设字符串1为模式字符串（pattern string）,字符串2为待搜索的目标字符串，比如字符串1=“USA”, 字符串2="I've never been in USA :("。我们的目的就是要在字符串2中找是否有字符串1出现，在给定的特定case中显然是true。那我们要想怎么找呢，最笨的办法就是蛮力搜素字符串1，从字符串2的第一个字符开始一轮一轮的与字符串进行比较，发现匹配则返回true，程序结束；发现不匹配则返回false，然后指针后移一位重新比较直到找到为止；这种方法时间复杂度为O(MxN),M为字符串1的长度，N为字符串2的长度，效率较低。
让我们换个思路，利用散列帮助我们提高效率。散列的基本思想是：把一个目标对象进行某种转换（借助某个散列函数），得到一个数字（hash value），将得到的数字作为index存入散列表中（一般是array或者vector，访问的时间复杂度为O(1)）。举个例子，把字符串“USA”的三个字符转换成int型然后相加，得到233，这个233就可以是一个hash value，然后存入一个数组中（假设数组大小是1000）。看到这你可能会有个问题，要是有个很长的字符串"UUUUUSSSSSAAAA"，那把它的所有字符加起来会得到一个比较大的值，这个值肯定会超出1000的范围，那怎么办？很简单，把这个值除以1000取余就可以了呗，但这么做就会发生冲突，比如会有两个不同的字符串都对应同样的hash value的情况。好吧，那又不得不提几个冲突处理的办法，最常用的是分离链接法，也就是把散列表的成员定义为链表，发生冲突时就对链表进行一次push_back 操作，C++中的定义形式为vector >；还有其他办法处理冲突，比如线性探测和平方探测以及双散列等，在这里不作介绍了。散列是非常有用的一种技术，可以应用在在线拼写检查，编译器符号表存储等实际场合，对于任何一个软件工程师来说都很值得深入学习。
      好吧，稍微扯远了点，让我们回到字符串匹配这个问题上，针对这个问题，我们不需要多复杂的散列函数操作，仅把一个字符串的所有成员字符相加得到一个整数值作为hash value即可，代码如下，

点击(此处)折叠或打开

for (size_t index = 0; index < patternStringLength; index++)
{
hashValue += theString[index];
}

这样我们可以先计算好"USA"的hash value，定义一个变量hashValueModeString来存储这个值，然后进行查找。在待查找的字符串中（"I've never been in USA :("），我们先计算出前三个字符的和sum，然后和“USA”的hash value进行比较，如果不相等，则更新sum的值（即在待查找字符串中后移一位重新计算sum），这里有一个技巧，更新sum时只需要加待查找字符串的下一个字符（假设该字符的位置为pos），再减去位置为pos-模式字符串长度的字符的值即可，代码如下：

点击(此处)折叠或打开

sum += strForSearch[pos];
sum -= strForSearch[pos-patternStringLength];

这样持续查找直到发现sum的值和hashValueModeString相等即可，一旦发现两个值相等，便可以字符串进行compare操作，也就是精确的比较字符串是否相同。
以上就是大致的算法思路，下面是完整代码，

点击(此处)折叠或打开

#include "stdafx.h"
#include <iostream>
using namespace std;
bool compare(char *p1, char *p2, size_t size)
{
for (size_t i = 0; i < size; i++)
{
if (*p1 != * p2)
return false;
}
return true;
}
#define hash(value) value
int _tmain(int argc, _TCHAR* argv[])
{
char str1[] = "USA"; // pattern string
char str2[] = "I've never been in USA :("; // string to be searched using the pattern string
size_t length1 = strlen(str1);
size_t length2 = strlen(str2);
cout << "Pattern string is " << "\"" << str1 << "\"" << endl;
cout << "to be searched in string " << "\"" << str2 << "\"" << endl;
if (length1 > length2)
{
cout << "pattern string was not found" << endl;
return 0;
}
int sum = 0;
// calculate pattern string's hash value
for (size_t index = 0; index < length1; index++)
{
sum += str1[index];
}
int hashValueModeString = hash(sum);
cout << "hash value is " << hashValueModeString << endl;
sum = 0;
for (size_t index = 0; index < length1; index++)
{
sum += str2[index];
}
bool isFound = false;
size_t index = 0;
for (; index < length2 - length1; index++)
{
if (hashValueModeString == hash(sum))
{
// do compare
if (compare(&str2[index], str1, length1))
{
cout << "Pattern string has been found, which starts from position " << index << endl;
isFound = true;
break;
}
else
{
sum -= str2[index];
sum += str2[index+length1];
}
}
else
{
sum -= str2[index];
sum += str2[index+length1];
}
}
if (!isFound && compare(&str2[index], str1, length1))
{
cout << "Pattern string has been found, which starts from position " << index << endl;
isFound = true;
}
if (!isFound)
cout << "pattern string was not found" << endl;
else
cout << "Note: the index is 0 based!" << endl;
return 0;
}

程序输出为
Pattern string is "USA"
to be searched in string "I've never been in USA :("
hash value is 233 // for pattern string
Pattern string has been found, which starts from position 19
Note: the index is 0 based!

阅读(2470) | 评论(0) | 转发(0) |

上一篇：STL中lower_bound的应用

下一篇：C++ STL中vector的实现

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6