Chinaunix首页 | 论坛 | 博客
  • 博客访问: 6611
  • 博文数量: 4
  • 博客积分: 120
  • 博客等级: 入伍新兵
  • 技术积分: 50
  • 用 户 组: 普通用户
  • 注册时间: 2008-11-03 17:32







2009-03-23 21:24:12

The best way to describe the Ratcliff/Obershelp pattern-matching algorithm, in using conventional computer terminology, is as a wild-card search that doesn't require wild cards. Instead, the algorithm creates its own wildcards, based on the closest matches found between the strings. Specifically, the algorithm works by examining two strings passed to it and locating the largest group of characters in common. The algorithm uses this group of characters as an anchor between the two strings. The algorithm then places any group of characters found to the left or the right of this anchor on a stack for further examination. This procedure is repeated for all substrings on the stack until there is nothing left to examine. The algorithm calculates the score returned as twice the number of characters found in common divided by the total number of characters in the two strings; the score is returned as an integer, reflecting a percentage match.

For example, suppose you want to compare the similarity between the word `Pennsylvania' and a mangled spelling as `Pencilvaneya.' The largest common group of characters that the algorithm would find is `lvan.' The two sub-groups remaining to the left are `Pennsy' and `Penci,' and to the right are `ia' and`eya.' The algorithm places both of these string sections on the stack to be examined, and advances the current score to eight, two times the number of characters found in common. The substrings `ia' and `eya' are next to come off of the stack and are then examined. The algorithm finds one character in common: a. The score is advanced to ten. The substrings to the left---'i' and `ey'---are placed on the stack, but then are immediately removed and determined to contain no character in common. Next, the algorithm pulls `Pennsy' and `Penci' off of the stack. The largest common substring found is `Pen.' The algorithm advances the score by 6 so that it is now 16. There is nothing to the left of `Pen,' but to the right are the substrings `nsy' and `ci,' which are pushed onto the stack. When the algorithm pulls off `nsy' and `ci' next, it finds no characters in common. The stack is now empty and the algorith ready to return the similarity value found. There was a score of 16 out of a total of 24. This result means that the two strings were 67 percent alike.

阅读(420) | 评论(0) | 转发(0) |