2006-12-14 22:06:06
Installing FuzzyOCR 2.3b
The FuzzyOCR Plugin for SpamAssassin improves somewhat upon the standard OCR Plugin, in that it is capable of performing "fuzzy" matching of text strings. This makes it able to handle the innate inaccuracies of OCR engines, spelling mistakes, and deliberate obfuscation of words by spammers, without having to write a lot of explicit regular expression patterns to catch these variations.
This plugin does everything that the original plugin does, minus the detection of malformed JPEGs and PNGs. It does detect malformed GIFs in this version, though, and presumably it will eventually detect malformed images of other types as well.
In addition, this version offers the ability to run multiple OCR scans of the same image at different resolutions and tolerances for more thorough analysis, and a local database to cache hashes of images recognized as spam so that the resource-expensive OCR process can be avoided in the future for images that have already been seen.
The FuzzyOCR plugin is in very active development, so newer versions may also exist. If you're feeling particularly conservative, you may wish to run the older 2.1c release.
Note for Gentoo users:
An ebuild for Gentoo is available that covers the installation steps described in this document, including patched versions of gocr, giftext, and FuzzyOcr.pm. Download the fuzzyocr-gentoo-2.tar.bz2 package and unpack it into /usr/local, then enable the overlay if necessary in /etc/make.conf:
PORTDIR_OVERLAY="/usr/local/portage"
From there, install the mail-filter/spamassassin-fuzzyocr package. If the install fails due to a digest mismatch, this just means the FuzzyOCR plugin author has updated the 2.3b tarball without changing the version number. To correct this if it happens, do this:
cd /usr/local/portage/mail-filter/spamassassin-fuzzyocr
ebuild spamassassin-fuzzyocr-2.3b.ebuild digest
Netpbm
1. Install the Netpbm tools and libraries:
The first thing you'll need is a set of image manipulation tools, provided by the popular Netpbm package. If you're not downloading the full source code, you'll at least require the binaries themselves, as well as the libraries and header files. These packages might be called netpbm-progs and netpbm-devel, or libnetpbm and netpbm, or something similar, depending on your distribution.
ImageMagick
2. Install the ImageMagick suite:
The FuzzyOCR plugin uses the convert utility from the ImageMagick suite to unpack animated GIF images that spammers use to try to confuse ordinary OCR tools. Without it, only the first frame of an animated GIF would be scanned, and clever spammers simply leave the first frame blank to exploit this. Your favourite distribution most likely has a binary package available, possibly called ImageMagick or imagemagick.
Libungif
3. Install the Libungif tools and libraries:
Next you'll want to install the Libungif library and its associated tools. Specifically, it's the giffix and giftext utilities you want, in order to be able to "fix" prematurely truncated GIF images, since spammers aren't known for providing well-formed images. While this library is often available in a binary package, you'll want the source package in this case, since there's a small patch to apply to it in the next step.
4. Patch the Libungif source code:
In order to harden the giftext utility against a particular exploit that can cause it to crash, a small patch is required. Download the patch and apply it to the Libungif source code as follows:
cd util
patch -p0 < giftext-segfault.patch
5. Compile and install Libungif:
Once the patch has been applied, build the Libungif library and utilities:
./configure --prefix=/usr
make
make install
String::Approx
6. Install the String::Approx perl module:
The String::Approx perl module provides "fuzzy" matching for text strings, which is helpful for detecting misspelled words, words that an OCR engine misreads, and words that spammers intentionally obfuscate. For instance, '1' and 'l' look very similar to an OCR engine, so a word like "email" could be seen as "emai1" by mistake, but with fuzzy matching the two words would be seen as equivalent. You can get this module from your favourite distribution's repository, or directly from CPAN.
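To make the idea concrete, here is a small sketch that counts the edits separating two words and compares the count to a tolerance. It is an illustration only: it computes plain Levenshtein distance in awk, the function names edit_distance and fuzzy_equal are made up for this example, and treating the tolerance as a fraction of the word's length is an assumption. String::Approx has its own algorithm and interface, which this does not reproduce.

```sh
# Print the Levenshtein edit distance between two words.
# Illustration only; this is not how String::Approx is implemented.
edit_distance() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        la = length(a); lb = length(b)
        for (j = 0; j <= lb; j++) prev[j] = j
        for (i = 1; i <= la; i++) {
            cur[0] = i
            for (j = 1; j <= lb; j++) {
                cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
                best = prev[j] + 1                                    # deletion
                if (cur[j-1] + 1 < best) best = cur[j-1] + 1          # insertion
                if (prev[j-1] + cost < best) best = prev[j-1] + cost  # substitution
                cur[j] = best
            }
            for (j = 0; j <= lb; j++) prev[j] = cur[j]
        }
        print prev[lb]
    }'
}

# Succeed when the distance is within TOLERANCE * length of the target word.
# Usage: fuzzy_equal TARGET CANDIDATE TOLERANCE
fuzzy_equal() {
    d=$(edit_distance "$1" "$2")
    awk -v d="$d" -v w="$1" -v t="$3" 'BEGIN { exit !(d <= t * length(w)) }'
}
```

With these helpers, fuzzy_equal email emai1 0.3 succeeds because one substitution out of five characters is within the tolerance, while the same comparison with a tolerance of 0 fails.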
GOCR
7. Download the GOCR source code:
The OCR process is handled by GOCR. While there are binary packages available for a number of distributions, you'll want the source package in this case, because there's a small patch you need to apply.
8. Patch the GOCR source code:
Some grey images have been known to trigger segmentation faults in GOCR 0.40, so a small patch has been devised to fix this vulnerability. Once again, this is not much of an issue in most normal OCR environments, since choking on an input image doesn't usually have serious consequences, but in a spam-filtering environment we need to be more graceful in how we handle such situations.
Once you've downloaded the GOCR 0.40 source code package and unpacked it, go to the src subdirectory and apply the patch to the pgm2asc.c file:
cd src
patch -p1 < patch-gocr-segfault
9. Compile and install GOCR:
From there, building GOCR is straightforward:
./configure --prefix=/usr
make
make install
10. Test your OCR setup:
To make sure that everything you've installed so far works, test it with a sample image copied from some spam you've received, preferably an image that contains some text. If it's a GIF image, for instance, run it through giftopnm:
giftopnm image001.gif > image001.pnm
giftopnm: too much input data, ignoring extra...
giftopnm: bogus character 0x00, ignoring
Don't be distressed by the informational messages you receive as a result; remember that spammers aren't always going to supply you with a standards-compliant image to work with. In fact, they often hope that a deliberately-malformed image will break your scanner, or at least register an error that leaves your scanner unsure how to classify it, hoping for the benefit of the doubt.
That's where giffix comes in. Try repairing the same image, and then run it through giftopnm again:
giffix image001.gif > image001-fixed.gif
giftopnm image001-fixed.gif > image001-fixed.pnm
This time the warning messages should be gone.
Now run the output through GOCR:
gocr image001-fixed.pnm
A second or two later, you should see a bunch of text as read from the image (presuming it had any to begin with, of course). There's likely to be some other garbage too, and not all of the words will be properly read--in particular, the OCR software has trouble distinguishing 'r' from 'n', and 'I' from 'l', but for the most part you should be able to recognize the words--and that's good enough for our purposes, especially with our "fuzzy" matching tools.
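The manual test above can be wrapped in a small helper for trying several sample images in a row. This is only a convenience sketch: the function name ocr_gif and the temporary-directory handling are assumptions of this example, and it uses only the giffix, giftopnm and gocr invocations already shown in this step.

```sh
# Repair a GIF, convert it to PNM, and print whatever text GOCR recognises.
# Usage: ocr_gif image001.gif
ocr_gif() {
    src=$1
    tmpdir=$(mktemp -d) || return 1
    # Each stage runs only if the previous one succeeded.
    giffix "$src" > "$tmpdir/fixed.gif" &&
        giftopnm "$tmpdir/fixed.gif" > "$tmpdir/fixed.pnm" &&
        gocr "$tmpdir/fixed.pnm"
    status=$?
    rm -rf "$tmpdir"   # clean up intermediate files either way
    return $status
}
```

Calling ocr_gif on each sample image prints the recognised text, and a non-zero exit status indicates that one of the three tools gave up on the image.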
FuzzyOCR Plugin for SpamAssassin
11. Download and install the FuzzyOCR plugin for SpamAssassin:
Now that you've got the underlying tools installed and working, you can download the FuzzyOCR plugin for SpamAssassin. To install it, unpack the tarball in a temporary subdirectory and copy FuzzyOcr.pm (the plugin itself) and FuzzyOcr.cf (its configuration file) to your SpamAssassin directory, wherever your local.cf file is located (e.g. /etc/mail/spamassassin).
Note: If there's a loadplugin line at the top of FuzzyOcr.cf, delete it; that line belongs elsewhere, as the next step explains.
12. Tell SpamAssassin to load the FuzzyOCR plugin:
Add the following lines to your v310.pre file, so that the plugin gets loaded at startup:
# FuzzyOCR - performs fuzzy Optical Character Recognition on spam images
#
loadplugin FuzzyOcr /etc/mail/spamassassin/FuzzyOcr.pm
loadplugin Mail::SpamAssassin::Timeout
Note that some binary packages of SpamAssassin don't seem to include the Timeout plugin, so if you don't have a Timeout.pm file in your SpamAssassin perl library you may need to download the full SpamAssassin source package for your version and copy the Timeout.pm file from it. If you have to do so, be sure to place the Timeout.pm file in the same place as the rest of your SpamAssassin plugins are found, usually something like /usr/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin.
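One quick way to find out whether the Timeout module is present is to ask perl to load it. The helper name have_perl_module is invented for this sketch; the check relies only on perl's standard -M and -e switches.

```sh
# Succeed if perl can load the named module, fail otherwise.
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

if have_perl_module Mail::SpamAssassin::Timeout; then
    echo "Timeout module found"
else
    echo "Timeout module missing: copy Timeout.pm from the SpamAssassin source package"
fi
```

Once the files are in place, running spamassassin --lint is a reasonable follow-up check, since it reports configuration lines that SpamAssassin cannot parse.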
13. Edit the word list:
Copy the FuzzyOcr.words.sample file to FuzzyOcr.words in your SpamAssassin directory and edit it, adding target words to the default list (or removing some):
# Here we define the words to scan for
# Stock
alert
charts
profit
news::0.2
breaking
symbol
alert
stock
investor
international
company
money::0
million
thousand
buy
price::0.2
trade
target
banking
service
recommendation
# Pills
viagra
cialis
xanax
valium
meridia
zanaflex
levitra
medicine
legal::0.2
penis::0
medication
growth
drugs
pharmacy
prescription
# Misc
click here
software
kunde::0.2
volksbank
sparkasse
Notice that target words can optionally carry a second parameter that specifies how "exact" the match must be. By default, the plugin uses the focr_threshold value (default: 0.3) set in the FuzzyOcr.cf file to determine how loosely it should match words, but for some words this can be too loose, resulting in false positives. You can override this threshold for specific words in the word list by specifying a threshold value after the word itself, separated by ::. For example:
alpha
beta::0
gamma::0.2
In this example, the word alpha is matched with the usual focr_threshold value. The word beta is matched using a threshold of 0, which is essentially an "exact" match, while gamma is matched with a threshold of 0.2.
As a rule of thumb, if you start to see false positives with a particular word, reduce its threshold by a small amount--say in increments of 0.1--until the false positives stop occurring.
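When a word list grows long, it can help to print the threshold that applies to each entry. The sketch below reflects only the format described in this step (one entry per line, an optional ::threshold suffix, # for comments); the function name show_thresholds is made up, and the plugin's own parser may treat edge cases differently.

```sh
# Print each word list entry with the threshold that would apply to it.
# Usage: show_thresholds WORDLIST [DEFAULT_THRESHOLD]
show_thresholds() {
    awk -v def="${2:-0.3}" '
        /^[ \t]*#/ { next }                     # skip comment lines
        /^[ \t]*$/ { next }                     # skip blank lines
        {
            n = split($0, part, "::")
            thr = (n > 1) ? part[2] : def       # per-word value wins over the default
            printf "%s\t%s\n", part[1], thr
        }
    ' "$1"
}
```

Run against the three-line example above, this lists alpha with 0.3, beta with 0 and gamma with 0.2, which makes it easy to spot entries that still rely on the global default.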
14. Edit the FuzzyOcr.cf file:
Logging Options
# Verbosity level (see manual) Attention: Don't set to 0, but to 0.0 for quiet operation. (Default value: 1)
#focr_verbose 1
#
# Logfile (make sure it is writable by the plugin) (Default value: /etc/mail/spamassassin/FuzzyOcr.log)
focr_logfile /etc/mail/spamassassin/FuzzyOcr.log
The plugin logs its activities to focr_logfile, which by default is /etc/mail/spamassassin/FuzzyOcr.log. This file must be writable by your amavis/maia user. Three verbosity levels are supported (via focr_verbose):
0.0 : Quiet mode.
1 : All words and their corresponding measured distance ("fuzz"), e.g.
6.0 FUZZY_OCR BODY: Mail contains an image with common spam text inside
Words found:
"viagra" with fuzz of 0.2
"cialis" with fuzz of 0
"viagra" with fuzz of 0.2
"levitra" with fuzz of 0
(4 word occurrences found)
2 : Additional debugging information.
Word Lists
# Here we define the words to scan for (Default value: /etc/mail/spamassassin/FuzzyOcr.words)
focr_global_wordlist /etc/mail/spamassassin/FuzzyOcr.words
#
# This is the path RELATIVE to the respective home directory for the personalized list
# This list is merged with the global word list on execution (Default value: .spamassassin/fuzzyocr.words)
#focr_personal_wordlist .spamassassin/fuzzyocr.words
You can specify a global list of target words in a text file with one word per line (as explained in Step 13, above). The default for this file is /etc/mail/spamassassin/FuzzyOcr.words, but you can change this to point to any file you like by editing focr_global_wordlist.
While this version of the plugin also offers per-user word lists (with the focr_personal_wordlist setting), these are of no use in a Maia Mailguard context, where there's only one SpamAssassin user (i.e. your amavis/maia user).
SpamAssassin Version
# Set this to 1 if you are running a version < 3.1.4.
# This will disable a function used in conjunction with animated gifs that isn't available in earlier versions (Default value: 0.0)
#focr_pre314 0.0
The plugin will work with SpamAssassin versions 3.1 and later, but for best performance you should be using version 3.1.4 or later, which includes better support for dealing with animated GIFs. If you're using an earlier version of SpamAssassin, set focr_pre314 to 1 to use a less-efficient (but more compatible) alternative.
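If you are unsure which value applies to your installation, the comparison can be scripted. The sketch below maps a version string to the focr_pre314 value suggested by the rule above; pre314_setting is a hypothetical helper name, and it assumes a plain three-part version number such as 3.1.3.

```sh
# Print the focr_pre314 value suggested for a given SpamAssassin version.
# Usage: pre314_setting 3.1.3
pre314_setting() {
    echo "$1" | awk -F. '{
        v = $1 * 10000 + $2 * 100 + $3           # e.g. 3.1.4 becomes 30104
        print ((v < 30104) ? "1" : "0.0")
    }'
}
```

The version itself can be read from the output of spamassassin --version; on the installations this was written against, the number is the third word of the first line, but treat that layout as an assumption and check it by eye.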
Path Settings
#focr_bin_giffix /usr/bin/giffix
#focr_bin_giftext /usr/bin/giftext
#focr_bin_gifasm /usr/bin/gifasm
#focr_bin_gifinter /usr/bin/gifinter
#focr_bin_giftopnm /usr/bin/giftopnm
#focr_bin_jpegtopnm /usr/bin/jpegtopnm
#focr_bin_pngtopnm /usr/bin/pngtopnm
#focr_bin_ppmhist /usr/bin/ppmhist
#focr_bin_convert /usr/bin/convert
#focr_bin_identify /usr/bin/identify
#focr_bin_gocr /usr/bin/gocr
By default, the plugin expects to find all of the utilities for Libungif (giffix, giftext, gifasm, gifinter), Netpbm (giftopnm, jpegtopnm, pngtopnm, ppmhist), ImageMagick (convert, identify), and GOCR (gocr) in /usr/bin. If you've installed these files elsewhere, you'll want to override the path settings for them here so the plugin can find them.
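Before editing the path settings, it can help to confirm which of these helpers are actually present. The sketch below checks one directory for the eleven programs named above; missing_tools is a hypothetical helper name, and /usr/bin is simply the default location mentioned in this section.

```sh
# Print the name of every expected helper that is not executable in DIR.
# Usage: missing_tools [DIR]
missing_tools() {
    dir=${1:-/usr/bin}
    for tool in giffix giftext gifasm gifinter \
                giftopnm jpegtopnm pngtopnm ppmhist \
                convert identify gocr; do
        [ -x "$dir/$tool" ] || echo "$tool"
    done
}
```

An empty result means the defaults can stay commented out. Any name that is printed needs either an install or a matching focr_bin_ line pointing at its real location.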
Scan Sets
##### Scansets, comma separated (Default value: $gocr -i -, $gocr -l 180 -d 2 -i -) #####
# Each scanset consists of one or more commands which make text out of pnm input.
# Each scanset is run separately on the PNM data, results are combined in scoring.
#focr_scansets $gocr -i -, $gocr -l 180 -d 2 -i -
#
# To use only one scan with default values, uncomment the next line instead
#focr_scansets $gocr -i -
#
# Some examples for more advanced sets
# This one first uses the standard scan, then a scanset which first reduces the image to 3 colors and then scans it with custom settings
# and then it scans again only with these custom settings
# NOTE: This is for advanced users only, if you have questions how to use this, ask on the ML or on IRC
#focr_scansets $gocr -i -, pnmnorm 2>$errfile | pnmquant 3 2>>$errfile | pnmnorm 2>>$errfile | $gocr -l 180 -d 2 -i -, $gocr -l 180 -d 2 -i -
This version of the plugin lets you perform multiple OCR scans on the same image, using different scan resolutions and tolerances. This more thorough approach results in better word-matches, at the cost of more processing time. This is particularly useful for handling images that use odd combinations of foreground and background colours (e.g. white text on coloured backgrounds), lines, dots and other "noise" patterns intended to throw off OCR engines. Scanning images with just one resolution and tolerance setting is necessarily a compromise--you end up choosing settings that work well for most images, but a lot of the edge cases slip through. By running the OCR routines two or three times with different scanning parameters, the plugin can catch more of those edge cases.
The focr_scansets setting lets you specify the command-line options to GOCR and other utilities for one or more scan sets, separated by commas.
If you're just interested in doing a single scan at the default resolution and tolerances, you can still do so by specifying just one scan set:
focr_scansets $gocr -i -
The default, however, is to run two scan sets--one at the default resolution and tolerances, the second at a grey level of 180 and dust size of 2 pixels:
focr_scansets $gocr -i -, $gocr -l 180 -d 2 -i -
As the commented "advanced" example illustrates, you can specify any command-line you like as a scanset, not just GOCR commands. You can chain image-manipulation tools together as desired, as long as the chain begins with the image in PNM format and ends with a call to GOCR.
If you decide to experiment with command-line options and tool-chains, the plugin's author offers the following advice:
pnmnorm, pnminvert and pnmquant are useful with white text or text with many colors.
If you get images which are littered with small dots/lines, try -d 2 as an argument to gocr.
The -l setting often helps, try values like 180, 140, or 100.
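A candidate scan set can be tried by hand before it goes into FuzzyOcr.cf. The sketch below runs GOCR on one PNM file at several grey levels so the results can be compared side by side; try_grey_levels is an invented helper name, and the gocr options are the same -l, -d and -i - forms used in the scan sets above.

```sh
# Run GOCR on one PNM file at several grey levels and label each result.
# Usage: try_grey_levels image001-fixed.pnm [LEVEL...]
try_grey_levels() {
    pnm=$1
    shift
    [ $# -gt 0 ] || set -- 180 140 100   # the values suggested in the advice above
    for level in "$@"; do
        echo "== gocr -l $level -d 2 =="
        gocr -l "$level" -d 2 -i - < "$pnm"   # "-i -" reads the image from stdin
    done
}
```

Whichever level gives the most readable text for your problem images is a reasonable candidate for an extra entry in focr_scansets.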
Miscellaneous Settings
# Timeout for the plugin, in seconds. (Maximum runtime of the plugin) (Default value: 10)
#focr_timeout 10
#
# Default detection threshold (see manual) (Default value: 0.3) (Can be changed on a per word basis in the wordlist).
#focr_threshold 0.3
#
# This is the score for a hit after focr_counts_required matches
#focr_base_score 4
#
# This is the additional score for every additional match after focr_counts_required matches (Default value: 1)
#focr_add_score 1
#
# This is the score to give for a wrong content-type (e.g. JPEG image but content type says GIF) (Default value: 1.5)
#focr_wrongctype_score 1.5
#
# This is the score to give for a corrupted image (This currently affects only GIF images) (Default value: 2.5)
#focr_corrupt_score 2.5
#
# This is the score to give for a corrupted unfixable image (This currently affects only GIF images) (Default value: 5)
#focr_corrupt_unfixable_score 5
#
# This is used to disable the OCR engine if the message has already more points than this value (Default value: 10)
#focr_autodisable_score 10
#
# Number of minimum matches before the rule scores (Default value: 2)
#focr_counts_required 2
#
# Specifies, how many frames an animated gif must contain, so the second (less resource consuming) animated gif test is used. (Default value: 5)
#focr_gif_max_frames 5
The OCR process can take time, especially if you're running multiple scan sets. With the focr_timeout setting, however, you can set an upper bound (in seconds) on how much time the plugin will spend before returning its results, if any (default: 10).
The focr_threshold setting determines how "fuzzy" a word match is allowed to be. Higher settings result in more matches, but also more false positives; lower settings result in fewer matches, and more false negatives. Finding the right tolerance is the key, and it may vary from word to word, depending on how good the OCR engine is. As explained above in Step 13, you can specify this tolerance on a per-word basis in the word list as well.
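Read together, the comments on focr_counts_required, focr_base_score and focr_add_score in the settings block suggest the arithmetic sketched below. This is a reading of those comments rather than a copy of the plugin's code, and focr_score is a name invented for the example; the defaults are the ones shown in the settings block.

```sh
# Estimate the word-match score for a given number of matches.
# Usage: focr_score MATCHES [COUNTS_REQUIRED] [BASE_SCORE] [ADD_SCORE]
focr_score() {
    awk -v m="$1" -v req="${2:-2}" -v base="${3:-4}" -v add="${4:-1}" 'BEGIN {
        if (m < req) print 0                       # too few matches: the rule does not score
        else         print base + (m - req) * add  # base hit plus each extra match
    }'
}
```

With the defaults, focr_score 4 prints 6, which agrees with the 6.0 reported for four word occurrences in the verbosity level 1 example earlier in this step.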
It's usefu