How to use SpamAssassin's FuzzyOCR to analyze images - snowtty - ChinaUnix Blog


Category: LINUX

2006-10-26 16:58:56

This is what I did with my Fedora Core 4 / Amavisd / Postfix / SpamAssassin setup.


--------start-------- 
1. Upgrade SpamAssassin to 3.1.4 or above (I used v3.1.6)
2. Install the requirements
a. netpbm (yum install netpbm-progs) 
b. ImageMagick (yum install ImageMagick) 
c. libungif and libungif-progs (yum install libungif-progs) 
d. perl Digest::MD5, String::Approx 

perl -MCPAN -e shell 
install Digest::MD5 
install String::Approx 

e. ExifTool (yum install perl-Image-ExifTool) 

3. Follow this procedure for patching and installing libungif and gocr
(http://www200.pair.com/mecham/spam/image_spam.html) 

a. libungif 
wget  
tar xzvf libungif-4.1.4.tar.gz 
cd libungif-4.1.4/util 
wget  
patch giftext.c < giftext-segfault.patch 
cd .. 
./configure --prefix=/usr && make && make install 

b. GOCR 
Download, extract, patch, compile and install gocr: 
cd /usr/local/src 
wget  
tar xzvf gocr-0.40.tar.gz 
cd gocr-0.40/src 
wget  
patch pgm2asc.c < patch-gocr-segfault 
cd .. 
./configure --prefix=/usr && make && make install 

c. Install FuzzyOCR
cd /usr/local/src/ 
wget  
tar xzvf fuzzyocr-2.3b.tar.gz 
cd FuzzyOcr-2.3b 

*** We will use a new patch Robert LeBlanc created for this particular version of FuzzyOcr. 

wget  
patch FuzzyOcr.pm < fuzzyocr-23b-hashdb-poison.patch 

*** Then place the files: 

cp FuzzyOcr.pm /etc/mail/spamassassin/ 
cp FuzzyOcr.cf /etc/mail/spamassassin/ 
cp FuzzyOcr.words.sample /etc/mail/spamassassin/FuzzyOcr.words 

d. Test it: follow the procedure on the website

http://www200.pair.com/mecham/spam/image_spam.html 

e. Install the ImageInfo plugin to help reduce false positives

*** Find the SpamAssassin Plugin directory ***
cd /usr/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/Plugin (mine) 
wget  
cd /etc/mail/spamassassin 
wget http://www.rulesemporium.com/plugins/imageinfo.cf 


edit v310.pre 
*** and insert (at the bottom): 

loadplugin Mail::SpamAssassin::Plugin::ImageInfo 

*** Edit imageinfo.cf and lower any scores that are 3.0 or more to half their value. This is to help prevent false positives: 

vi imageinfo.cf 

4. Check the configuration with "spamassassin --lint"

If there are no errors, restart amavisd and your mail server.
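The score-lowering edit to imageinfo.cf in step 3e can be scripted rather than done by hand in vi. The following is a minimal Python sketch, assuming the file uses standard SpamAssassin `score RULE value [value ...]` lines. The function name and the rule names in the usage note are hypothetical placeholders, not real ImageInfo rules.

```python
def halve_high_scores(cf_text, limit=3.0):
    """Halve every score value >= limit on 'score' lines; other lines pass through."""
    out = []
    for line in cf_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "score":
            vals = []
            in_comment = False
            for tok in parts[2:]:
                if tok.startswith("#"):
                    in_comment = True  # never touch numbers inside a trailing comment
                if not in_comment:
                    try:
                        num = float(tok)
                        if num >= limit:
                            tok = f"{num / 2:g}"
                    except ValueError:
                        pass  # leave non-numeric tokens unchanged
                vals.append(tok)
            line = " ".join(parts[:2] + vals)
        out.append(line)
    return "\n".join(out)
```

For example, `halve_high_scores("score EXAMPLE_RULE_A 3.5")` returns `score EXAMPLE_RULE_A 1.75`, while a line scoring 1.2 is left alone. Keep a backup of the original file before writing the result back.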

Another walkthrough:

FuzzyOCR Walkthrough

FreeBSD FuzzyOCR SA Plugin

Required packages are netpbm, gocr, imagemagick, giflib and the String::Approx Perl module.

pkg_add -r netpbm
pkg_add -r gocr
pkg_add -r libungif
pkg_add -r ImageMagick
cd /usr/ports/devel/p5-String-Approx
make install clean
cd /usr/local/etc/mail/spamassassin
fetch 
tar zxvf fuzzyocr-latest.tar.gz
cd FuzzyOcr-version

edit FuzzyOcr.cf and change all "/etc/mail/spamassassin/" to "/usr/local/etc/mail/spamassassin/"
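The path substitution described above can be sketched in Python. This is only an illustration, and the function name `relocate_paths` is hypothetical. The negative lookbehind is a design choice so that running the rewrite twice does not double the /usr/local prefix.

```python
import re

OLD = "/etc/mail/spamassassin/"
NEW = "/usr/local/etc/mail/spamassassin/"

def relocate_paths(cf_text):
    """Rewrite Linux-style SpamAssassin paths to the FreeBSD location."""
    # (?<!/usr/local) skips paths that already carry the FreeBSD prefix.
    return re.sub(r"(?<!/usr/local)" + re.escape(OLD), NEW, cf_text)
```

Applied to the text of FuzzyOcr.cf, this turns `/etc/mail/spamassassin/FuzzyOcr.words` into `/usr/local/etc/mail/spamassassin/FuzzyOcr.words` and leaves already-converted paths unchanged.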

I set my focr_logfile to /var/log/FuzzyOcr.log.

Also edit the FuzzyOcr.pm file: search for "$logfile" and you will find a second line that sets the log file. I pointed it to the same location; I'm not sure why it is set twice.

Now we finish up.

Also in the FuzzyOcr.cf file, change the paths of the "Helper Applications" located around line 41 to the following, unless you installed them to /usr/bin/.

focr_bin_giffix /usr/local/bin/giffix
focr_bin_giftext /usr/local/bin/giftext
focr_bin_gifasm /usr/local/bin/gifasm
focr_bin_gifinter /usr/local/bin/gifinter
focr_bin_giftopnm /usr/local/bin/giftopnm
focr_bin_jpegtopnm /usr/local/bin/jpegtopnm
focr_bin_pngtopnm /usr/local/bin/pngtopnm
focr_bin_ppmhist /usr/local/bin/ppmhist
focr_bin_convert /usr/local/bin/convert
focr_bin_identify /usr/local/bin/identify
focr_bin_gocr /usr/local/bin/gocr

Be sure they are all uncommented.

cp FuzzyOcr.* /usr/local/etc/mail/spamassassin/
cd /usr/local/etc/mail/spamassassin/
mv FuzzyOcr.words.sample FuzzyOcr.words
/usr/local/etc/rc.d/sa-spamd.sh restart

*Notice* 
If you are using w0ls0n's cfupdates script, remove its "rm *.*" line; otherwise your FuzzyOCR configuration files will be deleted.

#* Written By mintee 10/17/2007 *



10. Test your OCR setup:

To make sure that everything you've installed so far works, test it with a sample image copied from some spam you've received, preferably an image that contains some text. If it's a GIF image, for instance, run it through giftopnm:

giftopnm image001.gif > image001.pnm
giftopnm: too much input data, ignoring extra...
giftopnm: bogus character 0x00, ignoring

Don't be distressed by the informational messages you receive as a result; remember that spammers aren't always going to supply you with a standards-compliant image to work with. In fact, they often hope that a deliberately-malformed image will break your scanner, or at least register an error that leaves your scanner unsure how to classify it, hoping for the benefit of the doubt.

That's where giffix comes in. Try repairing the same image, and then run it through giftopnm again:

giffix image001.gif > image001-fixed.gif
giftopnm image001-fixed.gif > image001-fixed.pnm

This time the warning messages should be gone.

Now run the output through GOCR:

gocr image001-fixed.pnm

A second or two later, you should see a bunch of text as read from the image (presuming it had any to begin with, of course). There's likely to be some other garbage too, and not all of the words will be properly read--in particular, the OCR software has trouble distinguishing 'r' from 'n', and 'I' from 'l', but for the most part you should be able to recognize the words--and that's good enough for our purposes, especially with our "fuzzy" matching tools.
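The repair, convert and OCR sequence above can be wrapped in a small script for testing many sample images. The following Python sketch only assembles and runs the same three command lines shown in this step. The function names and the "-fixed" file naming are assumptions made for illustration, and giffix, giftopnm and gocr must already be on the PATH.

```python
import shlex
import subprocess

def ocr_commands(gif_path):
    """Return the shell command lines for the repair -> convert -> OCR sequence."""
    stem = gif_path.rsplit(".", 1)[0]
    fixed = f"{stem}-fixed.gif"
    pnm = f"{stem}-fixed.pnm"
    q = shlex.quote  # protect file names containing spaces or shell characters
    return [
        f"giffix {q(gif_path)} > {q(fixed)}",
        f"giftopnm {q(fixed)} > {q(pnm)}",
        f"gocr {q(pnm)}",
    ]

def run_ocr(gif_path):
    """Run the three commands in order and return whatever text gocr printed."""
    *prepare, recognise = ocr_commands(gif_path)
    for cmd in prepare:
        # Warnings from malformed spam images are expected, so do not abort on them.
        subprocess.run(cmd, shell=True, check=False)
    result = subprocess.run(recognise, shell=True, capture_output=True, text=True)
    return result.stdout
```

Calling `run_ocr("image001.gif")` would leave image001-fixed.gif and image001-fixed.pnm next to the original and return the recognised text.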

FuzzyOCR Plugin for SpamAssassin

11. Download and install the FuzzyOCR plugin for SpamAssassin:

Now that you've got the underlying tools installed and working, you can download the FuzzyOCR plugin for SpamAssassin. To install it, unpack the tarball in a temporary subdirectory and copy FuzzyOcr.pm (the plugin itself) and FuzzyOcr.cf (its configuration file) to your SpamAssassin directory, wherever your local.cf file is located (e.g. /etc/mail/spamassassin).

Note: If there's a loadplugin line at the top of FuzzyOcr.cf, delete it; that line belongs elsewhere, as the next step explains.

12. Tell SpamAssassin to load the FuzzyOCR plugin

Add the following lines to your v310.pre file, so that the plugin gets loaded at startup:

# FuzzyOCR - performs fuzzy Optical Character Recognition on spam images
#
loadplugin FuzzyOcr /etc/mail/spamassassin/FuzzyOcr.pm
loadplugin Mail::SpamAssassin::Timeout

Note that some binary packages of SpamAssassin don't seem to include the Timeout plugin, so if you don't have a Timeout.pm file in your SpamAssassin perl library you may need to download the full SpamAssassin source package for your version and copy the Timeout.pm file from it. If you have to do so, be sure to place the Timeout.pm file in the same place as the rest of your SpamAssassin plugins are found, usually something like /usr/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin.

13. Edit the word list

Copy the FuzzyOcr.words.sample file to FuzzyOcr.words in your SpamAssassin directory and edit it, adding target words to the default list (or removing some):

# Here we define the words to scan for
# Stock
alert
charts
profit
news::0.2
breaking
symbol
alert
stock
investor
international
company
money::0
million
thousand
buy
price::0.2
trade
target
banking
service
recommendation
# Pills
viagra
cialis
xanax
valium
meridia
zanaflex
levitra
medicine
legal::0.2
penis::0
medication
growth
drugs
pharmacy
prescription
# Misc
click here
software
kunde::0.2
volksbank
sparkasse

Notice that target words can optionally carry a second parameter specifying how "exact" the match must be. By default, the plugin uses the focr_threshold value from the FuzzyOcr.cf file (default: 0.3) to determine how loosely it should match words, but for some words this can be too loose, resulting in false positives. You can override the threshold for specific words in the word list by appending a threshold value to the word itself, separated by ::. For example:

alpha
beta::0
gamma::0.2

In this example, the word alpha is matched with the usual focr_threshold value. The word beta is matched using a threshold of 0, which is essentially an "exact" match, while gamma is matched with a threshold of 0.2.

As a rule of thumb, if you start to see false positives with a particular word, reduce its threshold by a small amount--say in increments of 0.1--until the false positives stop occurring.
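To make the word-list format concrete, here is a minimal Python sketch of how such a file could be read into (word, threshold) pairs. It is not the plugin's own parser (the plugin is written in Perl), and the function name `parse_wordlist` is hypothetical. It assumes that lines starting with # are comments, as in the sample list above.

```python
def parse_wordlist(text, default_threshold=0.3):
    """Turn word-list text into (word, threshold) pairs."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments such as "# Pills"
        word, sep, threshold = line.partition("::")
        # Words without an explicit "::value" fall back to the global default.
        entries.append((word.strip(), float(threshold) if sep else default_threshold))
    return entries
```

With the three-word example above, this yields alpha at 0.3, beta at 0.0 and gamma at 0.2. Multi-word entries such as "click here" are kept as a single phrase.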

14. Edit the FuzzyOcr.cf file

Logging Options

# Verbosity level (see manual) Attention: Don't set to 0, but to 0.0 for quiet operation. (Default value: 1)
#focr_verbose 1
#
# Logfile (make sure it is writable by the plugin) (Default value: /etc/mail/spamassassin/FuzzyOcr.log)
focr_logfile /etc/mail/spamassassin/FuzzyOcr.log

The plugin logs its activities to focr_logfile, which by default is /etc/mail/spamassassin/FuzzyOcr.log. This file must be writable by your amavis/maia user. Three verbosity levels are supported (via focr_verbose):

0.0 : Quiet mode.

1 : All words and their corresponding measured distance ("fuzz"), e.g.

6.0 FUZZY_OCR    BODY: Mail contains an image with common spam text inside
                        Words found:
                        "viagra" with fuzz of 0.2
                        "cialis" with fuzz of 0
                        "viagra" with fuzz of 0.2
                        "levitra" with fuzz of 0
                        (4 word occurrences found)

2 : Additional debugging information.

Word Lists

# Here we define the words to scan for (Default value: /etc/mail/spamassassin/FuzzyOcr.words)
focr_global_wordlist /etc/mail/spamassassin/FuzzyOcr.words
#
# This is the path RELATIVE to the respective home directory for the personalized list
# This list is merged with the global word list on execution (Default value: .spamassassin/fuzzyocr.words)
#focr_personal_wordlist .spamassassin/fuzzyocr.words

You can specify a global list of target words in a text file with one word per line (as explained in Step 13, above). The default for this file is /etc/mail/spamassassin/FuzzyOcr.words, but you can change this to point to any file you like by editing focr_global_wordlist.

While this version of the plugin also offers the ability to add per-user word lists (with the focr_personal_wordlist setting), this has no usefulness in a Maia Mailguard context, where there's only one SpamAssassin user (i.e. your amavis/maia user).

SpamAssassin Version

# Set this to 1 if you are running a version < 3.1.4.
# This will disable a function used in conjunction with animated gifs that isn't available in earlier versions (Default value: 0.0)
#focr_pre314 0.0

The plugin will work with SpamAssassin versions 3.1 and later, but for best performance you should be using version 3.1.4 or later, which includes better support for dealing with animated GIFs. If you're using an earlier version of SpamAssassin, set focr_pre314 to 1 to use a less-efficient (but more compatible) alternative.
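Whether focr_pre314 needs to be set comes down to a version comparison. A small Python sketch of that check follows. The function name is hypothetical, and it assumes a purely numeric dotted version string such as "3.1.6" (suffixes like "-rc1" are not handled).

```python
def needs_pre314(version):
    """True if a dotted SpamAssassin version string is older than 3.1.4."""
    parts = tuple(int(p) for p in version.split("."))
    # Tuple comparison is element-wise, so (3, 1, 10) correctly sorts after (3, 1, 4).
    return parts < (3, 1, 4)
```

Comparing integer tuples rather than strings avoids the usual pitfall where "3.1.10" would sort before "3.1.4" alphabetically.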

Path Settings

#focr_bin_giffix /usr/bin/giffix
#focr_bin_giftext /usr/bin/giftext
#focr_bin_gifasm /usr/bin/gifasm
#focr_bin_gifinter /usr/bin/gifinter
#focr_bin_giftopnm /usr/bin/giftopnm
#focr_bin_jpegtopnm /usr/bin/jpegtopnm
#focr_bin_pngtopnm /usr/bin/pngtopnm
#focr_bin_ppmhist /usr/bin/ppmhist
#focr_bin_convert /usr/bin/convert
#focr_bin_identify /usr/bin/identify
#focr_bin_gocr /usr/bin/gocr

By default, the plugin expects to find all of the utilities for Libungif (giffix, giftext, gifasm, gifinter), Netpbm (giftopnm, jpegtopnm, pngtopnm, ppmhist), ImageMagick (convert, identify), and GOCR (gocr) in /usr/bin. If you've installed these files elsewhere, you'll want to override the path settings for them here so the plugin can find them.
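A quick way to catch a wrong helper path before restarting anything is to read the focr_bin_* lines and check that each file exists and is executable. The following Python sketch assumes the settings appear one per line as shown above. Both function names are hypothetical.

```python
import os

def helper_paths(cf_text):
    """Map each uncommented focr_bin_* setting to its configured path."""
    paths = {}
    for raw in cf_text.splitlines():
        line = raw.strip()
        if line.startswith("focr_bin_"):  # commented-out lines start with "#"
            parts = line.split(None, 1)
            if len(parts) == 2:
                paths[parts[0]] = parts[1].strip()
    return paths

def missing_helpers(cf_text):
    """Return the settings whose path is not an executable file."""
    return sorted(
        name for name, path in helper_paths(cf_text).items()
        if not (os.path.isfile(path) and os.access(path, os.X_OK))
    )
```

Commented-out settings are ignored here because the plugin then falls back to its /usr/bin defaults, so only explicit overrides are checked.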

Scan Sets

##### Scansets, comma separated (Default value: $gocr -i -, $gocr -l 180 -d 2 -i -) #####
# Each scanset consists of one or more commands which make text out of pnm input.
# Each scanset is run separately on the PNM data, results are combined in scoring.
#focr_scansets $gocr -i -, $gocr -l 180 -d 2 -i -
#
# To use only one scan with default values, uncomment the next line instead
#focr_scansets $gocr -i -
#
# Some examples for more advanced sets
# This one first uses the standard scan, then a scanset which first reduces the image to 3 colors and then scans it with custom settings
# and then it scans again only with these custom settings
# NOTE: This is for advanced users only, if you have questions how to use this, ask on the ML or on IRC
#focr_scansets $gocr -i -, pnmnorm 2>$errfile | pnmquant 3 2>>$errfile | pnmnorm 2>>$errfile | $gocr -l 180 -d 2 -i -, $gocr -l 180 -d 2 -i -

This version of the plugin lets you perform multiple OCR scans on the same image, using different scan resolutions and tolerances. This more thorough approach results in better word-matches, at the cost of more processing time. This is particularly useful for handling images that use odd combinations of foreground and background colours (e.g. white text on coloured backgrounds), lines, dots and other "noise" patterns intended to throw off OCR engines. Scanning images with just one resolution and tolerance setting is necessarily a compromise--you end up choosing settings that work well for most images, but a lot of the edge cases slip through. By running the OCR routines two or three times with different scanning parameters, the plugin can catch more of those edge cases.

The focr_scansets setting lets you specify the command-line options to GOCR and other utilities for one or more scan sets, separated by commas.

If you're just interested in doing a single scan at the default resolution and tolerances, you can still do so by specifying just one scan set:

focr_scansets $gocr -i -

The default, however, is to run two scan sets--one at the default resolution and tolerances, the second at a grey level of 180 and dust size of 2 pixels:

focr_scansets $gocr -i -, $gocr -l 180 -d 2 -i -

As the commented "advanced" example illustrates, you can specify any command-line you like as a scanset, not just GOCR commands. You can chain image-manipulation tools together as desired, as long as the chain begins with the image in PNM format and ends with a call to GOCR.

If you decide to experiment with command-line options and tool-chains, the plugin's author offers the following advice:

pnmnorm, pnminvert and pnmquant are useful with white text or text with many colors.

If you get images which are littered with small dots/lines, try -d 2 as an argument to gocr.

The -l setting often helps, try values like 180, 140, or 100.
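To see how a focr_scansets value breaks down into individual command lines, here is a simplified Python sketch. It assumes a plain split on commas and a literal substitution of the $gocr placeholder; the plugin's actual parsing may differ. The function name and the default gocr path are illustrative.

```python
def parse_scansets(value, gocr_path="/usr/bin/gocr"):
    """Split a focr_scansets value into one command line per scan set."""
    return [
        part.strip().replace("$gocr", gocr_path)
        for part in value.split(",")
        if part.strip()
    ]
```

For the default value `$gocr -i -, $gocr -l 180 -d 2 -i -` this produces two command lines, one per scan. The commented "advanced" example splits into three, and its other placeholders such as $errfile are left untouched by this sketch.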

Miscellaneous Settings

# Timeout for the plugin, in seconds. (Maximum runtime of the plugin) (Default value: 10)
#focr_timeout 10
#
# Default detection threshold (see manual) (Default value: 0.3) (Can be changed on a per word basis in the wordlist).
#focr_threshold 0.3
#
# This is the score for a hit after focr_counts_required matches
#focr_base_score 4
#
# This is the additional score for every additional match after focr_counts_required matches (Default value: 1)
#focr_add_score 1
#
# This is the score to give for a wrong content-type (e.g. JPEG image but content type says GIF) (Default value: 1.5)
#focr_wrongctype_score 1.5
#
# This is the score to give for a corrupted image (This currently affects only GIF images) (Default value: 2.5)
#focr_corrupt_score 2.5
#
# This is the score to give for a corrupted unfixable image (This currently affects only GIF images) (Default value: 5)
#focr_corrupt_unfixable_score 5
#
# This is used to disable the OCR engine if the message has already more points than this value (Default value: 10)
#focr_autodisable_score 10
#
# Number of minimum matches before the rule scores (Default value: 2)
#focr_counts_required 2
#
# Specifies, how many frames an animated gif must contain, so the second (less resource consuming) animated gif test is used. (Default value: 5)
#focr_gif_max_frames 5

The OCR process can take time, especially if you're running multiple scan sets. With the focr_timeout setting however, you can set an upper bound (in seconds) on how much time the plugin will spend before returning its results, if any (default: 10).

The focr_threshold setting determines how "fuzzy" a word match is allowed to be. Higher settings result in more matches, but also more false positives; lower settings result in fewer matches, and more false negatives. Finding the right tolerance is the key, and it may vary from word to word, depending on how good the OCR engine is. As explained above in Step 13, you can specify this tolerance on a per-word basis in the word list as well.
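The effect of the threshold is easier to see with a toy matcher. The plugin itself relies on the String::Approx Perl module, so the Python sketch below is a stand-in, not the plugin's algorithm. It assumes "fuzz" can be approximated as the Levenshtein edit distance divided by the length of the target word, and compares against whole whitespace-separated tokens only.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_hits(word, ocr_text, threshold=0.3):
    """Return (token, fuzz) for every OCR token within the threshold of word."""
    hits = []
    for token in ocr_text.lower().split():
        fuzz = edit_distance(word, token) / len(word)
        if fuzz <= threshold:
            hits.append((token, round(fuzz, 2)))
    return hits
```

Under these assumptions, an OCR misread such as "v1agra" is one edit away from the six-letter "viagra" (fuzz of about 0.17), so it matches at the default 0.3 but not at a per-word threshold of 0.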

It's useful to understand how the plugin assigns its score value to the FUZZY_OCR rule. The rule is only triggered if there are at least focr_counts_required word matches (default: 2) in the image. At that point, the rule's score becomes focr_base_score + focr_add_score for every additional word match (default: 4 + 1/word after the second match). At default values, then, two matching words would score a total of 4 points; three matching words would score 5 points; four would score 6 points, etc. Feel free to adjust these values to your tastes. Don't forget to uncomment these values if you change them!
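The scoring rule just described can be written out as a short function. This Python sketch only restates the arithmetic from the paragraph above, using the documented defaults. The function name is hypothetical.

```python
def fuzzy_ocr_score(matches, base_score=4.0, add_score=1.0, counts_required=2):
    """Score for FUZZY_OCR given the number of word matches found in an image."""
    if matches < counts_required:
        return 0.0  # the rule does not trigger at all
    return base_score + add_score * (matches - counts_required)
```

With the defaults, two matches give 4 points, three give 5 and four give 6; a single match gives nothing because the rule is not triggered.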

The focr_wrongctype_score setting lets you penalize mail that contains images that claim to be one type but are actually another, such as a GIF that's advertised as a JPEG in the MIME content-type header. focr_corrupt_score similarly penalizes malformed GIF images, and focr_corrupt_unfixable_score penalizes GIF images so badly malformed that they can't be repaired. Eventually perhaps these will penalize malformed images of other types.

The focr_autodisable_score setting is more controversial. In principle it's a way to save some processing cycles by avoiding an OCR scan if there are already enough other rules triggering on the mail to achieve this minimum score (default: 10). The downside is that this mucks with efforts to statistically measure the performance of the OCR-based rules, since there's no longer any guarantee that these rules will be called every time they should be. Upcoming Maia features such as Dynamic Score Balancing will not work properly if this setting is used, so unless you're truly strapped for processor cycles it's advisable to set this value to an unrealistically high value (e.g. 999) to effectively disable it.

When it comes to handling animated GIFs, the plugin can use one of two tools to unpack the frames: ImageMagick's convert or Libungif's gifasm. The convert utility is fast, but for images with a lot of frames the gifasm tool is more efficient. The focr_gif_max_frames setting sets the frame count at which gifasm is used instead of convert (default: 5). If you want to use gifasm all the time, just set this to 1.

Image Hash Database

##### Image Hash Database settings (Experimental, disabled by default) #####
#
# Set this to 1 to enable the Image Hash database feature (Default value: 0.0)
#focr_enable_image_hashing 0.0
#
# The score is saved with the hash in the database, so no extra scoring for a db hit is required.
#
# If the image hash database feature is enabled, specify the file here to use as database (Default value: /etc/mail/spamassassin/FuzzyOcr.hashdb)
#focr_digest_db /etc/mail/spamassassin/FuzzyOcr.hashdb
#
# Automatically add hashes of spam images recognized by OCR to the Image Hash database, to disable, set to 0.0 (Default value: 1)
#focr_hashing_learn_scanned 1

This version of the plugin includes a custom hash database that serves as a local cache for previously recognized images, so that the OCR engine won't need to be called if the image has been received before. The default location of this database is /etc/mail/spamassassin/FuzzyOcr.hashdb, but you can change this by setting an explicit path for focr_digest_db. The feature is still considered "experimental" and is disabled by default, so if you wish to use it, you need to enable it by setting focr_enable_image_hashing to 1. If you do enable it, you'll want to set focr_hashing_learn_scanned to 1 as well, to ensure that the plugin not only reads the database but writes to it as well. Almost needless to say, your amavis/maia user needs to be able to write to this file in that case.

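15. Patch the FuzzyOcr.pm file

There are a couple of small fixes to be made to the FuzzyOcr.pm file to make the hashing database work properly. The first corrects a string interpolation: in "$score::$digest", Perl treats $score:: as a package-qualified variable name rather than as $score followed by a literal ::, so the braces in "${score}::${digest}" are needed to delimit the variable names. Until these are corrected by the plugin's author, apply the following patches: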

*** FuzzyOcr.pm-orig  2006-08-27 04:35:12.000000000 -0700
--- FuzzyOcr.pm       2006-08-30 15:10:17.934275225 -0700
***************
*** 490,494 ****
      flock( DB, LOCK_EX );
      seek( DB, 0, 2 );
!     print DB "$score::$digest\n";
      flock( DB, LOCK_UN );
      close(DB);
--- 490,494 ----
      flock( DB, LOCK_EX );
      seek( DB, 0, 2 );
!     print DB "${score}::${digest}\n";
      flock( DB, LOCK_UN );
      close(DB);

Copy that block of text to a file called hashdb.patch or somesuch, and apply it with:

patch -p0 < hashdb.patch

Note: If the patching process fails, don't worry--that most likely just means the plugin author has fixed the problem and updated the tarball, so the version you've downloaded already contains the fix.

The second patch is a safeguard against the "poisoning" of your hash database. Without this patch, spammers could include innocuous images (e.g. logos from businesses like eBay, Amazon, PayPal, etc.) alongside their spam images, and the FuzzyOCR plugin would add those to its hash database as well. The patch ensures that only images that contain at least one matching word from the word-list get added to the hash database. Apply the patch as usual:

patch -p0 < fuzzyocr-23b-hashdb-poison.patch
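The two patches can be summed up in a few lines: records are stored as "score::digest", and an image is only recorded when OCR matched at least one list word. The Python sketch below illustrates that logic under those assumptions. It is not the plugin's Perl code: the function names are hypothetical, the digest is treated as an opaque string, and the file locking done by the original (flock) is left out for brevity.

```python
import os

def add_image_hash(db_path, score, digest, matched_words):
    """Append 'score::digest' only when OCR matched at least one list word."""
    if matched_words < 1:
        return False  # anti-poisoning guard: innocuous images are never stored
    with open(db_path, "a", encoding="utf-8") as db:
        db.write(f"{score}::{digest}\n")
    return True

def lookup_image_hash(db_path, digest):
    """Return the stored score for a digest, or None if it is unknown."""
    if not os.path.exists(db_path):
        return None
    with open(db_path, encoding="utf-8") as db:
        for line in db:
            # Split on the first "::" only, since a digest may itself contain "::".
            score, sep, stored = line.rstrip("\n").partition("::")
            if sep and stored == digest:
                return float(score)
    return None
```

Because the score is stored with the digest, a later lookup can return the earlier verdict without running the OCR engine again.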

16. Test the installation

Now you can verify that you've got all the paths set properly and that you have all of the necessary pieces in place. As your amavis/maia user, run:

spamassassin -D --lint

If everything is working properly, this shouldn't produce any errors, and in particular you should see something like:

...
plugin: loading FuzzyOcr from /etc/mail/spamassassin/FuzzyOcr.pm
plugin: registered FuzzyOcr=HASH(0xb9fde84)
plugin: loading Mail::SpamAssassin::Timeout from @INC
plugin: registered Mail::SpamAssassin::Timeout=HASH(0xb18501c)
...

If for some reason you don't see the FuzzyOCR module being loaded, it may be because of some security-related settings in your operating system that may require Perl modules to have their execute bits set. Usually this is unnecessary (and inadvisable), but one Maia user has reported that this was necessary to get the plugin to load properly:

chmod 744 FuzzyOcr.pm
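When the debug output is long, it can help to filter it for the plugin-loading lines instead of reading it by eye. A small Python sketch follows, assuming the "plugin: loading NAME from" wording shown in the sample output above. The function name is hypothetical, and the debug text has to be captured first (for example by redirecting the command's output to a file).

```python
import re

def loaded_plugins(debug_output):
    """Return plugin names reported as loaded in 'spamassassin -D --lint' output."""
    return re.findall(r"plugin: loading (\S+) from", debug_output)
```

If "FuzzyOcr" is missing from the returned list, the loadplugin line in v310.pre or the file permissions are the first things to check.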

The plugin also comes with a number of test emails in the samples subdirectory. As your amavis/maia user, you can test each of these to make sure the plugin detects them properly.

spamassassin -t < animated-gif.eml
...
  21 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
                            Words found:
                            "alert" in 4 lines
                            "charts" in 1 lines
                            "symbol" in 1 lines
                            "alert" in 4 lines
                            "stock" in 2 lines
                            "company" in 3 lines
                            "trade" in 1 lines
                            "xanax" in 1 lines
                            "meridia" in 1 lines
                            "growth" in 1 lines
                            (19 word occurrences found)

spamassassin -t < corrupted-gif.eml
...
 1.5 FUZZY_OCR_WRONG_CTYPE  BODY: Mail contains an image with wrong
                            content-type set
                            Image has format "GIF" but content-type is
                            "image/jpeg"
 2.5 FUZZY_OCR_CORRUPT_IMG  BODY: Mail contains a corrupted image
                            Corrupt image: GIF-LIB error: Image is
                            defective, decoding aborted.
  12 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
                            Words found:
                            "alert" in 1 lines
                            "alert" in 1 lines
                            "stock" in 2 lines
                            "investor" in 1 lines
                            "company" in 1 lines
                            "trade" in 1 lines
                            "target" in 1 lines
                            "service" in 1 lines
                            "recommendation" in 1 lines
                            (10 word occurrences found)

spamassassin -t < jpeg.eml
...
 4.0 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
                            Words found:
                            "viagra" in 2 lines
                            "cialis" in 1 lines
                            "levitra" in 1 lines
                            (4 word occurrences found)

spamassassin -t < png.eml
...
  28 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
                            Words found:
                            "alert" in 2 lines
                            "news" in 2 lines
                            "symbol" in 1 lines
                            "alert" in 2 lines
                            "stock" in 1 lines
                            "investor" in 3 lines
                            "company" in 2 lines
                            "buy" in 1 lines
                            "price" in 2 lines
                            "trade" in 2 lines
                            "target" in 2 lines
                            "service" in 2 lines
                            "recommendation" in 1 lines
                            "levitra" in 1 lines
                            "software" in 2 lines
                            (26 word occurrences found)

Next you can verify that the hashing database is working properly (if you've enabled it, that is), by trying to test one of those emails a second time:

spamassassin -t < animated-gif.eml
...
  21 FUZZY_OCR_KNOWN_HASH   BODY: Mail contains an image with known hash
                            Hash
                            "1009752:453:743:28::255:255:255:255:308010::0
                            :0:0:0:17164::0:64:128:52:3217::128:0:0:38:303
                            1::128:128:0:113:1452" is in the database.

Note: If the animated-gif.eml test fails, try setting focr_gif_max_frames to 1 and try again. This will use an alternate method for unpacking the image frames that may work better for you than the default.

Note: If the png.eml test fails, you're probably using an earlier alpha version of this plugin that has a small typo. Download the latest version of the plugin, as the tarball has been updated with the fix.
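To check several sample mails in one go, the FUZZY_OCR score and the word-occurrence count can be pulled out of each report. The Python sketch below assumes the report layout shown in the sample outputs above. The function name is hypothetical.

```python
import re

def fuzzy_ocr_summary(report):
    """Return (score, occurrences) for the FUZZY_OCR rule, or None if absent."""
    # The \s after the rule name excludes FUZZY_OCR_WRONG_CTYPE and similar rules.
    score = re.search(r"^\s*(\d+(?:\.\d+)?)\s+FUZZY_OCR\s", report, re.MULTILINE)
    count = re.search(r"\((\d+) word occurrences found\)", report)
    if not score or not count:
        return None
    return float(score.group(1)), int(count.group(1))
```

A return value of None for a sample that should match is a hint that the OCR helpers or the word list are not being found.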

17. Tell Maia about the new rules

If everything is working properly, you'll want to run the load-sa-rules script to make sure that Maia discovers the new rules you just added (in the FuzzyOcr.cf file). There should be a handful of new rules:

[load-sa-rules] Adding new rule: FUZZY_OCR (Mail contains an image with common spam text inside)
[load-sa-rules] Adding new rule: FUZZY_OCR_WRONG_CTYPE (Mail contains an image with wrong content-type set)
[load-sa-rules] Adding new rule: FUZZY_OCR_CORRUPT_IMG (Mail contains a corrupted image)
[load-sa-rules] Adding new rule: FUZZY_OCR_KNOWN_HASH (Mail contains an image with known hash)
[load-sa-rules] 4 new rules added (3214 rules total), all scores updated.

18. Restart amavisd-maia

Now you can restart amavisd-maia and start looking for these rules in your log files, and in Maia's mail viewer, once you begin receiving mail items that contain images with text in them. The processing time on such items will be a few seconds longer than usual, but mail items without images in them won't be affected, since the FuzzyOCR plugin won't be called in those cases.

As a side-note, you may notice some unusual warnings when you run the process-quarantine script, such as:

Subroutine new redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 116.
Subroutine parse_config redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 126.
Subroutine dummy_check redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 223.
Subroutine fuzzyocr_check redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 227.
Subroutine load_global_words redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 237.
Subroutine load_personal_words redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 255.
Subroutine parse_scansets redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 278.
Subroutine max redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 285.
Subroutine reorder redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 293.
Subroutine pipe_io redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 298.
Subroutine handle_error redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 410.
Subroutine logfile redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 416.
Subroutine check_image_hash_db redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 435.
Subroutine add_image_hash_db redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 475.
Subroutine calc_image_hash redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 497.
Subroutine debuglog redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 537.
Subroutine wrong_ctype redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 543.
Subroutine corrupt_img redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 562.
Subroutine known_img_hash redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 587.
Subroutine check_fuzzy_ocr redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 602.

These warnings are annoying but harmless noise that results from what seems to be an oversight on the part of the plugin author. We can hope that he will eventually modify his plugin to make it behave properly when more than one SpamAssassin object is loaded into memory at the same time, but until then, you can safely ignore these warnings. A workaround in the process-quarantine script that ships with Maia 1.0.2 should also take care of this, if the plugin author hasn't fixed it by then.
