Category: Python/Ruby

2018-08-21 20:24:08

# Bayesian Methods to Create an Anti-spam Filter
We can construct P(Spam | Word) for every (meaningful) word we encounter during training.
Then, when analyzing a new mail, we multiply these together to get the probability that it is spam.
This assumes that the presence of different words is independent of each other, which is one reason this is called "Naive" Bayes.

The idea is: ignore the relationships between words, compute the 'spam' contribution of each word on its own, and then combine the contributions of all the words to score a new mail.
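As a toy illustration of that multiplication (the per-word probabilities below are made-up numbers for illustration, not values from the real training data):

# Toy Naive Bayes scoring with invented per-word probabilities.
p_spam = 0.4                                   # prior P(Spam)
p_word_given_spam = {'free': 0.30, 'viagra': 0.20, 'golf': 0.01}
p_word_given_ham  = {'free': 0.05, 'viagra': 0.001, 'golf': 0.10}

def spam_score(words):
    # P(Spam | words) is proportional to P(Spam) * prod P(word | Spam);
    # normalize against the corresponding ham score.
    ps, ph = p_spam, 1 - p_spam
    for w in words:
        ps *= p_word_given_spam.get(w, 1e-3)   # crude smoothing for unseen words
        ph *= p_word_given_ham.get(w, 1e-3)
    return ps / (ps + ph)

print(spam_score(['free', 'viagra']))   # close to 1 -> spam
print(spam_score(['golf']))             # close to 0 -> ham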

Below is the code.
It first reads the training data with pandas, then builds a spam classifier with scikit-learn, and finally uses that classifier to predict whether two example strings should be classified as spam or ham.


#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Author: hezhb
# Created Time: Tue 01 May 2018 11:49:35 AM CST

import os

import pandas as pd
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def readFiles(path):
    """Walk a directory and yield (filepath, body) for every mail file,
    keeping only the body (everything after the first blank line)."""
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(root, filename)

            inBody = False
            lines = []
            with open(filepath, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        # the first blank line separates the headers from the body
                        inBody = True

            message = '\n'.join(lines)
            yield filepath, message


def dataFrameFromDirectory(path, classification):
    """Build a DataFrame with one row per mail, labelled 'spam' or 'ham'."""
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)
    return DataFrame(rows, index=index)


PATH = './hands-on/emails/'

# DataFrame.append was removed in pandas 2.0, so concatenate instead
data = pd.concat([
    dataFrameFromDirectory(PATH + 'spam', 'spam'),
    dataFrameFromDirectory(PATH + 'ham', 'ham'),
])
#print(data.head())

"""
Now we will use CountVectorizer to split up each message into its list of words
and throw that into a MultinomialNB classifier; call fit() and we've got
a trained spam filter ready to go.
"""

vectorizer = CountVectorizer(encoding='latin1')
counts = vectorizer.fit_transform(data['message'].values)   # sparse word-count matrix
classifier = MultinomialNB()
targets = data['class'].values                              # 'spam' / 'ham' labels
classifier.fit(counts, targets)

# Now try this classifier out on two new messages
examples = ['Free viagra Now', 'Hi Bob, how about a game of golf tomorrow.']
example_counts = vectorizer.transform(examples)             # reuse the fitted vocabulary
predictions = classifier.predict(example_counts)
print(predictions)
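With the training data from the course, these two examples should come out as something like ['spam' 'ham']. The post only spot-checks two strings; to get a rough idea of how well the filter generalizes, one could also hold out part of the data as a test set. A minimal sketch (the 80/20 split and random_state are arbitrary choices, not part of the original post):

from sklearn.model_selection import train_test_split

train_msgs, test_msgs, train_y, test_y = train_test_split(
    data['message'].values, data['class'].values,
    test_size=0.2, random_state=42)                # 80/20 split, arbitrary choice

vec = CountVectorizer(encoding='latin1')
clf = MultinomialNB()
clf.fit(vec.fit_transform(train_msgs), train_y)    # fit the vocabulary on training data only
accuracy = clf.score(vec.transform(test_msgs), test_y)
print('held-out accuracy:', accuracy)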

