python_nlp: an interesting category_predictor - hmchzb19 - ChinaUnix blog

python_nlp: an interesting category_predictor

Category: Python/Ruby

2018-02-28 17:45:57

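I have recently been watching a video on sequence learning and studying someone else's code along the way. A typo of mine skewed the test input a little, but it exposed an interesting phenomenon.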

1. The code is a category_predictor and depends on the scikit-learn library (imported as sklearn).
On Kali Linux, install it with this command:

pip3 install scikit-learn

2. The code is as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Author: hezhb
# Created Time: Mon 26 Feb 2018 04:27:53 PM CST
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Define the category map (newsgroup name -> readable label)
category_map = {
    "talk.politics.misc": "Politics",
    "rec.autos": "Autos",
    "rec.sport.hockey": "Hockey",
    "sci.electronics": "Electronics",
    "sci.med": "Medicine",
}

# Get the training dataset, restricted to the five newsgroups above
training_data = fetch_20newsgroups(
    subset="train",
    categories=list(category_map.keys()),
    shuffle=True,
    random_state=5,
)

# Build a count vectorizer and extract term counts
count_vectorizer = CountVectorizer()
train_tc = count_vectorizer.fit_transform(training_data.data)
print("\nDimensions of training data:", train_tc.shape)

# Create the tf-idf transformer
tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_tc)

# Define test data; the last two sentences each differ from the
# fourth by a single letter
input_data = [
    "You need to be careful with cars when you are driving on slippery roads",
    "A lot of devices can be operated wirelessly",
    "Players need to be careful when they are close to goal posts",
    "Political debates help us understand the perspectives of both sides",
    "Political debates help us understand the perspectives of both slides",
    "Political debates help us understand the perspective of both sides",
]

# Train a Multinomial Naive Bayes classifier
classifier = MultinomialNB().fit(train_tfidf, training_data.target)

# Transform input data using the count vectorizer
input_tc = count_vectorizer.transform(input_data)

# Transform vectorized data using the tf-idf transformer
input_tfidf = tfidf.transform(input_tc)

# Predict the output categories
predictions = classifier.predict(input_tfidf)

# Print the outputs
for sent, category in zip(input_data, predictions):
    print("\nInput:", sent, "\nPredicted category:",
          category_map[training_data.target_names[category]])
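
To make the CountVectorizer and TfidfTransformer steps of the script above less of a black box, here is a minimal plain-Python sketch of what they compute under their default settings: lowercase the text, keep tokens of two or more word characters, weight each count by the smoothed idf ln((1 + n) / (1 + df)) + 1, and L2-normalize. The names tokenize, fit_idf, tfidf_vector and the three-sentence corpus are hypothetical and only for illustration; this is not the library's implementation.

```python
import math
import re
from collections import Counter

# Same default token pattern as CountVectorizer: words of 2+ characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def fit_idf(docs):
    """Return {term: idf} with the smoothed formula ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # count each term once per document
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vector(text, idf):
    """Count in-vocabulary terms, weight them by idf, then L2-normalize."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


# Hypothetical toy corpus, not the 20 Newsgroups data.
toy_docs = [
    "the car skids on the icy road",
    "the doctor examines the patient",
    "the debate covers both sides",
]
idf = fit_idf(toy_docs)
vec = tfidf_vector("the debate shows both slides", idf)
print(sorted(vec))
```

Words that never appeared in the training corpus ("shows" and "slides" in this toy example) contribute nothing to the vector, and a word that occurs in every document ("the") gets a lower weight than a rarer one ("debate").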

The output is quite interesting:

In [15]: %cd /usr/local/src/py/py_nlp
/usr/local/src/py/py_nlp
In [16]: %run -i /usr/local/src/py/py_nlp/category_predictor.py
Dimensions of training data: (2844, 40321)
Input: You need to be careful with cars when you are driving on slippery roads
Predicted category: Autos
Input: A lot of devices can be operated wirelessly
Predicted category: Electronics
Input: Players need to be careful when they are close to goal posts
Predicted category: Hockey
Input: Political debates help us understand the perspectives of both sides
Predicted category: Politics
Input: Political debates help us understand the perspectives of both slides
Predicted category: Medicine
Input: Political debates help us understand the perspective of both sides
Predicted category: Medicine

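The last three sentences differ from each other by a single letter, yet the predicted category changes.
sides: aspects, the word I meant to type.
slides: this word means to slip or slide. Why does changing "sides" to "slides" get the sentence predicted as Medicine?
In the last sentence:
perspectives: changing the plural to the singular "perspective" also changes how the sentence is classified. The predicted category is Medicine.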

3. Result:
This is an interesting phenomenon, but for now I do not know what causes it.
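
One direction worth investigating, offered as an assumption and not a confirmed explanation: a Naive Bayes classifier scores a sentence by summing per-word log-likelihoods for each class, so replacing one word that favors class A with one that favors class B can be enough to flip the result when the other words carry little evidence. The sketch below shows that mechanism with a tiny hand-written multinomial Naive Bayes on raw counts. The training sentences and the names train_nb, class_scores and predict are hypothetical, and unlike the script above it uses neither tf-idf weights nor the 20 Newsgroups data.

```python
import math
from collections import Counter


def train_nb(labeled_docs, alpha=1.0):
    """Fit class priors and per-class token log-likelihoods (Laplace smoothing)."""
    counts = {}
    class_docs = Counter()
    vocab = set()
    for label, text in labeled_docs:
        tokens = text.lower().split()
        counts.setdefault(label, Counter()).update(tokens)
        class_docs[label] += 1
        vocab.update(tokens)
    total_docs = sum(class_docs.values())
    log_prior = {c: math.log(n / total_docs) for c, n in class_docs.items()}
    log_like = {}
    for c, token_counts in counts.items():
        denom = sum(token_counts.values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((token_counts[t] + alpha) / denom) for t in vocab}
    return log_prior, log_like, vocab


def class_scores(text, model):
    """Return {class: log prior + sum of log-likelihoods of the known tokens}."""
    log_prior, log_like, vocab = model
    tokens = [t for t in text.lower().split() if t in vocab]  # unseen words are dropped
    return {c: log_prior[c] + sum(log_like[c][t] for t in tokens) for c in log_prior}


def predict(text, model):
    """Pick the class with the highest score."""
    scores = class_scores(text, model)
    return max(scores, key=scores.get)


# Hypothetical training sentences, chosen so that "sides" is typical of
# Politics and "slides" is typical of Medicine.
toy_training = [
    ("Politics", "both sides of the debate"),
    ("Politics", "the debate has two sides"),
    ("Medicine", "the lab slides show the tissue"),
    ("Medicine", "doctors study the slides"),
]
model = train_nb(toy_training)
print(predict("we understand both sides", model))
print(predict("we understand both slides", model))
```

In the real script, a similar check would be to look at classifier.predict_proba(input_tfidf) for the three political sentences and to test whether "slides" and "perspective" are keys of count_vectorizer.vocabulary_. If the class probabilities for those sentences turn out to be close to each other, a single word could plausibly tip the balance.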
