Linuxer, ex IBMer. GNU https://hmchzb19.github.io/

2018-02-28 17:45:57

1. This code is a category predictor. It needs the scikit-learn library (imported as sklearn).

pip3 install scikit-learn

2. The code is as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Author: hezhb
# Created Time: Mon 26 Feb 2018 04:27:53 PM CST

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Define the category map (newsgroup name -> readable label)
category_map = {
    "talk.politics.misc": "Politics",
    "rec.autos": "Autos",
    "rec.sport.hockey": "Hockey",
    "sci.electronics": "Electronics",
    "sci.med": "Medicine",
}

# Get the training dataset, restricted to the five categories above
training_data = fetch_20newsgroups(subset="train",
    categories=list(category_map.keys()), shuffle=True, random_state=5)

# Build a count vectorizer and extract term counts
count_vectorizer = CountVectorizer()
train_tc = count_vectorizer.fit_transform(training_data.data)
print("\nDimensions of training data:", train_tc.shape)

# Create the tf-idf transformer and fit it on the term counts
tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_tc)

# Define test data (the last three sentences differ by a single word)
input_data = [
    "You need to be careful with cars when you are driving on slippery roads",
    "A lot of devices can be operated wirelessly",
    "Players need to be careful when they are close to goal posts",
    "Political debates help us understand the perspectives of both sides",
    "Political debates help us understand the perspectives of both slides",
    "Political debates help us understand the perspective of both sides",
]

# Train a Multinomial Naive Bayes classifier
classifier = MultinomialNB().fit(train_tfidf, training_data.target)

# Transform input data using the fitted count vectorizer
input_tc = count_vectorizer.transform(input_data)

# Transform vectorized data using the fitted tf-idf transformer
input_tfidf = tfidf.transform(input_tc)

# Predict the output categories
predictions = classifier.predict(input_tfidf)

# Print the outputs
for sent, category in zip(input_data, predictions):
    print("\nInput:", sent, "\nPredicted category:",
        category_map[training_data.target_names[category]])
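The three steps in the script (count, tf-idf, Naive Bayes) can also be chained into a single scikit-learn Pipeline, so the same transforms are applied at training and prediction time without calling each one by hand. Below is a minimal sketch of that idea. The toy corpus, the labels, and the helper name build_pipeline are made up for illustration and are not part of the original script, which trains on 20 Newsgroups.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the 20 Newsgroups training set.
toy_docs = [
    "the car engine needs new tires",
    "drivers should check the car brakes",
    "the goalie stopped the puck on the ice",
    "the hockey team scored a late goal",
]
toy_labels = ["Autos", "Autos", "Hockey", "Hockey"]


def build_pipeline():
    """Chain the same three steps the script runs one by one."""
    return Pipeline([
        ("counts", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("nb", MultinomialNB()),
    ])


# fit() runs fit_transform on each transformer, then fits the classifier.
model = build_pipeline().fit(toy_docs, toy_labels)

# predict() reuses the fitted vocabulary and idf weights automatically.
toy_predictions = model.predict(["new tires for my car", "a goal on the ice"])
print(toy_predictions)
```

With the real data, the same pipeline would be fitted with training_data.data and training_data.target, and model.predict(input_data) would replace the two transform calls.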

In [15]: %cd /usr/local/src/py/py_nlp
/usr/local/src/py/py_nlp

In [16]: %run -i /usr/local/src/py/py_nlp/category_predictor.py

Dimensions of training data: (2844, 40321)

Input: You need to be careful with cars when you are driving on slippery roads
Predicted category: Autos

Input: A lot of devices can be operated wirelessly
Predicted category: Electronics

Input: Players need to be careful when they are close to goal posts
Predicted category: Hockey

Input: Political debates help us understand the perspectives of both sides
Predicted category: Politics

Input: Political debates help us understand the perspectives of both slides
Predicted category: Medicine

Input: Political debates help us understand the perspective of both sides
Predicted category: Medicine
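The output above only shows the winning category. To see how close each decision was, predict_proba can be used instead of predict. The sketch below is a rough illustration on a made-up toy corpus; the helper class_probabilities and the toy_* names are hypothetical, and the numbers it prints say nothing about the 20 Newsgroups results.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus; the real script trains on 20 Newsgroups instead.
toy_docs = [
    "the car engine needs new tires",
    "drivers should check the car brakes",
    "the goalie stopped the puck on the ice",
    "the hockey team scored a late goal",
]
toy_labels = ["Autos", "Autos", "Hockey", "Hockey"]

toy_vectorizer = CountVectorizer()
toy_tfidf = TfidfTransformer()
toy_features = toy_tfidf.fit_transform(toy_vectorizer.fit_transform(toy_docs))
toy_classifier = MultinomialNB().fit(toy_features, toy_labels)


def class_probabilities(vectorizer, tfidf, classifier, sentences):
    """Return one {class_label: probability} dict per input sentence."""
    features = tfidf.transform(vectorizer.transform(sentences))
    rows = classifier.predict_proba(features)
    return [
        {str(label): float(p) for label, p in zip(classifier.classes_, row)}
        for row in rows
    ]


# The second sentence contains no word from the training vocabulary.
sentences = ["new tires for my car", "zebra quantum"]
results = class_probabilities(toy_vectorizer, toy_tfidf, toy_classifier, sentences)
for sentence, probs in zip(sentences, results):
    print(sentence, {label: round(p, 3) for label, p in probs.items()})
```

A sentence whose words are all missing from the vocabulary produces an all-zero feature vector, so the classifier falls back to the class priors. In the real script the same helper could be called with count_vectorizer, tfidf and classifier; there classifier.classes_ holds integer indices into training_data.target_names.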

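sides: aspects, points of view.
slides: this word means to slip or slide down. Why does changing the word to "slides" cause the sentence to be predicted as the Medicine category?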
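One possible way to start answering that question is to look at what the trained classifier has learned about a single token. MultinomialNB stores the per-class log probability of every feature in feature_log_prob_, and CountVectorizer stores the token-to-column mapping in vocabulary_. The sketch below uses a made-up toy corpus, and token_log_probs is a hypothetical helper; whether "slides" is in the 20 Newsgroups vocabulary, and which class it favours there, would have to be checked by running it against the real count_vectorizer and classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus; the real script trains on 20 Newsgroups instead.
toy_docs = [
    "the car engine needs new tires",
    "drivers should check the car brakes",
    "the goalie stopped the puck on the ice",
    "the hockey team scored a late goal",
]
toy_labels = ["Autos", "Autos", "Hockey", "Hockey"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)
toy_classifier = MultinomialNB().fit(
    TfidfTransformer().fit_transform(toy_counts), toy_labels
)


def token_log_probs(vectorizer, classifier, token):
    """Return {class_label: log P(token | class)}, or None if the token
    is not in the fitted vocabulary (such tokens are ignored at predict time)."""
    column = vectorizer.vocabulary_.get(token.lower())
    if column is None:
        return None
    return {
        str(label): float(classifier.feature_log_prob_[row, column])
        for row, label in enumerate(classifier.classes_)
    }


print("car:", token_log_probs(toy_vectorizer, toy_classifier, "car"))
print("slides:", token_log_probs(toy_vectorizer, toy_classifier, "slides"))
```

If a token comes back as None, it contributes nothing to the prediction, and the remaining words decide the class. If it comes back with a much higher log probability for one class, that single word can be enough to flip a short sentence.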

perspectives: changing the plural to the singular also affects how the sentence is classified. The predicted category becomes Medicine.
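Part of the reason a plural-to-singular change matters is that CountVectorizer with default settings does no stemming, so "perspective" and "perspectives" are two unrelated features. The sketch below shows this with the analyzer that CountVectorizer builds; differing_tokens is a hypothetical helper, and the two sentences are taken from the test data above.

```python
from sklearn.feature_extraction.text import CountVectorizer


def differing_tokens(sentence_a, sentence_b):
    """Return (tokens only in a, tokens only in b) using CountVectorizer's
    default analyzer: lowercasing, words of two or more characters, no stemming."""
    analyzer = CountVectorizer().build_analyzer()
    tokens_a = set(analyzer(sentence_a))
    tokens_b = set(analyzer(sentence_b))
    return tokens_a - tokens_b, tokens_b - tokens_a


plural = "Political debates help us understand the perspectives of both sides"
singular = "Political debates help us understand the perspective of both sides"

only_plural, only_singular = differing_tokens(plural, singular)
print("only in first sentence:", only_plural)
print("only in second sentence:", only_singular)
```

Because the two forms map to different columns, each one carries its own per-class statistics from the training data, and swapping one for the other can move a short sentence across a class boundary.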

3. Result: