Reading a PDF with R and Mining Its Text - jieforest - ChinaUnix Blog

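Reading a PDF with R and Mining Its Text

Category: IT Industry

2013-10-11 09:28:32

The example below downloads a PDF, converts it to plain text with pdftotext, and draws a simple word cloud from the result: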


# Here is a PDF for mining
url <- "http://www.noisyroom.net/blog/RomneySpeech072912.pdf"
dest <- tempfile(fileext = ".pdf")
download.file(url, dest, mode = "wb")

# Set the path to pdftotext.exe and convert the PDF to text.
# wait = TRUE blocks until the conversion has finished; with wait = FALSE
# the .txt file may not exist yet when it is opened, which is the likely
# reason the first attempt to open it used to fail.
exe <- "C:\\Program Files\\xpdfbin-win-3.03\\bin32\\pdftotext.exe"
system(paste("\"", exe, "\" \"", dest, "\"", sep = ""), wait = TRUE)

# Get the name of the text file and open it (shell.exec is Windows-only)
filetxt <- sub("\\.pdf$", ".txt", dest)
shell.exec(filetxt)

# Do something with it, e.g. a simple word cloud
library(tm)
library(wordcloud)
library(Rstem)  # provides wordStem(); SnowballC offers a function of the same name

txt <- readLines(filetxt)  # don't mind the warning
txt <- tolower(txt)
# Drop form-feed characters (page breaks) and English stop words
txt <- removeWords(txt, c("\\f", stopwords()))
corpus <- Corpus(VectorSource(txt))
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
d <- data.frame(freq = sort(rowSums(m), decreasing = TRUE))

# Stem the words
d$stem <- wordStem(row.names(d), language = "english")
# Copy the words into a column, otherwise they would be lost when aggregating
d$word <- row.names(d)
# Remove the web address (a very long string)
d <- d[nchar(row.names(d)) < 20, ]

# Aggregate frequency by word stem and keep the first word of each stem
agg_freq <- aggregate(freq ~ stem, data = d, sum)
agg_word <- aggregate(word ~ stem, data = d, function(x) x[1])
d <- cbind(freq = agg_freq[, 2], agg_word)

# Sort by frequency
d <- d[order(d$freq, decreasing = TRUE), ]

# Draw the word cloud
wordcloud(d$word, d$freq)

# Remove only the files created by this script
file.remove(c(dest, filetxt))
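The stem aggregation step is the least obvious part of the script, so here is a minimal sketch of it on toy data, using base R only. The words, stems, and counts are made up for illustration, and the stems are hard-coded rather than produced by wordStem(). The sketch also joins the two aggregates with merge() on the stem column instead of cbind(), which is a substitution on my part: it gives the same pairing but does not rely on both results being in the same row order.

```r
# Toy word counts. The stems are hard-coded purely for illustration;
# in the script above they come from wordStem().
d <- data.frame(
  word = c("jobs", "job", "working", "work", "america"),
  stem = c("job", "job", "work", "work", "america"),
  freq = c(5, 3, 4, 2, 7),
  stringsAsFactors = FALSE
)

# Sum the frequencies of all words that share a stem.
agg_freq <- aggregate(freq ~ stem, data = d, FUN = sum)

# Keep the first word seen for each stem as its display label.
agg_word <- aggregate(word ~ stem, data = d, FUN = function(x) x[1])

# Join the two aggregates on the stem, then sort by frequency.
merged <- merge(agg_freq, agg_word, by = "stem")
merged <- merged[order(merged$freq, decreasing = TRUE), ]

print(merged)
```

The resulting word and freq columns can be passed to wordcloud() in the same way as d$word and d$freq above.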
