Amitabha
Category: Servers & Storage
2013-09-29 18:32:03
The error appeared as soon as I ran the crawl command.

Error: stopping at depth 0 no more urls to fetch

Solution: the modified configuration files had not been synced to the other Nutch nodes in the distributed cluster, so they still saw the old settings.
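In other words, every node has to see the same conf/ files before the crawl is launched again. A minimal sketch of syncing them, assuming the nodes are reachable over SSH and Nutch is installed under /usr/local/nutch (the hostnames and the path are assumptions, not from the original setup):

# Assumed hostnames and install path -- adjust to the actual cluster layout.
for node in node1 node2 node3; do
  scp /usr/local/nutch/conf/nutch-site.xml \
      /usr/local/nutch/conf/crawl-urlfilter.txt \
      "$node":/usr/local/nutch/conf/
done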
nutch-site.xml also needs to be modified.
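For a local file system crawl, nutch-site.xml typically needs an http.agent.name and a plugin.includes list that enables protocol-file instead of protocol-http. The sketch below is an assumed minimal example; the agent name and the exact plugin list are placeholders, not necessarily the file used here:

<?xml version="1.0"?>
<configuration>
  <!-- Nutch refuses to fetch anything unless an agent name is set. -->
  <property>
    <name>http.agent.name</name>
    <value>local-fs-crawler</value> <!-- assumed value -->
  </property>
  <!-- Enable protocol-file so file:// URLs can be fetched. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>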
Remember we have to crawl the local file system. Hence we have to modify the entries as follows.

Modify crawl-urlfilter.txt:

# skip http:, ftp:, & mailto: urls
##-^(file|ftp|mailto):
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# accept anything else
+.*

Note that the original -^(file|ftp|mailto): rule is commented out and replaced with -^(http|ftp|mailto):, so file:// URLs are accepted while web URLs are skipped; the MY.DOMAIN.NAME rule stays commented out for the same reason.

The urls file (the seed list of local directories to crawl):

file://c:/resumes/word
file://c:/resumes/pdf
#file:///data/readings/semanticweb/
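With the seed file and the filter in place, and the same configuration copied to every node, the crawl can be re-run. The classic Nutch 1.x crawl command looks roughly like this; the output directory, depth, and topN values are only illustrative:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50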