Installing nutch-1.0 on Linux: implementing an intranet crawler and search - xueliangfei - ChinaUnix blog

Leungffy XUE, xueliangfei.blog.chinaunix.net

Installing nutch-1.0 on Linux: implementing an intranet crawler and search

Category: LINUX

2012-03-29 17:21:14

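Nutch is a complete open-source full-text search package. It is built on top of Lucene Java and adds web-specific features such as a web crawler, a link-graph database, and parsers for HTML and other document formats.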

Downloading Nutch

1. Choose the directory to install Nutch in. I install it directly under /home/admin:

[root@search-test1 ~]# cd /home/admin/  

2. Download nutch-1.0:

[root@search-test3 admin]# wget ""  

3. Unpack nutch-1.0.tar.gz and create a symlink:

[root@search-test3 admin]# tar -zxf nutch-1.0.tar.gz   
[root@search-test3 admin]# ln -s nutch-1.0 nutch  

Listing of the nutch entries under /home/admin:

[root@search-test3 admin]# ll|grep 'nutch'  
lrwxrwxrwx 1 root root        9 01-12 14:57 nutch -> nutch-1.0  
drwxr-xr-x 9 root root     4096 2009-03-24 nutch-1.0  
-rw-r--r-- 1 root root 86557549 2009-03-28 nutch-1.0.tar.gz  

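Configuring the intranet crawler

1. Create a urls directory under /home/admin/nutch and a file taizhou.txt inside it. The target is a website in Taizhou (many large sites block unknown crawlers like this one, so I ended up choosing taizhou.com).
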
[root@search-test3 nutch]# mkdir /home/admin/nutch/urls;touch /home/admin/nutch/urls/taizhou.txt  
.....  
[root@search-test3 nutch]# cat /home/admin/nutch/urls/taizhou.txt  
http://  
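
The seed-file step can also be scripted. The following is a minimal sketch, not part of the original setup: write_seed is a hypothetical helper, and http://www.example.com/ is a placeholder because the seed URL in the listing above is cut off. It writes one URL per line, which is the form the crawl command reads from the urls directory.

```shell
#!/bin/sh
# write_seed (hypothetical helper): create <nutch_home>/urls/taizhou.txt
# containing the given seed URLs, one per line.
write_seed() {
    nutch_home=$1
    shift
    mkdir -p "$nutch_home/urls" || return 1
    printf '%s\n' "$@" > "$nutch_home/urls/taizhou.txt"
}

# Demo in a scratch directory so nothing under /home/admin is touched.
# The URL is a placeholder, not the one used in the post.
demo_home=$(mktemp -d)
write_seed "$demo_home" "http://www.example.com/"
cat "$demo_home/urls/taizhou.txt"
```

For the real installation the first argument would be /home/admin/nutch.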

2. Edit conf/crawl-urlfilter.txt and replace "MY.DOMAIN.NAME" with "taizhou.com", as shown here:

+^http://([a-z0-9]*\.)*taizhou.com/  
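
Before starting a crawl it can help to check which URLs this pattern admits. The sketch below uses grep -E as a stand-in for Nutch's own regex filter, which is an assumption that holds for a pattern this simple; url_allowed and the sample URLs are made up for illustration. Note that the dots in taizhou.com are unescaped, so they match any character; writing taizhou\.com would be stricter.

```shell
#!/bin/sh
# Pattern copied from the crawl-urlfilter.txt line above, without the leading
# '+' (in the filter file the '+' only marks the line as an "accept" rule).
PATTERN='^http://([a-z0-9]*\.)*taizhou.com/'

# url_allowed (hypothetical helper): exit status 0 if the URL matches PATTERN.
url_allowed() {
    printf '%s\n' "$1" | grep -Eq "$PATTERN"
}

# Sample URLs are illustrative only.
for url in "http://www.taizhou.com/" "http://taizhou.com/index.html" "http://www.example.com/"; do
    if url_allowed "$url"; then
        echo "match:    $url"
    else
        echo "no match: $url"
    fi
done
```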

3. Edit conf/nutch-site.xml to configure the HTTP header information the crawler sends. Only some of the properties are shown here:

[root@search-test3 conf]# cat nutch-site.xml     
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
  <name>http.agent.name</name>
  <value>8qiu-spider</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

        http.robots.agents
        http.agent.description
        http.agent.url
        http.agent.email
        http.agent.version

  and set their values appropriately.
  </description>
</property>
<property>
  <name>http.agent.description</name>
  <value>this is a crawler of 8qiu</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>
<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>
<property>
  <name>http.agent.email</name>
  <value>javalover@yeah.net</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
</configuration>

4. Start the crawler:

/home/admin/nutch/bin/nutch crawl /home/admin/nutch/urls/ -dir /home/admin/nutch/crawl -depth 3 -topN 100  
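
In this command, -dir names the directory the crawl is written to, -depth is the link depth to follow from the seed pages, and -topN caps the number of pages fetched at each level. Once it finishes, a quick sanity check of the output directory can save a confusing empty search page later. The sketch below assumes the usual Nutch 1.0 layout (crawldb, linkdb, segments and index under the -dir directory); crawl_looks_complete is a hypothetical helper, not part of Nutch.

```shell
#!/bin/sh
# crawl_looks_complete (hypothetical helper): check that a crawl directory
# contains the subdirectories a finished Nutch 1.0 crawl is assumed to produce.
crawl_looks_complete() {
    for sub in crawldb linkdb segments index; do
        if [ ! -d "$1/$sub" ]; then
            echo "missing: $1/$sub"
            return 1
        fi
    done
    echo "ok: $1"
}

# Usage, with the path from the crawl command above:
#   crawl_looks_complete /home/admin/nutch/crawl
```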

Installing the web runtime environment

1. Install Tomcat. My Tomcat directory is /usr/local/tomcat.

2. Move the nutch-1.0 WAR package into the webapps directory:

mv nutch-1.0.war /usr/local/tomcat/webapps/  

3. Start Tomcat:

[root@search-test3 nutch]# /usr/local/tomcat/bin/startup.sh  
Using CATALINA_BASE:   /usr/local/tomcat  
Using CATALINA_HOME:   /usr/local/tomcat  
Using CATALINA_TMPDIR: /usr/local/tomcat/temp  
Using JRE_HOME:       /usr/local/jdk1.6.0_10  

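Important: the startup command above must be run from /home/admin/nutch, otherwise the web application cannot find the /home/admin/nutch/crawl directory. As far as I can tell, this is because the searcher.dir property defaults to the relative path crawl, which is resolved against the directory Tomcat was started from.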

Once startup has finished, check the Tomcat log: /usr/local/tomcat/logs/catalina.out
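
Reading the whole log by eye is tedious, so a small filter can help. This is only a rough sketch: log_problems is a hypothetical helper, and the two keywords it looks for (SEVERE and Exception) are an assumption about how problems usually show up in catalina.out, not an exhaustive list.

```shell
#!/bin/sh
# log_problems (hypothetical helper): print, with line numbers, every line of
# the given log file that mentions SEVERE or an exception.
# Exit status is 0 if such lines were found, 1 if none were.
log_problems() {
    grep -nE 'SEVERE|Exception' "$1"
}

# Usage, with the log path from the post:
#   log_problems /usr/local/tomcat/logs/catalina.out
```

No output and a non-zero exit status mean that neither keyword appears in the log.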

If everything is working, searches will now return results.
