Installing and Using Nutch 0.9 on Windows (Part 1) - ubuntuer - ChinaUnix blog


Category: WINDOWS

2009-01-02 11:38:39

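I. Installation

1. Install the JDK and Tomcat

Both the J2EE SDK and Apache Tomcat run as native Windows programs in this setup, so download the Windows builds, not the Linux ones. Installing either one is straightforward and needs no further comment.

2. Install Cygwin

The file you download from the official Cygwin site is only a setup program; run it to download and install Cygwin itself. Installation guides are easy to find online, and the process is fairly simple.

3. Install Nutch

Copy the archive downloaded from the official Nutch site into the usr\local directory under your Cygwin installation, then extract it. Any other directory works too, but keeping it inside the Cygwin tree makes the commands easier to run. Extracting first and copying afterwards is equally fine. Finally, I renamed the extracted folder to nutch; if your folder has a different name, substitute it in the commands below.

In Cygwin, change into the nutch-0.9 directory and run bin/nutch as a test. If the installation is working, it prints a usage message listing the available Nutch commands.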

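II. Configuring Nutch and running a crawl

Some preparation is needed before crawling. Unless stated otherwise, the steps below are carried out in Windows rather than in Cygwin.

1. In the Nutch directory, create a file that holds the URLs to crawl. Here we create a text file named url.txt with the following content: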

http://blog.chinaunix.net/u2/76292/

2. Open conf/crawl-urlfilter.txt in the Nutch directory and set the scope of the crawl as follows:

# accept hosts in MY.DOMAIN.NAME
+^http://blog.chinaunix.net/u2/76292/
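To get a feel for what this accept rule lets through, the sketch below imitates it with grep -E. It assumes the text after the leading + is treated as a regular expression tested against each URL, and that URLs matching no accept rule are dropped. The url_allowed helper is hypothetical (it is not part of Nutch), and every sample URL except the seed is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: approximates a single "+<regex>" accept rule
# from conf/crawl-urlfilter.txt using grep -E. Not part of Nutch.
PATTERN='^http://blog.chinaunix.net/u2/76292/'

url_allowed() {
    # Exit status 0 when the URL matches the accept pattern.
    printf '%s\n' "$1" | grep -Eq "$PATTERN"
}

# Illustrative URLs; only the first one comes from the article.
for url in \
    'http://blog.chinaunix.net/u2/76292/' \
    'http://blog.chinaunix.net/u2/76292/showart.html' \
    'http://blog.chinaunix.net/u2/12345/' \
    'http://www.example.com/'
do
    if url_allowed "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

Note that the dots in the pattern are unescaped, so each one matches any character; writing blog\.chinaunix\.net would make the rule stricter.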

3. Open conf/nutch-site.xml in the Nutch directory and add the following between <configuration> and </configuration>:

<property>
  <name>http.agent.name</name>
  <value>ubuntuer</value>
  <description></description>
</property>

<property>
  <name>http.agent.description</name>
  <value>ubuntuer</value>
  <description></description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://blog.chinaunix.net/u2/76292/</value>
  <description></description>
</property>

<property>
  <name>http.agent.email</name>
  <value>iptabler@gmail.com</value>
  <description></description>
</property>

Set the value of each property above to describe your own crawler; this information is attached to the HTTP requests the crawler sends to servers.
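Since a mistyped tag in this file is easy to miss, a quick check before crawling can help. The sketch below is a crude text match rather than a real XML parser, and has_property is a hypothetical helper, not something Nutch provides. It assumes it is run from the Nutch directory, so that conf/nutch-site.xml resolves.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): succeeds when FILE contains
# <name>NAME</name> followed by a non-empty <value>...</value>.
# A crude text match, not a real XML parser.
has_property() {
    file=$1
    name=$2
    # Join all lines so the name and value elements can be matched together.
    tr -d '\r\n' < "$file" |
        grep -Eq "<name>${name}</name>[[:space:]]*<value>[^<]+</value>"
}

# Example: report on the four properties used in this article.
conf=conf/nutch-site.xml
if [ -f "$conf" ]; then
    for p in http.agent.name http.agent.description http.agent.url http.agent.email; do
        if has_property "$conf" "$p"; then
            echo "ok      $p"
        else
            echo "missing $p"
        fi
    done
fi
```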

4. Run the following commands in Cygwin:

$ cd nutch-0.9
$ bin/nutch crawl url.txt -dir blog -depth 2 -threads 10 >& crawl.log

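Here url.txt is the text file of URLs created in step 1, and blog is the directory in which Nutch stores the index it builds; it is needed again when configuring Tomcat. The -depth parameter sets how deep the crawler follows links from the starting URLs, and -threads sets the number of concurrent fetch threads. The >& crawl.log redirection sends both standard output and standard error to crawl.log.

The only thing to watch here is the paths: url.txt and blog are resolved relative to the directory the command is run from.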
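Once the crawl has finished, it is worth confirming that the blog directory was actually populated before moving on to Tomcat. The sketch below assumes the layout that the Nutch 0.9 crawl command normally leaves behind (crawldb, linkdb, segments, indexes and index); treat that list as an assumption and compare it with what you see on disk. check_crawl_dir is a hypothetical helper, not part of Nutch.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): prints the expected
# subdirectories that are missing from a crawl output directory.
# The list below is an assumption about the Nutch 0.9 crawl layout.
check_crawl_dir() {
    dir=$1
    missing=0
    for sub in crawldb linkdb segments indexes index; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing: $dir/$sub"
            missing=1
        fi
    done
    # Exit status 0 only when every expected subdirectory exists.
    return "$missing"
}

# Example: check the output directory used in this article, if present.
if [ -d blog ]; then
    check_crawl_dir blog && echo "blog looks complete"
fi
```

If anything is reported missing, crawl.log is the first place to look for the reason.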

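III. Setting up the search application in Tomcat

1. In Tomcat's webapps directory, delete the ROOT folder (if that makes you nervous, rename it instead). Rename nutch-0.9.war from the Nutch directory to ROOT.war and copy it into Tomcat\webapps; Tomcat will automatically unpack it into a folder named ROOT.

2. Add the following to webapps\ROOT\WEB-INF\classes\nutch-site.xml: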

<property>
  <name>searcher.dir</name>
  <value>C:\cygwin\home\zj\nutch\nutch-0.9\blog</value>
</property>

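Change the value to the directory where your crawl stored its index.

3. Open the deployed application in a web browser, and you can use Nutch to search the content of the pages you just crawled.

This should all go smoothly; the main thing to get right is the paths. If paths are unfamiliar territory, brush up on them first.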
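Step 1 of this section can also be done from the Cygwin shell instead of Windows Explorer. The sketch below is one possible way to do it, not the author's procedure: deploy_nutch_war is a hypothetical helper, both of its arguments are placeholders for your own install paths, and it moves the stock ROOT application out of webapps as ROOT.bak rather than deleting it.

```shell
#!/bin/sh
# Hypothetical helper: copies the Nutch war into Tomcat as ROOT.war.
# Usage: deploy_nutch_war NUTCH_DIR TOMCAT_DIR  (both are placeholders)
deploy_nutch_war() {
    nutch_dir=$1
    tomcat_dir=$2
    webapps="$tomcat_dir/webapps"

    if [ ! -f "$nutch_dir/nutch-0.9.war" ]; then
        echo "nutch-0.9.war not found in $nutch_dir" >&2
        return 1
    fi

    # Keep the stock ROOT application as a backup outside webapps,
    # so Tomcat does not deploy the backup as another application.
    if [ -d "$webapps/ROOT" ]; then
        mv "$webapps/ROOT" "$tomcat_dir/ROOT.bak" || return 1
    fi

    # Tomcat unpacks ROOT.war into webapps/ROOT when it deploys it.
    cp "$nutch_dir/nutch-0.9.war" "$webapps/ROOT.war"
}
```

A call would look like deploy_nutch_war /usr/local/nutch-0.9 "$TOMCAT_HOME", where TOMCAT_HOME is an assumed variable pointing at your Tomcat installation.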
