Category: LINUX
2010-08-12 15:04:02
Recently my company needed to build a search engine, and that is how I came across Nutch, an Apache project. After reading quite a few articles I set up a local installation for testing. Crawling an intranet works well, but crawling the open Internet still has problems: pages from sites such as Baidu and Google are barely fetched at all. I do not know whether this is a configuration issue or something else; if you do, please contact me. Thanks.
Test environment:
vmware 6.0
redhat 5.1
apache-tomcat-6.0.29.tar.gz
nutch-1.0.tar.gz
jdk-6u21-linux-i586.bin
Introduction to Nutch
Nutch's crawler can fetch pages in two ways. One is intranet crawling, aimed at an internal network or a small number of sites, which uses the single crawl command; the other is whole-web crawling, aimed at the Internet at large, which uses the lower-level commands inject, generate, fetch and updatedb. This document covers the basic use of intranet crawling.
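For orientation, here is a sketch of what the two modes look like on the command line (the paths are illustrative, and <segment> stands for the segment directory that generate creates; both forms reappear later in this post):
# bin/nutch crawl url.txt -dir crawl -depth 3 -threads 4    // intranet crawling: one command drives the whole cycle
# bin/nutch inject crawl/crawldb url.txt    // whole-web crawling: the same cycle via the low-level commands
# bin/nutch generate crawl/crawldb crawl/segments
# bin/nutch fetch <segment> -threads 4
# bin/nutch updatedb crawl/crawldb <segment>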
# cp jdk-6u21-linux-i586.bin /usr/java
# cd /usr/java
# chmod +x jdk-6u21-linux-i586.bin
# ./jdk-6u21-linux-i586.bin
# vi /etc/profile    // append the following Java environment variables
JAVA_HOME=/usr/java/jdk1.6.0_21
export JAVA_HOME
PATH=$JAVA_HOME/bin:$PATH
export PATH
CLASSPATH=$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib/dt.jar:$CLASSPATH
export CLASSPATH
# source /etc/profile    // make the Java environment variables take effect immediately
# java -version    // test the Java environment; if version information comes back, the JDK is installed correctly
# tar zxvf apache-tomcat-6.0.29.tar.gz -C /usr/local
# cd /usr/local/
# mv apache-tomcat-6.0.29 tomcat
# tar zxvf nutch-1.0.tar.gz -C /usr/local
# cd /usr/local
# mv nutch-1.0 nutch
# cd nutch
Add the NUTCH_JAVA_HOME variable and set it to the JDK installation directory (for example, exported from /etc/profile like the variables above):
NUTCH_JAVA_HOME=/usr/java/jdk1.6.0_21
export NUTCH_JAVA_HOME
Preparing Nutch to crawl site pages
In the Nutch installation directory, create a text file named url.txt and put in it the top-level URL of each site to crawl, i.e. the starting pages.
Here I used some of the better-known Chinese sites.
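The exact seed list is not preserved here; the file format is simply one starting URL per line, so a placeholder sketch (substitute your own sites):
http://www.sina.com.cn/
http://www.163.com/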
Edit the conf/crawl-urlfilter.txt file and modify the MY.DOMAIN.NAME section:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*com/
+^http://([a-z0-9]*\.)*cn/
+^http://([a-z0-9]*\.)*net/
Handling dynamic content
Two files under conf need attention here: regex-urlfilter.txt and crawl-urlfilter.txt.
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
This rule skips any URL containing the characters ? * ! @ or =. Since it is enabled by default, and dynamic pages almost always carry a ? in their URLs, such pages are normally never fetched. In both files, comment the rule out:
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
Then add a rule that allows them:
# accept URLs containing certain characters as probable queries, etc.
+[?=&]
This lets Nutch fetch URLs containing the characters ? = and &.
Note: both files must be changed, because Nutch loads the rules in the order crawl-urlfilter.txt -> regex-urlfilter.txt.
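For example, once both files are changed this way, a dynamic URL such as http://www.example.com/news.jsp?id=100&page=2 (a made-up URL) passes the filters and is fetched, where the default rules would have dropped it at the ?.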
Edit the conf/nutch-site.xml file and add the following between the <configuration> tags.
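The XML itself is not preserved in this post. For Nutch 1.0 the property that must be set before any crawl is http.agent.name, since the fetcher refuses to run without it; a minimal sketch of what belongs between the <configuration> tags (the agent values are placeholders):
<property>
  <name>http.agent.name</name>
  <value>test-spider</value>
</property>
<property>
  <name>http.agent.description</name>
  <value>a Nutch 1.0 test crawler</value>
</property>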
/usr/local/nutch/bin/nutch crawl /usr/local/nutch/url.txt -dir /usr/local/nutch/sxit -depth 3 -threads 4 >& /usr/local/nutch/crawl.log
After a while the program finishes. You will find a new directory named sxit under the nutch directory, together with a log file named crawl.log that can be used to diagnose any errors. Among the parameters of the command above, -dir names the directory the crawled data is stored in, -depth is the crawl depth measured from the sites' top-level URLs, and -threads is the number of concurrent fetch threads.
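If the crawl succeeds, sxit contains the standard structures that bin/nutch crawl produces:
sxit/crawldb    // database of all known URLs and their fetch status
sxit/linkdb     // inverted link database (incoming anchors per URL)
sxit/segments/  // one subdirectory per fetch round
sxit/indexes/   // intermediate Lucene index parts
sxit/index/     // the merged index that the search web app reads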
Testing search with Tomcat
Copy nutch-1.0.war from the nutch directory into tomcat/webapps and start Tomcat once, which unpacks a nutch-1.0 folder under webapps. Then open the nutch-site.xml file under nutch-1.0/WEB-INF/classes;
since this is the latest release, delete the original contents of that configuration file and add the following.
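The snippet is likewise not preserved. What the Nutch 1.0 web app needs in this file is searcher.dir pointing at the crawl output, so a minimal sketch:
<configuration>
<property>
  <name>searcher.dir</name>
  <value>/usr/local/nutch/sxit</value>
</property>
</configuration>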
# cd /usr/local/tomcat/webapps/nutch-1.0
# vi search.jsp
Find int hitsPerSite and change the value after the = to 0 (0 removes the per-site cap on results, so every hit from the same site can be shown).
Then add the following code at the end of this JSP file:
<%
if (start >= hitsPerPage) // not on the first page, so show a "previous" link
{
%>
<%-- "previous page" link: an <a> back to search.jsp with start reduced by
     hitsPerPage (the original markup was lost when the post was saved) --%>
<%
}
%>
<%
int startnum=1; // first page number shown in the pager; up to 10 page links, with the current page kept 6th
if((int)(start/hitsPerPage)>=5)
    startnum=(int)(start/hitsPerPage)-4;
for(int i=hitsPerPage*(startnum-1),j=0;i<=hits.getTotal()&&j<=10;)
{
%>
<%-- page-number link: an <a href> to search.jsp with start=i, labelled with
     the page number, goes here (original markup was lost) --%>
<%
i=i+10; // 10 is the number of hits per page
j++;
}
%>
<%
if ((hits.totalIsExact() && end < hits.getTotal()) // more hits to show
    || (!hits.totalIsExact() && (hits.getLength() > start + hitsPerPage))) {
%>
<%-- "next page" link markup goes here --%>
<%} %>
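Finally, to automate recrawls I use the shell script below, which appears to be adapted from the runbot script on the Nutch wiki. It drives the same lower-level inject/generate/fetch/updatedb cycle described at the start of this post against the sxit directory, rebuilds and merges the index, and restarts Tomcat so the web app picks up the new index.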
#!/bin/bash
depth=5
threads=5
RMARGS="-rf"
MVARGS="--verbose"
safe=yes
NUTCH_HOME=/usr/local/nutch
CATALINA_HOME=/usr/local/tomcat

if [ -z "$NUTCH_HOME" ]
then
  echo runbot: $0 could not find environment variable NUTCH_HOME
  echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
  echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi
if [ -z "$CATALINA_HOME" ]
then
  echo runbot: $0 could not find environment variable CATALINA_HOME
  echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
  echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi

# pass topN through to generate if it is set in the environment
if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

steps=8
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject $NUTCH_HOME/sxit/crawldb $NUTCH_HOME/url.txt

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate $NUTCH_HOME/sxit/crawldb $NUTCH_HOME/sxit/segments $topN
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  # the newest segment is the one generate just created
  segment=`ls -d $NUTCH_HOME/sxit/segments/* | tail -1`
  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi
  $NUTCH_HOME/bin/nutch updatedb $NUTCH_HOME/sxit/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs $NUTCH_HOME/sxit/MERGEDsegments $NUTCH_HOME/sxit/segments/*
mv $MVARGS $NUTCH_HOME/sxit/segments $NUTCH_HOME/sxit/BACKUPsegments
mkdir $NUTCH_HOME/sxit/segments
mv $MVARGS $NUTCH_HOME/sxit/MERGEDsegments/* $NUTCH_HOME/sxit/segments
rm $RMARGS $NUTCH_HOME/sxit/MERGEDsegments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks $NUTCH_HOME/sxit/linkdb $NUTCH_HOME/sxit/segments/*

echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index $NUTCH_HOME/sxit/NEWindexes $NUTCH_HOME/sxit/crawldb $NUTCH_HOME/sxit/linkdb $NUTCH_HOME/sxit/segments/*

echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup $NUTCH_HOME/sxit/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge $NUTCH_HOME/sxit/NEWindex $NUTCH_HOME/sxit/NEWindexes

echo "----- Loading New Index (Step 8 of $steps) -----"
# stop Tomcat while the live index is replaced
tom_pid=`ps aux | awk '/usr\/local\/tomcat/ {print $2}'`
kill -9 $tom_pid

if [ "$safe" != "yes" ]
then
  rm $RMARGS $NUTCH_HOME/sxit/NEWindexes
  rm $RMARGS $NUTCH_HOME/sxit/index
else
  mv $MVARGS $NUTCH_HOME/sxit/NEWindexes $NUTCH_HOME/sxit/indexes
  mv $MVARGS $NUTCH_HOME/sxit/NEWindex $NUTCH_HOME/sxit/index
fi

${CATALINA_HOME}/bin/startup.sh
echo "runbot: FINISHED: Crawl completed!"
echo ""