Versions 0.8.0 and 0.9.0
Place in the bin sub-directory within your Nutch install and run.
CALL THE SCRIPT USING THE FULL PATH TO THE SCRIPT OR IT WON'T WORK
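For example, assuming Nutch is installed at /usr/local/nutch (the path used in the example below), installing the script is just a copy plus an execute bit:

cp recrawl /usr/local/nutch/bin/recrawl
chmod +x /usr/local/nutch/bin/recrawl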
Example Usage
/usr/local/nutch/bin/recrawl /usr/local/tomcat/webapps/ROOT /usr/local/nutch/crawl 10 31
Setting adddays to 31 causes all pages to be recrawled.
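This works because Nutch puts a page on a fetchlist only after its fetch interval has expired (db.default.fetch.interval, which defaults to 30 days); adding 31 days to the clock at fetchlist-generation time makes every page look due for a refetch. The script simply forwards the value to the generate command, roughly (paths taken from the example above):

/usr/local/nutch/bin/nutch generate /usr/local/nutch/crawl/crawldb /usr/local/nutch/crawl/segments -adddays 31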
Changes for 0.9.0
No changes necessary for this to run with Nutch 0.9.0.
Code
#!/bin/bash
# Nutch recrawl script.
# Based on the 0.7.2 script at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
#
# The script merges all the new segments into one segment to prevent redundant
# data. However, if your crawl/segments directory is becoming very large, I
# would suggest you delete it completely and generate a new crawl. This probably
# needs to be done every 6 months.
#
# Modified by Matthew Holt
# mholt at elon dot edu
if [ -n "$1" ] then tomcat_dir=$1 else echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]" echo "servlet_path - Path of the nutch servlet (full path, ie: /usr/local/tomc at/webapps/ROOT)" echo "crawl_dir - Path of the directory the crawl is located in. (full path, i e: /home/user/nutch/crawl)" echo "depth - The link depth from the root page that should be crawled." echo "adddays - Advance the clock # of days for fetchlist generation. [0 for n one]" echo "[topN] - Optional: Selects the top # ranking URLS to be crawled." exit 1 fi
if [ -n "$2" ] then crawl_dir=$2 else echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]" echo "servlet_path - Path of the nutch servlet (full path, ie: /usr/local/tomc at/webapps/ROOT)" echo "crawl_dir - Path of the directory the crawl is located in. (full path, i e: /home/user/nutch/crawl)" echo "depth - The link depth from the root page that should be crawled." echo "adddays - Advance the clock # of days for fetchlist generation. [0 for n one]" echo "[topN] - Optional: Selects the top # ranking URLS to be crawled." exit 1 fi
if [ -n "$3" ] then depth=$3 else echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]" echo "servlet_path - Path of the nutch servlet (full path, ie: /usr/local/tomc at/webapps/ROOT)" echo "crawl_dir - Path of the directory the crawl is located in. (full path, i e: /home/user/nutch/crawl)" echo "depth - The link depth from the root page that should be crawled." echo "adddays - Advance the clock # of days for fetchlist generation. [0 for n one]" echo "[topN] - Optional: Selects the top # ranking URLS to be crawled." exit 1 fi
if [ -n "$4" ] then adddays=$4 else echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]" echo "servlet_path - Path of the nutch servlet (full path, ie: /usr/local/tomcat/webapps/ROOT)" echo "crawl_dir - Path of the directory the crawl is located in. (full path, ie: /home/user/nutch/crawl)" echo "depth - The link depth from the root page that should be crawled." echo "adddays - Advance the clock # of days for fetchlist generation. [0 for n one]" echo "[topN] - Optional: Selects the top # ranking URLS to be crawled." exit 1 fi
if [ -n "$5" ] then topn="-topN $5" else topn="" fi
# Sets the path to bin (this is why the script must be called by its full
# path: `dirname $0` is used to locate the nutch binary)
nutch_dir=`dirname $0`
# Only change if your crawl subdirectories are named something different
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/segments
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index
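# For reference, a crawl directory produced by Nutch 0.8/0.9's "bin/nutch crawl"
# normally contains (assuming the default names used above):
#   crawldb/  - fetch state of every known URL
#   linkdb/   - inverted link database (inlinks/anchors)
#   segments/ - one timestamped segment per generate/fetch/update pass
#   index/    - the merged Lucene index served by the search webapp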
# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  $nutch_dir/nutch generate $webdb_dir $segments_dir $topn -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  $nutch_dir/nutch fetch $segment
  $nutch_dir/nutch updatedb $webdb_dir $segment
done
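After the fetch cycle, the script (per its header comment) merges the new segments into a single segment and rebuilds the index so the search webapp picks up the fresh data. Below is a minimal sketch of those remaining steps; it assumes the standard Nutch 0.8/0.9 commands (mergesegs, invertlinks, index, dedup, merge), and the scratch directories MERGEDsegments, newindexes, and index-new are hypothetical names, not part of the original script.

# Merge all segments into one to avoid redundant data
# (MERGEDsegments is a hypothetical scratch directory)
$nutch_dir/nutch mergesegs $crawl_dir/MERGEDsegments -dir $segments_dir
rm -rf $segments_dir/*
mv $crawl_dir/MERGEDsegments/* $segments_dir
rmdir $crawl_dir/MERGEDsegments

# Rebuild the link database from the merged segment
$nutch_dir/nutch invertlinks $linkdb_dir -dir $segments_dir

# Index the merged segment and remove duplicate documents
# (newindexes is a hypothetical scratch directory)
$nutch_dir/nutch index $crawl_dir/newindexes $webdb_dir $linkdb_dir $segments_dir/*
$nutch_dir/nutch dedup $crawl_dir/newindexes

# Merge into a fresh index and swap it into place
$nutch_dir/nutch merge $crawl_dir/index-new $crawl_dir/newindexes
rm -rf $index_dir
mv $crawl_dir/index-new $index_dir
rm -rf $crawl_dir/newindexes

# Touch web.xml so Tomcat reloads the search webapp with the new index
touch $tomcat_dir/WEB-INF/web.xml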