Category: LINUX

2010-03-08 21:08:46

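1. Google search keywords: nutch 分布式配置 (nutch distributed configuration)
 There is a lot of material there that is worth reading.
2. Nutch is distributed in three respects: distributed crawling, distributed indexing, and distributed searching.
3. Reference article for the detailed configuration:
  http://blog.csdn.net/lianqiang198505/archive/2007/04/18/1569680.aspx
4. Some useful articles about Nutch:
  "Nutch-0.9 source code: overall analysis of the Crawl class" (repost)
  "A study of the Nutch source code"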

Downloading Nutch and Hadoop


Both Nutch and Hadoop are downloadable from the Apache website. The necessary Hadoop files are bundled with Nutch, so unless you are going to be developing Hadoop you only need to download Nutch.

We built Nutch from source after downloading it from its subversion repository. There are nightly builds of both Nutch and Hadoop here:

I am using Eclipse for development, so I used the Eclipse plugin for Subversion to download both the Nutch and Hadoop repositories. The Subversion plugin for Eclipse can be downloaded through the update manager using the URL:

If you are not using eclipse you will need to get a subversion client. Once you have a subversion client you can either browse the Nutch subversion webpage at:

Or you can access the Nutch subversion repository through the client at:

I checked out the main trunk into Eclipse, but it can be checked out to a standard filesystem as well. We are going to use ant to build it, so if you have Java and ant installed you should be fine.

I am not going to go into how to install Java or ant. If you are working with this level of software you should know how to do that, and there are plenty of tutorials on building software with ant. If you want a complete reference for ant, pick up Erik Hatcher's book "Java Development with Ant":

Building Nutch and Hadoop


Once you have Nutch downloaded go to the download directory where you should see the following folders and files:

+ bin
+ conf
+ docs
+ lib
+ site
+ src
build.properties (add this one)
build.xml
CHANGES.txt
default.properties
index.html
LICENSE.txt
README.txt

Add a build.properties file and inside of it add a variable called dist.dir with its value being the location where you want to build nutch. So if you are building on a linux machine it would look something like this:

dist.dir=/path/to/build

This step is actually optional as Nutch will create a build directory inside of the directory where you unzipped it by default, but I prefer building it to an external directory. You can name the build directory anything you want but I recommend using a new empty folder to build into. Remember to create the build folder if it doesn't already exist.
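
As a minimal sketch of this step, the commands below create the build folder and write build.properties. NUTCH_SRC and BUILD_DIR are hypothetical variable names, not something the Nutch build defines. They default to temporary directories so the snippet can be tried safely, and should be pointed at your real checkout and build location.

```shell
# Hypothetical locations: override NUTCH_SRC and BUILD_DIR with your real paths.
NUTCH_SRC="${NUTCH_SRC:-$(mktemp -d)}"              # directory that contains build.xml
BUILD_DIR="${BUILD_DIR:-$(mktemp -d)/nutch-build}"  # where the build output should go

# Create the build folder if it doesn't already exist.
mkdir -p "$BUILD_DIR"

# Write the single property described above.
# Note: this overwrites any existing build.properties in NUTCH_SRC.
printf 'dist.dir=%s\n' "$BUILD_DIR" > "$NUTCH_SRC/build.properties"

# Show the result for a quick visual check.
cat "$NUTCH_SRC/build.properties"
```

Writing the file from a script rather than by hand makes it easier to keep the same build location on every machine you build on.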

To build nutch call the package ant task like this:

ant package

This should build nutch into your build folder. When it is finished you are ready to move on to deploying and configuring nutch.


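1. Note: on Linux, use

svn checkout

to download one of the Nutch versions, normally the main version under trunk.
Then add a file named build.properties to the downloaded project; its content is the location where the compiled files will be placed.
Finally, use

ant package

to compile the project source. Once that finishes, continue with the configuration below.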

Setting Up The Deployment Architecture


Once we get nutch deployed to all six machines we are going to call a script called start-all.sh that starts the services on the master node and data nodes. This means that the script is going to start the hadoop daemons on the master node and then will ssh into all of the slave nodes and start daemons on the slave nodes.

The start-all.sh script is going to expect that nutch is installed in exactly the same location on every machine. It is also going to expect that Hadoop is storing the data at the exact same filepath on every machine.

The way we did it was to create the following directory structure on every machine. The search directory is where Nutch is installed. The filesystem directory is the root of the Hadoop filesystem. The home directory is the nutch user's home directory. On our master node we also installed a Tomcat 5.5 server for searching.

/nutch
  /search
    (nutch installation goes here)
  /filesystem
  /local (used for local directory for searching)
  /home
    (nutch user's home directory)
  /tomcat (only on one server for searching)

I am not going to go into detail about how to install Tomcat, as again there are plenty of tutorials on how to do that. I will say that we removed all of the wars from the webapps directory and created a folder called ROOT under webapps, into which we unzipped the Nutch war file (nutch-0.8-dev.war). This makes it easy to edit configuration files inside of the Nutch war.

So log into the master node and all of the slave nodes as root. Create the nutch user and the different filesystems with the following commands:

ssh -l root devcluster01

mkdir /nutch
mkdir /nutch/search
mkdir /nutch/filesystem
mkdir /nutch/local
mkdir /nutch/home

groupadd users
useradd -d /nutch/home -g users nutch
chown -R nutch:users /nutch
passwd nutch (enter the nutch user's password when prompted)
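
Because the same layout has to exist on every machine, it can help to create it with a re-runnable snippet. The following is a minimal sketch, assuming a hypothetical NUTCH_ROOT variable that would be /nutch on a real node (it defaults to a temporary directory here so it can be tried without root access). It only creates directories and reports whether the nutch user exists; creating the user still requires root.

```shell
# NUTCH_ROOT is a hypothetical variable; on a real node it would be /nutch.
NUTCH_ROOT="${NUTCH_ROOT:-$(mktemp -d)/nutch}"

for dir in search filesystem local home; do
  # -p creates missing parents and does not fail if the directory already exists
  mkdir -p "$NUTCH_ROOT/$dir"
done

# Report whether the nutch user is already present on this machine.
if id -u nutch >/dev/null 2>&1; then
  echo "user nutch exists"
else
  echo "user nutch is missing: create it as root with groupadd/useradd"
fi

ls "$NUTCH_ROOT"
```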

Again, if you don't have root-level access you will still need the same user on each machine, as the start-all.sh script expects it. The user doesn't have to be named nutch, although that is what we use. You could also put the filesystem under the common user's home directory. Basically, you don't have to be root, but it helps.

The start-all.sh script that starts the daemons on the master and slave nodes needs to be able to use a password-less login through ssh. For this we have to set up ssh keys on each of the nodes. Since the master node is going to start daemons on itself, we also need the ability to use a password-less login on the master itself.

You might have seen some old tutorials or information floating around the user lists saying that you need to edit the SSH daemon configuration to allow a property, and to set up local environment variables for the ssh logins through an environment file. This has changed. We no longer need to edit the ssh daemon configuration, and we can set the environment variables inside the hadoop-env.sh file. Open the hadoop-env.sh file in vi:

cd /nutch/search/conf
vi hadoop-env.sh

Below is a template for the environment variables that need to be changed in the hadoop-env.sh file:

export HADOOP_HOME=/nutch/search
export JAVA_HOME=/usr/java/jdk1.5.0_06
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves

There are other variables in this file which affect the behavior of Hadoop. If you start getting ssh errors when you run the script later, try changing the HADOOP_SSH_OPTS variable. Note also that, after the initial copy, you can set HADOOP_MASTER in your conf/hadoop-env.sh and Hadoop will use rsync to push changes on the master to each slave node. There is a section below on how to do this.
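
A wrong path in hadoop-env.sh tends to surface only later, when the daemons fail to start. As a hedged sketch, the helper below checks that the variables from the template point at things that exist. check_hadoop_env is a hypothetical function name and is not part of Hadoop; it assumes the variables have already been set, for example by sourcing the edited file.

```shell
# check_hadoop_env is a hypothetical helper, not part of Hadoop.
# It expects JAVA_HOME, HADOOP_HOME and HADOOP_SLAVES to be set already.
check_hadoop_env() {
  status=0
  [ -x "$JAVA_HOME/bin/java" ] || { echo "no executable bin/java under JAVA_HOME: $JAVA_HOME"; status=1; }
  [ -d "$HADOOP_HOME/conf" ]   || { echo "no conf directory under HADOOP_HOME: $HADOOP_HOME"; status=1; }
  [ -f "$HADOOP_SLAVES" ]      || { echo "slaves file not found: $HADOOP_SLAVES"; status=1; }
  return "$status"
}

# Possible usage on a node (paths as used in this walkthrough):
#   . /nutch/search/conf/hadoop-env.sh
#   check_hadoop_env && echo "hadoop-env.sh paths look consistent"
```

The function returns non-zero if any check fails, so it can be used as a guard before calling start-all.sh.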

Next we are going to create the keys on the master node and copy them over to each of the slave nodes. This must be done as the nutch user we created earlier. Don't just su in as the nutch user; start up a new shell and log in as the nutch user. If you su in, the password-less login we are about to set up will not work in testing, but it will work when a new session is started as the nutch user.

cd /nutch/home

ssh-keygen -t rsa (Use empty responses for each prompt)
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /nutch/home/.ssh/id_rsa.
Your public key has been saved in /nutch/home/.ssh/id_rsa.pub.
The key fingerprint is:
a6:5c:c3:eb:18:94:0b:06:a1:a6:29:58:fa:80:0a:bc nutch@localhost

On the master node you will copy the public key you just created to a file called authorized_keys in the same directory:

cd /nutch/home/.ssh
cp id_rsa.pub authorized_keys
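
If key-based login is refused later, file permissions are a common cause: OpenSSH's sshd, with its StrictModes option enabled, ignores key files that other users can write to. The sketch below tightens the permissions. SSH_DIR is a hypothetical variable that would be /nutch/home/.ssh on the master; it defaults to a temporary directory here so the commands can be tried safely.

```shell
# SSH_DIR is a hypothetical variable; on the master it would be /nutch/home/.ssh.
SSH_DIR="${SSH_DIR:-$(mktemp -d)/.ssh}"
mkdir -p "$SSH_DIR"
touch "$SSH_DIR/authorized_keys"   # stands in for the file created by the cp above

# Only the owner may enter the directory or read/write the key file.
chmod 700 "$SSH_DIR"
chmod 600 "$SSH_DIR/authorized_keys"

ls -ld "$SSH_DIR" "$SSH_DIR/authorized_keys"
```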

You only have to run the ssh-keygen on the master node. On each of the slave nodes after the filesystem is created you will just need to copy the keys over using scp.

scp /nutch/home/.ssh/authorized_keys nutch@devcluster02:/nutch/home/.ssh/authorized_keys
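
With several slave nodes, the copy can be repeated in a loop. The following is a sketch under stated assumptions: the host list is hypothetical (only devcluster01 and devcluster02 are named in this walkthrough), and SLAVES, DRY_RUN and run are names invented for this example. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
# Hypothetical host list: replace with the real names of your slave nodes.
SLAVES="${SLAVES:-devcluster02 devcluster03}"
# DRY_RUN=1 (the default here) only prints the commands; DRY_RUN=0 runs them.
DRY_RUN="${DRY_RUN:-1}"

# run: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

for host in $SLAVES; do
  # Make sure the target .ssh directory exists before copying into it.
  run ssh "nutch@$host" "mkdir -p /nutch/home/.ssh && chmod 700 /nutch/home/.ssh"
  run scp /nutch/home/.ssh/authorized_keys "nutch@$host:/nutch/home/.ssh/authorized_keys"
  # BatchMode=yes makes ssh fail instead of prompting, so a password
  # prompt cannot be mistaken for a working key-based login.
  run ssh -o BatchMode=yes "nutch@$host" hostname
done
```

The first two commands for each host will still ask for the nutch user's password, since the key is not in place yet; only the final check is expected to run without one.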

You will have to enter the password for the nutch user the first time. An ssh prompt will appear the first time you log in to each computer, asking if you want to add the computer to the known hosts. Answer yes to the prompt. Once the key is copied you shouldn't have to enter a password when logging in as the nutch user. Test it by logging into the slave nodes that you just copied the keys to:

ssh devcluster02
nutch@devcluster02$ (a command prompt should appear without requiring a password)
hostname (should return the name of the slave node, here devcluster02)

Once we have the ssh keys created we are ready to start deploying nutch to all of the slave nodes.


Problems encountered
1.

[root@master nutch-0.9]# ./bin/nutch crawl urls -dir crawled -depth 3

crawl started in: crawled
rootUrlDir = urls
threads = 10
depth = 3
Injector: starting
Injector: crawlDb: crawled/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawled/segments/20100311091919
Generator: filtering: false
Generator: topN: 2147483647
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawled


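Environment in which the problem occurred: Nutch starts up in distributed mode and I want to run a distributed crawl. On the distributed filesystem I created a urls folder containing a url.txt file with the following content: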

http://www.sina.com.cn
http://www.apache.org

The result is that the crawl fetched nothing (0 records).
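
The last log message suggests checking the seed list and the URL filters. As one way to explore the filter side (not a confirmed cause of this problem), the sketch below imitates first-match-wins +regex / -regex rules in plain shell. url_passes is a hypothetical helper and the rule file is made up for illustration, with an unedited placeholder domain. It is not Nutch's own filter code, and grep -E patterns can differ from the Java regular expressions Nutch uses.

```shell
# url_passes URL RULES_FILE
# Hypothetical helper: returns 0 if the first matching rule starts with '+',
# 1 if it starts with '-' or if no rule matches at all.
url_passes() {
  url=$1
  rules=$2
  while IFS= read -r line; do
    case $line in
      ''|'#'*) continue ;;   # skip blank lines and comments
      +*) printf '%s\n' "$url" | grep -Eq -- "${line#?}" && return 0 ;;
      -*) printf '%s\n' "$url" | grep -Eq -- "${line#?}" && return 1 ;;
    esac
  done < "$rules"
  return 1
}

# Made-up rule file: accept only a placeholder domain, reject everything else.
rules=$(mktemp)
cat > "$rules" <<'EOF'
# accept URLs in the (unedited) placeholder domain
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# reject everything else
-.
EOF

for url in http://www.sina.com.cn http://www.apache.org; do
  if url_passes "$url" "$rules"; then
    echo "accepted: $url"
  else
    echo "rejected: $url"
  fi
done
```

With this made-up rule file both seed URLs are rejected, because neither matches the placeholder domain. If the filter rules on the cluster look similar, the generator would have nothing to select, which would be consistent with the log above.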

Analysis and resolution of the problem are in progress:
