Category: LINUX
2010-03-08 21:08:46
Both Nutch and Hadoop are downloadable from the apache website. The necessary Hadoop files are bundled with Nutch so unless you are going to be developing Hadoop you only need to download Nutch.
We built Nutch from source after downloading it from its subversion repository. There are nightly builds of both Nutch and Hadoop here:
I am using eclipse for development so I used the eclipse plugin for subversion to download both the Nutch and Hadoop repositories. The subversion plugin for eclipse can be downloaded through the update manager using the url:
If you are not using eclipse you will need to get a subversion client. Once you have a subversion client you can either browse the Nutch subversion webpage at:
Or you can access the Nutch subversion repository through the client at:
I checked out the main trunk into my eclipse but it can be checked out to a standard filesystem as well. We are going to use ant to build it so if you have java and ant installed you should be fine.
I am not going to go into how to install java or ant; if you are working with this level of software you should know how to do that, and there are plenty of tutorials on building software with ant. If you want a complete reference for ant, pick up Erik Hatcher's book "Java Development with Ant":
Once you have Nutch downloaded go to the download directory where you should see the following folders and files:
+ bin
Add a build.properties file and inside of it add a variable called dist.dir with its value being the location where you want to build nutch. So if you are building on a linux machine it would look something like this:
dist.dir=/path/to/build
This step is actually optional as Nutch will create a build directory inside of the directory where you unzipped it by default, but I prefer building it to an external directory. You can name the build directory anything you want but I recommend using a new empty folder to build into. Remember to create the build folder if it doesn't already exist.
To build Nutch, call the package ant task from the Nutch source directory:
ant package
This should build Nutch into your build folder. When it finishes you are ready to move on to deploying and configuring Nutch.
Once we get nutch deployed to all six machines we are going to call a script called start-all.sh that starts the services on the master node and data nodes. This means that the script is going to start the hadoop daemons on the master node and then will ssh into all of the slave nodes and start daemons on the slave nodes.
The start-all.sh script is going to expect that nutch is installed in exactly the same location on every machine. It is also going to expect that Hadoop is storing the data at the exact same filepath on every machine.
The way we did it was to create the following directory structure on every machine. The search directory is where Nutch is installed. The filesystem directory is the root of the Hadoop filesystem. The home directory is the nutch user's home directory. On our master node we also installed a Tomcat 5.5 server for searching.
/nutch
    /search
    /filesystem
    /local
    /home
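Because start-all.sh expects identical paths on every node, the copy of Nutch to each slave can be scripted. Below is a minimal sketch, not the tutorial's own method: `deploy_nutch` is a hypothetical helper that prints the rsync commands for each host in a slaves file, so they can be reviewed before being piped to sh. It assumes password-less ssh as the nutch user is already set up.

```shell
# deploy_nutch: print the rsync commands that would copy the local
# /nutch/search tree to each host listed in a slaves file (one host
# name per line). Pipe the output to sh to actually run the copies.
deploy_nutch() {
    slaves_file=$1
    while read -r host; do
        [ -n "$host" ] || continue          # skip blank lines
        echo rsync -az /nutch/search/ "nutch@${host}:/nutch/search/"
    done < "$slaves_file"
}
```

Usage would be something like `deploy_nutch /nutch/search/conf/slaves | sh`, run from the master node after the first build.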
I am not going to go into detail about how to install Tomcat, as again there are plenty of tutorials on how to do that. I will say that we removed all of the wars from the webapps directory and created a folder called ROOT under webapps, into which we unzipped the Nutch war file (nutch-0.8-dev.war). This makes it easy to edit configuration files inside of the Nutch war.
So log into the master node and all of the slave nodes as root. Create the nutch user and the directory structure with the following commands:
ssh -l root devcluster01
mkdir /nutch
mkdir /nutch/search
mkdir /nutch/filesystem
mkdir /nutch/local
mkdir /nutch/home
groupadd users
useradd -d /nutch/home -g users nutch
chown -R nutch:users /nutch
passwd nutch (enter the nutch user's password twice when prompted)
Again, if you don't have root-level access you will still need the same user on each machine, because the start-all.sh script expects it. It doesn't have to be a user named nutch, although that is what we use, and you could put the filesystem under that common user's home directory. Basically, you don't have to be root, but it helps.
The start-all.sh script that starts the daemons on the master and slave nodes needs to be able to log in through ssh without a password. For this we are going to have to set up ssh keys on each of the nodes. Since the master node starts daemons on itself, we also need the ability to use a password-less login on the master node itself.
You might have seen some old tutorials, or information floating around the user lists, saying that you would need to edit the SSH daemon's configuration and set up local environment variables for the ssh logins through an environment file. This has changed: we no longer need to edit the ssh daemon, and we can set the environment variables inside the hadoop-env.sh file. Open the hadoop-env.sh file in vi:
cd /nutch/search/conf
vi hadoop-env.sh
Below is a template for the environment variables that need to be changed in the hadoop-env.sh file:
export HADOOP_HOME=/nutch/search
export JAVA_HOME=/usr/java/jdk1.5.0_06
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
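The HADOOP_SLAVES variable points at the slaves file, which simply lists one slave host name per line. A hypothetical example for a six-machine cluster like this one (only devcluster01 and devcluster02 actually appear in this article; the other host names are assumptions following the same pattern):

```
devcluster02
devcluster03
devcluster04
devcluster05
devcluster06
```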
There are other variables in hadoop-env.sh that affect the behavior of Hadoop. If you start getting ssh errors when you run the scripts later, try changing the HADOOP_SSH_OPTS variable. Note also that, after the initial copy, you can set HADOOP_MASTER in conf/hadoop-env.sh and it will rsync changes from the master to each slave node. There is a section below on how to do this.
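As a sketch of that rsync setup (the host name is the master node from this walkthrough; the value format is host:directory):

```shell
# In conf/hadoop-env.sh on every node: after the initial copy, each slave
# will rsync its code from this host:directory when its daemons start up.
export HADOOP_MASTER=devcluster01:/nutch/search
```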
Next we are going to create the keys on the master node and copy them over to each of the slave nodes. This must be done as the nutch user we created earlier. Don't just su in as the nutch user; start up a new shell and log in as the nutch user. If you su in, the password-less login we are about to set up will not work in testing, but it will work when a new session is started as the nutch user.
cd /nutch/home
ssh-keygen -t rsa (Use empty responses for each prompt)
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /nutch/home/.ssh/id_rsa.
Your public key has been saved in /nutch/home/.ssh/id_rsa.pub.
The key fingerprint is:
a6:5c:c3:eb:18:94:0b:06:a1:a6:29:58:fa:80:0a:bc nutch@localhost
On the master node you will copy the public key you just created to a file called authorized_keys in the same directory:
cd /nutch/home/.ssh
cp id_rsa.pub authorized_keys
You only have to run the ssh-keygen on the master node. On each of the slave nodes after the filesystem is created you will just need to copy the keys over using scp.
scp /nutch/home/.ssh/authorized_keys nutch@devcluster02:/nutch/home/.ssh/authorized_keys
You will have to enter the password for the nutch user the first time. An ssh prompt will appear the first time you log in to each computer, asking if you want to add the computer to the known hosts; answer yes. Once the key is copied you shouldn't have to enter a password when logging in as the nutch user. Test it by logging into the slave nodes you just copied the keys to:
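Note that the scp above overwrites any authorized_keys file already on the slave. If a slave might have keys of its own, appending is safer. A minimal local sketch of that idea (`append_key` is a hypothetical helper name, run on the slave after copying just the .pub file over; it mirrors what ssh-copy-id does):

```shell
# append_key: append a public key to an authorized_keys file, creating the
# file if needed and skipping the key if it is already present.
append_key() {
    pubkey_file=$1
    auth_file=$2
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"        # sshd ignores group/world-writable key files
    grep -qxF "$(cat "$pubkey_file")" "$auth_file" \
        || cat "$pubkey_file" >> "$auth_file"
}
```

Running it twice with the same key leaves only one copy of the key in the file, so it is safe to re-run during setup.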
ssh devcluster02
nutch@devcluster02$ (a command prompt should appear without requiring a password)
hostname (should return the name of the slave node, here devcluster02)
Once we have the ssh keys created we are ready to start deploying Nutch to all of the slave nodes.
The environment where the problem occurred: Nutch starts up in distributed mode, and I want to run a distributed crawl. On the distributed filesystem I created a urls folder containing a url.txt file with the seed URLs.
The result is that the crawl fetches 0 pages.
Analyzing and resolving the problem: