Category: LINUX

2010-10-18 22:36:39

My machine runs Ubuntu 10.04, and the Hadoop release I installed is Hadoop 0.20.2 (February 2010).

1 Required software: the JDK

sudo apt-get install sun-java6-jdk


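After installation, the JDK is located at /usr/lib/jvm/java-6-sun.
If the installation succeeded, typing java at the command line prints the usage message of the java command.

2 Add a user for the Hadoop system
I added a user named hadoop whose group is also hadoop. The commands are: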

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hadoop


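This adds the user hadoop and the group hadoop to the system.

3 Configure SSH
Hadoop manages its nodes over SSH. For a single-node system, all we need is for ssh localhost to work. Installing OpenSSH Server on Ubuntu is very easy and takes a single command: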

sudo apt-get install openssh-server

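Ubuntu then downloads and installs OpenSSH Server automatically and resolves all dependencies. Once this is done, you can go to another computer, start an SSH client (PuTTY is strongly recommended), and enter your server's IP address. If everything is working, the connection is established after a moment, and you should be able to log in with an existing user name and password.

In fact, unless you have special requirements, the OpenSSH Server installation is complete at this point. A little further configuration, however, can shorten the login time and make OpenSSH more secure. All of this is done by editing the OpenSSH configuration file, sshd_config.

First, when you tried the remote login just now, you may have noticed a long wait after entering the user name before the password prompt appeared. This happens because sshd performs a reverse DNS lookup on the client. Disabling this behavior greatly speeds up login. Open the sshd_config file (on Ubuntu, /etc/ssh/sshd_config), find the GSSAPI options section, and comment out the following two lines: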

#GSSAPIAuthentication yes
#GSSAPIDelegateCredentials no

Then restart the ssh service:

sudo /etc/init.d/ssh restart


Next, we generate an SSH key for the hadoop user.

 user@ubuntu:~$ su - hadoop
 hadoop@ubuntu:~$ ssh-keygen -t rsa -P ""
 Generating public/private rsa key pair.
 Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
 Created directory '/home/hadoop/.ssh'.
 Your identification has been saved in /home/hadoop/.ssh/id_rsa.
 Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
 The key fingerprint is:
 9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hadoop@ubuntu
 The key


Then allow SSH access with the newly generated key.

hadoop@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
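
If ssh localhost still asks for a password after this step, overly loose file permissions are a common cause: with its default StrictModes setting, sshd ignores key files that other users could modify. The following is a minimal sketch of a fix; the function name secure_ssh_dir is an assumption made for illustration and is not part of OpenSSH or Hadoop.

```shell
# Hypothetical helper: tighten permissions on an SSH directory ($1)
# and on the authorized_keys file inside it, if there is one.
secure_ssh_dir() {
    # Only the owner may read, write, or enter the directory.
    chmod 700 "$1" || return 1
    if [ -f "$1/authorized_keys" ]; then
        # Only the owner may read or write the key list.
        chmod 600 "$1/authorized_keys"
    fi
}
```

As the hadoop user, it could be called as secure_ssh_dir "$HOME/.ssh".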


Finally, test the new SSH setup.

hadoop@ubuntu:~$ ssh localhost
 The authenticity of host 'localhost (::1)' can't be established.
 RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
 Are you sure you want to continue connecting (yes/no)? yes
 Warning: Permanently added 'localhost


4 Disable IPv6
Edit the conf/hadoop-env.sh file (it is inside the Hadoop directory unpacked in step 5) and add:

export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
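
When this setup is scripted or repeated, it helps to add the line only once. The sketch below shows one way to do that; ensure_line is a hypothetical helper name, not something shipped with Hadoop.

```shell
# Hypothetical helper: append line $2 to file $1 unless an identical
# line is already present.
ensure_line() {
    # -q: no output, -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF -e "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}
```

For example: ensure_line conf/hadoop-env.sh 'export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true'. Running it a second time leaves the file unchanged.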


5 Install Hadoop
Download the Hadoop release archive from an Apache download mirror, then unpack it. I unpacked it in my home directory.

 $ sudo tar xzf hadoop-0.20.2.tar.gz
 $ sudo mv hadoop-0.20.2 hadoop
 $ sudo chown -R hadoop:hadoop hadoop


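Configure Hadoop: edit the startup configuration
Before version 0.20, Hadoop needed two configuration files: hadoop-default.xml and hadoop-site.xml. From 0.20 on, these two files are gone and three files take their place: core-site.xml, hdfs-site.xml, and mapred-site.xml. The underlying reason is that the Hadoop code base kept growing and was split into three major branches that are developed independently, so the configuration files were split as well.
(1) Edit core-site.xml
 This file sets some properties that Hadoop needs. Copy core-default.xml from /home/hadoopor/hadoop-0.20.2/src/core to the conf directory and rename it core-site.xml. Then change the following:
hadoop.tmp.dir
This parameter sets the directory for temporary files. By default the master stores its metadata under this directory, and the slaves store all uploaded files there. I chose /home/hadoop/hadoop_tmp as the data directory.
Note: because every file uploaded to Hadoop is stored under the directory named by hadoop.tmp.dir, make sure that this directory has enough space.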
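
The space requirement can be checked before Hadoop is started. The following is a rough sketch; free_kb and prepare_tmp_dir are hypothetical helper names, and the threshold passed to prepare_tmp_dir is an arbitrary value to be chosen for your own data.

```shell
# Hypothetical helper: print the free space, in kilobytes, of the
# file system that holds directory $1.
free_kb() {
    # -P forces one POSIX-format line per file system, -k reports 1024-byte
    # blocks; the fourth column of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical helper: create data directory $1 and warn when less
# than $2 kilobytes are free on its file system.
prepare_tmp_dir() {
    mkdir -p "$1" || return 1
    avail=$(free_kb "$1")
    [ -n "$avail" ] || return 1
    if [ "$avail" -lt "$2" ]; then
        echo "warning: only ${avail} KB free for $1" >&2
        return 2
    fi
    echo "$1 is ready, ${avail} KB free"
}
```

For example, prepare_tmp_dir /home/hadoop/hadoop_tmp 10485760 would ask for roughly 10 GB of free space.
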
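fs.default.name
This parameter is the URI of the default file system, that is, the address on which the master's NameNode listens. The slaves connect to the master through this address. Set it as follows: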


<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
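
A site file only has to list the properties that differ from the defaults, so instead of copying and editing the whole default file, a minimal core-site.xml can also be written directly. The sketch below does this with the two values used in this article; write_core_site is a hypothetical helper name.

```shell
# Hypothetical helper: write a minimal core-site.xml into the conf
# directory given as $1, using the values chosen in this article.
write_core_site() {
    cat > "$1/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hadoop_tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
EOF
}
```

Run from the Hadoop directory, write_core_site conf would overwrite any existing conf/core-site.xml, so keep a backup if you have already edited that file.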



(2) Edit mapred-site.xml

This file holds the MapReduce settings. Copy mapred-default.xml from /home/hadoopor/hadoop-0.20.2/src/mapred to the conf directory and rename it mapred-site.xml. The commands are the same as for core-site.xml.

Change the following property:


<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.</description>
</property>


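(3) Edit hdfs-site.xml. This file holds the HDFS settings. Copy hdfs-default.xml from /home/hadoopor/hadoop-0.20.2/src/hdfs to the conf directory and rename it hdfs-site.xml. Nothing in this file needs to be changed.
(4) Edit the masters and slaves files. Write the IP addresses of the master machine and the slave machines into these files. On a single machine, put localhost in both.

6 Format the NameNode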

hadoop@ubuntu:~$ <HADOOP_INSTALL>/bin/hadoop namenode -format


The output should look like this (the sample was captured with Hadoop installed under /usr/local/hadoop):

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format
 10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
 /************************************************************
 STARTUP_MSG: Starting NameNode
 STARTUP_MSG: host = ubuntu/127.0.1.1
 STARTUP_MSG: args = [-format]
 STARTUP_MSG: version = 0.20.2
 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
 ************************************************************/
 10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
 10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
 10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
 10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
 10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hadoop/dfs/name has been successfully formatted.
 10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
 /************************************************************
 SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
 ************************************************************/
 hadoop@ubuntu:/usr/local/hadoop$
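
Formatting is meant to be done once, because reformatting discards the metadata of an existing HDFS. A small guard can prevent doing it by accident. This is only a sketch: safe_to_format is a hypothetical helper, and it assumes that the NameNode directory is left at its default location, dfs/name under hadoop.tmp.dir.

```shell
# Hypothetical helper: succeed only when no NameNode storage exists yet
# under the Hadoop temporary directory given as $1.
# Assumption: dfs.name.dir keeps its default, ${hadoop.tmp.dir}/dfs/name.
safe_to_format() {
    if [ -d "$1/dfs/name" ]; then
        echo "refusing to format: $1/dfs/name already exists" >&2
        return 1
    fi
    return 0
}
```

It could be combined with the format command like this: safe_to_format /home/hadoop/hadoop_tmp && bin/hadoop namenode -format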


7 Start Hadoop
Run the following command:

hadoop@ubuntu:~$ <HADOOP_INSTALL>/bin/start-all.sh

The output should be:


hadoop@ubuntu:/usr/local/hadoop$ bin/start-all.sh
 starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-namenode-ubuntu.out
 localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-datanode-ubuntu.out
 localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-ubuntu.out
 starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-jobtracker-ubuntu.out
 localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-tasktracker-ubuntu.out
 hadoop@ubuntu:/usr/local/hadoop$


After a successful start, run jps. You should see output like this:

 hadoop@ubuntu:/usr/local/hadoop$ jps
 2287 TaskTracker
 2149 JobTracker
 1938 DataNode
 2085 SecondaryNameNode
 2349 Jps
 1788 NameNode
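
Checking the list by eye works, but the check can also be scripted. The sketch below reads jps output and names every daemon from the list above that is not running; missing_daemons is a hypothetical helper name.

```shell
# Hypothetical helper: read `jps` output on standard input and print
# the name of each expected Hadoop daemon that does not appear in it.
missing_daemons() {
    out=$(cat)
    for name in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # Compare against the whole second field, so that "NameNode"
        # is not satisfied by a "SecondaryNameNode" line.
        printf '%s\n' "$out" \
            | awk -v n="$name" '$2 == n { found = 1 } END { exit !found }' \
            || echo "$name"
    done
}
```

Usage: jps | missing_daemons. No output means that all five daemons are running.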


8 Stop Hadoop
Run the command:

hadoop@ubuntu:~$ <HADOOP_INSTALL>/bin/stop-all.sh

The output should be as follows:


hadoop@ubuntu:/usr/local/hadoop$ bin/stop-all.sh
 stopping jobtracker
 localhost: stopping tasktracker
 stopping namenode
 localhost: stopping datanode
 localhost: stopping secondarynamenode
 hadoop@ubuntu:/usr/local/hadoop$


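That completes the configuration.
The following part can be used to test the setup. I will not go through it in detail, and it is left in its original English.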

Running a MapReduce job

We will now run your first Hadoop MapReduce job. We will use the WordCount example job, which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. More information on what happens behind the scenes is available at the Hadoop Wiki.

Download example input data

We will use three ebooks from Project Gutenberg for this example:

Download each ebook as a plain text file in US-ASCII encoding and store the uncompressed files in a temporary directory of your choice, for example /tmp/gutenberg.

 hadoop@ubuntu:~$ ls -l /tmp/gutenberg/
 total 3592
 -rw-r--r-- 1 hadoop hadoop  674425 2007-01-22 12:56 20417-8.txt
 -rw-r--r-- 1 hadoop hadoop 1423808 2006-08-03 16:36 7ldvc10.txt
 -rw-r--r-- 1 hadoop hadoop 1561677 2004-11-26 09:48 ulyss12.txt
 hadoop@ubuntu:~$

Restart the Hadoop cluster

Restart your Hadoop cluster if it's not running already.

 hadoop@ubuntu:~$ <HADOOP_INSTALL>/bin/start-all.sh

Copy local example data to HDFS

Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop's HDFS.

 hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg gutenberg
 hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
 Found 1 items
 drwxr-xr-x   - hadoop supergroup          0 2010-05-08 17:40 /user/hadoop/gutenberg
 hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls gutenberg
 Found 3 items
 -rw-r--r--   1 hadoop supergroup     674762 2010-05-08 17:40 /user/hadoop/gutenberg/20417.txt
 -rw-r--r--   1 hadoop supergroup    1573044 2010-05-08 17:40 /user/hadoop/gutenberg/4300.txt
 -rw-r--r--   1 hadoop supergroup    1391706 2010-05-08 17:40 /user/hadoop/gutenberg/7ldvc10.txt
 hadoop@ubuntu:/usr/local/hadoop$

Run the MapReduce job

Now, we actually run the WordCount example job.

 hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount gutenberg gutenberg-output

This command will read all the files in the HDFS directory gutenberg, process them, and store the result in the HDFS directory gutenberg-output.

Example output of the previous command in the console:

 hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount gutenberg gutenberg-output
 10/05/08 17:43:00 INFO input.FileInputFormat: Total input paths to process : 3
 10/05/08 17:43:01 INFO mapred.JobClient: Running job: job_201005081732_0001
 10/05/08 17:43:02 INFO mapred.JobClient:  map 0% reduce 0%
 10/05/08 17:43:14 INFO mapred.JobClient:  map 66% reduce 0%
 10/05/08 17:43:17 INFO mapred.JobClient:  map 100% reduce 0%
 10/05/08 17:43:26 INFO mapred.JobClient:  map 100% reduce 100%
 10/05/08 17:43:28 INFO mapred.JobClient: Job complete: job_201005081732_0001
 10/05/08 17:43:28 INFO mapred.JobClient: Counters: 17
 10/05/08 17:43:28 INFO mapred.JobClient:   Job Counters 
 10/05/08 17:43:28 INFO mapred.JobClient:     Launched reduce tasks=1
 10/05/08 17:43:28 INFO mapred.JobClient:     Launched map tasks=3
 10/05/08 17:43:28 INFO mapred.JobClient:     Data-local map tasks=3
 10/05/08 17:43:28 INFO mapred.JobClient:   FileSystemCounters
 10/05/08 17:43:28 INFO mapred.JobClient:     FILE_BYTES_READ=2214026
 10/05/08 17:43:28 INFO mapred.JobClient:     HDFS_BYTES_READ=3639512
 10/05/08 17:43:28 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=3687918
 10/05/08 17:43:28 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=880330
 10/05/08 17:43:28 INFO mapred.JobClient:   Map-Reduce Framework
 10/05/08 17:43:28 INFO mapred.JobClient:     Reduce input groups=82290
 10/05/08 17:43:28 INFO mapred.JobClient:     Combine output records=102286
 10/05/08 17:43:28 INFO mapred.JobClient:     Map input records=77934
 10/05/08 17:43:28 INFO mapred.JobClient:     Reduce shuffle bytes=1473796
 10/05/08 17:43:28 INFO mapred.JobClient:     Reduce output records=82290
 10/05/08 17:43:28 INFO mapred.JobClient:     Spilled Records=255874
 10/05/08 17:43:28 INFO mapred.JobClient:     Map output bytes=6076267
 10/05/08 17:43:28 INFO mapred.JobClient:     Combine input records=629187
 10/05/08 17:43:28 INFO mapred.JobClient:     Map output records=629187
 10/05/08 17:43:28 INFO mapred.JobClient:     Reduce input records=102286 

Check if the result is successfully stored in HDFS directory gutenberg-output:

 hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls 
 Found 2 items
 drwxr-xr-x   - hadoop supergroup          0 2010-05-08 17:40 /user/hadoop/gutenberg
 drwxr-xr-x   - hadoop supergroup          0 2010-05-08 17:43 /user/hadoop/gutenberg-output
 hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls gutenberg-output
 Found 2 items
 drwxr-xr-x   - hadoop supergroup          0 2010-05-08 17:43 /user/hadoop/gutenberg-output/_logs
 -rw-r--r--   1 hadoop supergroup     880330 2010-05-08 17:43 /user/hadoop/gutenberg-output/part-r-00000
 hadoop@ubuntu:/usr/local/hadoop$ 

If you want to modify some Hadoop settings on the fly like increasing the number of Reduce tasks, you can use the "-D" option:

 hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount -D mapred.reduce.tasks=16 gutenberg gutenberg-output
An important note about mapred.map.tasks: Hadoop does not honor mapred.map.tasks beyond considering it a hint. But it accepts the user-specified mapred.reduce.tasks and doesn't manipulate that. You cannot force mapred.map.tasks but can specify mapred.reduce.tasks.
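
The command above writes to gutenberg-output again. Hadoop refuses to start a job whose output directory already exists, so the old results have to be removed before a rerun. The following sketch wraps both steps; rerun_wordcount and the HADOOP_CMD variable are hypothetical names introduced here, while dfs -rmr is the recursive delete of the hadoop command in this release.

```shell
# Path to the hadoop launcher; it can be overridden, which also makes
# the sketch easy to try without a running cluster.
HADOOP_CMD=${HADOOP_CMD:-bin/hadoop}

# Hypothetical helper: remove HDFS output directory $2 if it exists,
# then run WordCount on input directory $1. Any further arguments
# (for example -D mapred.reduce.tasks=16) are placed before the paths.
rerun_wordcount() {
    src=$1
    dst=$2
    shift 2
    # The delete fails harmlessly when the directory does not exist yet.
    "$HADOOP_CMD" dfs -rmr "$dst" > /dev/null 2>&1 || true
    "$HADOOP_CMD" jar hadoop-0.20.2-examples.jar wordcount "$@" "$src" "$dst"
}
```

Run from the Hadoop directory: rerun_wordcount gutenberg gutenberg-output -D mapred.reduce.tasks=16. Note that this deletes the previous results without asking.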

Retrieve the job result from HDFS

To inspect the file, you can copy it from HDFS to the local file system. Alternatively, you can use the command

 hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat gutenberg-output/part-r-00000

to read the file directly from HDFS without copying it to the local file system. In this tutorial, we will copy the results to the local file system though.

 hadoop@ubuntu:/usr/local/hadoop$ mkdir /tmp/gutenberg-output
 hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -getmerge gutenberg-output /tmp/gutenberg-output
 hadoop@ubuntu:/usr/local/hadoop$ head /tmp/gutenberg-output/gutenberg-output
 "(Lo)cra"       1
 "1490   1
 "1498," 1
 "35"    1
 "40,"   1
 "A      2
 "AS-IS".        1
 "A_     1
 "Absoluti       1
 "Alack! 1
 hadoop@ubuntu:/usr/local/hadoop$ 

Note that in this specific output the quote signs (") enclosing the words in the head output above have not been inserted by Hadoop. They are the result of the word tokenizer used in the WordCount example, and in this case they matched the beginning of a quote in the ebook texts. Just inspect the part-r-00000 file further to see it for yourself.
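
The effect can be reproduced locally without Hadoop. The pipeline below is only an approximation of the example job, assuming that the tokenizer splits on whitespace and nothing else, as the note above implies; local_wordcount is a hypothetical helper name.

```shell
# Hypothetical helper: count whitespace-separated tokens read from
# standard input and print one "word<TAB>count" line per distinct token.
local_wordcount() {
    # Put every token on its own line, drop empty lines, sort so that equal
    # tokens are adjacent, count them, then swap the columns of uniq -c.
    tr -s '[:space:]' '\n' \
        | grep -v '^$' \
        | LC_ALL=C sort \
        | uniq -c \
        | awk '{ print $2 "\t" $1 }'
}
```

For example, cat /tmp/gutenberg/*.txt | local_wordcount | head gives a quick local view. Because punctuation is never stripped, a token such as "A (with its leading quote sign) is counted separately from A.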


References:

http://www.cppblog.com/thronds/archive/2008/11/17/67153.html

http://hi.baidu.com/pwcrab/blog/item/3cd63086fcd3733067096e95.html
