Category: LINUX

2011-10-19 14:11:30

I. Hadoop deployment environment:
CentOS 5.7
hadoop-0.20.205.0.tar.gz
jdk-6u22-linux-x64.bin

At least three machines are needed: one namenode and two datanodes.

II. Install Java and Hadoop on the namenode and on every datanode:

1. Check that ssh and rsync are installed:
rpm -qa|grep ssh
rpm -qa|grep rsync

2. Add the following entries to /etc/hosts:
192.168.7.69 hdp-namenode
192.168.7.67 hdp-datanode67
192.168.7.66 hdp-datanode66
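
Since every later step refers to these host names, it can help to confirm that each machine's hosts file really contains all three. A minimal sketch, assuming a helper named missing_hosts (hypothetical, not part of Hadoop) that prints every name with no entry in a hosts-format file:

```shell
#!/bin/bash
# Report which of the given host names have no entry in a hosts-format file.
# Usage: missing_hosts <hosts-file> <name>...
missing_hosts() {
    local hosts_file=$1; shift
    local name
    for name in "$@"; do
        # Strip comments, then look for the name as a whole field
        # after the address column.
        if ! sed 's/#.*//' "$hosts_file" | awk -v n="$name" '
                { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END { exit !found }'; then
            echo "$name"
        fi
    done
}
```

Running `missing_hosts /etc/hosts hdp-namenode hdp-datanode67 hdp-datanode66` on each machine should print nothing once the entries are in place.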

3. Install Java and Hadoop as root:
# mkdir -p /usr/java
# ./jdk-6u22-linux-x64.bin
# mv jdk1.6.0_22 /usr/java/

# vi /etc/profile
export JAVA_HOME=/usr/java/jdk1.6.0_22
export PATH=${JAVA_HOME}/bin:${PATH}

download:


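# mkdir /usr/local/hadoop
# tar zxvf hadoop-0.20.205.0.tar.gz -C /usr/local/hadoop --strip-components=1
(--strip-components=1 drops the top-level hadoop-0.20.205.0/ directory of the archive, so the files land directly in /usr/local/hadoop, which is the HADOOP_HOME used below.)

# source /etc/profile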

4. Create the hadoop user:
# useradd hadoop
# passwd hadoop

Add the following four lines to the hadoop user's .bashrc or .bash_profile:
# vi /home/hadoop/.bashrc    (or: vi /home/hadoop/.bash_profile)
export JAVA_HOME=/usr/java/jdk1.6.0_22
export PATH=${JAVA_HOME}/bin:${PATH}

export HADOOP_HOME=/usr/local/hadoop
export PATH=${HADOOP_HOME}/bin:${PATH}

# chown -R hadoop:hadoop /usr/local/hadoop/

# source /home/hadoop/.bashrc    (or: source /home/hadoop/.bash_profile)

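5. ssh must be installed and sshd must be running at all times, so that the Hadoop scripts can manage the Hadoop daemons on remote machines.

If you cannot ssh to localhost without entering a passphrase, run the following commands as the hadoop user:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys2
$ chmod 0600 ~/.ssh/authorized_keys2

Log in from every machine to every other machine once to accept the host keys; each machine must also be able to ssh to itself:
ssh hdp-namenode date
ssh hdp-datanode66 date
ssh hdp-datanode67 date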
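
Once the host keys have been accepted, the same check can be repeated without any prompt. A minimal sketch, assuming a helper named unreachable_hosts and an SSH_CMD override variable (both hypothetical names introduced here); it relies on the OpenSSH option BatchMode=yes, which makes ssh fail instead of asking for a password:

```shell
#!/bin/bash
# SSH_CMD can be overridden (for example for a dry run); it defaults to ssh.
SSH_CMD=${SSH_CMD:-ssh}

# Print every host that does not allow a non-interactive login.
unreachable_hosts() {
    local host
    for host in "$@"; do
        # BatchMode=yes: never prompt; ConnectTimeout=5: give up after 5 seconds.
        if ! "$SSH_CMD" -o BatchMode=yes -o ConnectTimeout=5 "$host" date \
                >/dev/null 2>&1 </dev/null; then
            echo "$host"
        fi
    done
}
```

Running `unreachable_hosts hdp-namenode hdp-datanode66 hdp-datanode67` as the hadoop user should print nothing when passwordless login works everywhere. A host whose key has not been accepted yet is also reported, because batch mode cannot answer the confirmation question.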

III. Configuration file changes:
1. hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.6.0_22

2. core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
<name>fs.default.name</name>
<value>hdfs://hdp-namenode:9990</value>
<description>The name of the default file system. Either the literal string "local" or a host:port for DFS.</description>
</property>

<property>
<name>hadoop.tmp.dir</name>
<value>/hadoop_data/tmp</value>
<description>A base for other temporary directories.</description>
</property>

</configuration>
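
3. hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
<name>dfs.name.dir</name>
<value>/hadoop_data/name</value>
</property>

<property>
<name>dfs.data.dir</name>
<value>/hadoop_data/data</value>
</property>

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.block.size</name>
<value>67108864</value>
<description>64M per block size</description>
</property>

<property>
<name>dfs.hosts.exclude</name>
<value>conf/nn-excluded-list</value>
</property>

</configuration>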
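
The dfs.block.size value is given in bytes, so 64 MB has to be written out as 64 * 1024 * 1024. A small sketch of the conversion, using a helper name (mb_to_bytes) that is only for illustration:

```shell
#!/bin/bash
# Convert a size in megabytes to bytes (1 MB = 1024 * 1024 bytes here).
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

mb_to_bytes 64     # prints 67108864
mb_to_bytes 128    # prints 134217728
```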



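4. mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
<name>mapred.job.tracker</name>
<value>hdp-namenode:9991</value>
</property>

<property>
<name>mapred.local.dir</name>
<value>/tmp/mapredlocaldir</value>
</property>

<property>
<name>mapred.job.tracker.handler.count</name>
<value>20</value>
</property>

<property>
<name>mapred.map.tasks</name>
<value>2</value>
</property>

<property>
<name>mapred.reduce.tasks</name>
<value>2</value>
</property>

<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>20</value>
</property>

<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>20</value>
</property>

<property>
<name>mapred.child.java.opts</name>
<value>-Xmx450m</value>
</property>

<property>
<name>mapred.reduce.parallel.copies</name>
<value>20</value>
</property>

</configuration>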
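
The slot limits and the child heap size together bound how much memory the tasks on one tasktracker can ask for, since each running task gets its own child JVM started with mapred.child.java.opts. A rough worked example with the values above; it ignores the daemons themselves and JVM overhead beyond the heap, so treat it as an estimate only:

```shell
#!/bin/bash
map_slots=20        # mapred.tasktracker.map.tasks.maximum
reduce_slots=20     # mapred.tasktracker.reduce.tasks.maximum
child_heap_mb=450   # -Xmx450m from mapred.child.java.opts

# Every slot can hold one running task, each in its own child JVM.
max_tasks=$(( map_slots + reduce_slots ))
max_heap_mb=$(( max_tasks * child_heap_mb ))

echo "concurrent tasks per tasktracker: $max_tasks"
echo "child JVM heap upper bound: ${max_heap_mb} MB"
```

With these settings the bound is 40 tasks and 18000 MB of heap per datanode, so the slot counts are worth lowering on machines with less memory than that.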


5.masters
hdp-namenode

6.slaves
hdp-datanode66
hdp-datanode67

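7. Sync the configuration files
rsync_conf.sh
#!/bin/bash
set -x

files="core-site.xml hdfs-site.xml mapred-site.xml hadoop-env.sh masters slaves"

hosts="192.168.7.67 192.168.7.66"

dir="/usr/local/hadoop/conf"

# cd first so that the bare file names in $files resolve inside $dir
cd "$dir" || exit 1

for host in $hosts
do
        # $files is left unquoted on purpose: it must split into one argument per file
        rsync $files "$host:$dir/"
done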
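
When a list of file names is kept in one space-separated variable such as files above, putting a directory in front of the variable only prefixes the first name. A short sketch of that expansion and of one way to build a full path per file (the two-file list and the array name paths are just for illustration):

```shell
#!/bin/bash
files="core-site.xml hdfs-site.xml"
dir="/usr/local/hadoop/conf"

# Prefixing the whole list only affects the first word:
printf '%s\n' $dir/$files
# /usr/local/hadoop/conf/core-site.xml
# hdfs-site.xml

# Building one path per file keeps every entry inside $dir:
paths=()
for f in $files; do
    paths+=("$dir/$f")
done
printf '%s\n' "${paths[@]}"
```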
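
IV. Start the services:
Watch out for port conflicts: port 9000 clashes with a PHP service that already listens on 9000.

Run the following.
Format a new distributed file system:
$ bin/hadoop namenode -format

Start the Hadoop daemons:
$ bin/start-all.sh

Stop the Hadoop daemons:
$ bin/stop-all.sh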
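
After start-all.sh, the JDK's jps tool lists the Java processes on a machine, which gives a quick way to see whether the daemons came up. A minimal sketch, assuming a helper named missing_daemons (hypothetical) that reads jps output on stdin and prints each expected daemon that is absent:

```shell
#!/bin/bash
# Read `jps` output on stdin and print each expected daemon that is not listed.
missing_daemons() {
    local listing name
    listing=$(cat)
    for name in "$@"; do
        # jps prints "<pid> <main class>", so compare the second field exactly.
        if ! printf '%s\n' "$listing" |
                awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
            echo "$name"
        fi
    done
}
```

With the layout in this post, `jps | missing_daemons NameNode SecondaryNameNode JobTracker` on hdp-namenode and `jps | missing_daemons DataNode TaskTracker` on each datanode should print nothing.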

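V. Adding machines to and removing machines from the Hadoop cluster:
Reference: http://www.cnblogs.com/gpcuster/archive/2011/04/12/2013411.html

To take the namenode out of safe mode manually:
# hadoop dfsadmin -safemode leave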
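
Removing a datanode ties back to dfs.hosts.exclude in hdfs-site.xml: the host is listed in the exclude file and the namenode is told to re-read it with `hadoop dfsadmin -refreshNodes`. A rough sketch of that step; the function name, the HADOOP_CMD override, and the absolute location of nn-excluded-list are assumptions made here, since the configured path is relative:

```shell
#!/bin/bash
# Both variables are assumptions for this sketch and can be overridden.
HADOOP_CMD=${HADOOP_CMD:-hadoop}
EXCLUDE_FILE=${EXCLUDE_FILE:-/usr/local/hadoop/conf/nn-excluded-list}

# Add a datanode to the exclude list (only once) and ask the namenode
# to re-read its host lists.
decommission_node() {
    local host=$1
    grep -qxF "$host" "$EXCLUDE_FILE" 2>/dev/null || echo "$host" >> "$EXCLUDE_FILE"
    "$HADOOP_CMD" dfsadmin -refreshNodes
}
```

For example, `decommission_node hdp-datanode66` run on the namenode; `hadoop dfsadmin -report` can then be used to watch the node's state.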

Testing Hadoop with Perl scripts (Hadoop streaming):
scp -r scripts 192.168.7.66:/usr/local/hadoop/
scp -r scripts 192.168.7.67:/usr/local/hadoop/
perl_test.sh
#!/bin/bash
set -x
# Make a pipeline fail when any stage fails, so a gunzip error is not masked
set -o pipefail

# Command that prints the timestamp used in log messages; run it as `$current`
current="date +%Y%m%d-%T"

file="impression2011101200.log.gz"
# Base name up to the first dot, e.g. impression2011101200
filename=`echo $file|awk -F '.' '{print $1}'`

workdir="/usr/local/hadoop/scripts"

# Remove any stale copy in HDFS, then stream the decompressed log into HDFS
hadoop dfs -test -e "logs/$filename.log" >/dev/null 2>&1 &&hadoop dfs -rm "logs/$filename.log"
gunzip -c "$workdir/$file" |hadoop dfs -put - "logs/$filename.log"

if [[ "$?" -ne "0" ]];then
    echo "`$current` put $filename.log to hdfs failed"
else
    rm "$workdir/$file"
    # The streaming job refuses to start if the output directory already exists
    hadoop dfs -test -d "results/$filename.log" && hadoop dfs -rmr "results/$filename.log"

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.205.0.jar \
                        -input "logs/$filename.log" \
                        -mapper "/usr/local/hadoop/scripts/format_impr_log.pl" \
                        -reducer "/usr/local/hadoop/scripts/reduse_impr.pl" \
                        -output "results/$filename.log"

    if [[ "$?" -eq "0" ]];then
        echo "`$current` hadoop $filename.log ok,success"
        hadoop dfs -rm "logs/$filename.log"
        # Merge all result parts into one gzip file in the work directory
        hadoop dfs -get "results/$filename.log/part*" -|gzip >"$workdir/$filename.thin.log.gz"
        hadoop dfs -rmr "results/$filename.log"
        #scp "$thinlogdir/$month/$filename.thin.log.gz" $ip:/usr/local/impression_thin_data/$month/&&rm "$thinlogdir/$month/$filename.thin.log.gz"
        echo "`$current` process $workdir/$filename.log ok"
    else
        echo "`$current` hadoop $filename.log failed"
    fi
fi
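
Before submitting a streaming job to the cluster, the mapper and reducer can be tried on a small sample locally, because streaming scripts only read stdin and write stdout. A minimal sketch, assuming a helper named local_streaming (hypothetical); it sorts whole lines, which only approximates Hadoop's sort by key:

```shell
#!/bin/bash
# Chain a mapper and a reducer the way a streaming job does:
# stdin -> mapper -> sort -> reducer -> stdout.
local_streaming() {
    local mapper=$1 reducer=$2
    # LC_ALL=C gives a plain byte-order sort, independent of the locale.
    "$mapper" | LC_ALL=C sort | "$reducer"
}
```

For example, from the scripts directory: `gunzip -c impression2011101200.log.gz | head -n 1000 | local_streaming ./format_impr_log.pl ./reduse_impr.pl`.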

HTTP server properties
Property name                       Default value    Description
mapred.job.tracker.http.address     0.0.0.0:50030    The jobtracker’s HTTP server address and port.
mapred.task.tracker.http.address    0.0.0.0:50060    The tasktracker’s HTTP server address and port.
dfs.http.address                    0.0.0.0:50070    The namenode’s HTTP server address and port.
dfs.datanode.http.address           0.0.0.0:50075    The datanode’s HTTP server address and port.
dfs.secondary.http.address          0.0.0.0:50090    The secondary namenode’s HTTP server address and port.
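
Combined with the host names used in this post, the default ports give the web UI address of every daemon. A small sketch that prints them, assuming the defaults above are unchanged and that the jobtracker and secondary namenode run on hdp-namenode as configured in mapred-site.xml and masters (the function name web_ui_urls is only for illustration):

```shell
#!/bin/bash
master="hdp-namenode"
workers="hdp-datanode66 hdp-datanode67"

# Print one "daemon  URL" line per web UI in the cluster.
web_ui_urls() {
    echo "namenode            http://$master:50070/"
    echo "jobtracker          http://$master:50030/"
    echo "secondary namenode  http://$master:50090/"
    local w
    for w in $workers; do
        echo "datanode            http://$w:50075/"
        echo "tasktracker         http://$w:50060/"
    done
}

web_ui_urls
```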