I. Hadoop deployment environment:
centos 5.7
hadoop-0.20.205.0.tar.gz
jdk-6u22-linux-x64.bin
At least 3 machines: one namenode and two datanodes.
II. Install Java and Hadoop on the namenode and on every datanode:
1. Make sure ssh and rsync are installed:
rpm -qa|grep ssh
rpm -qa|grep rsync
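If either package is missing, it can be installed from the stock CentOS repositories; a minimal sketch (run as root, sshd should also start on boot):
#yum install -y openssh-server openssh-clients rsync
#service sshd start
#chkconfig sshd on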
2. Add the following entries to /etc/hosts on every node:
192.168.7.69 hdp-namenode
192.168.7.67 hdp-datanode67
192.168.7.66 hdp-datanode66
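One way to push these entries out is to append them with a here-document on each node; a small sketch (run as root, skip entries that already exist):
#cat >> /etc/hosts <<'EOF'
192.168.7.69 hdp-namenode
192.168.7.67 hdp-datanode67
192.168.7.66 hdp-datanode66
EOF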
3. Install Java and Hadoop as root:
#./jdk-6u22-linux-x64.bin
#mkdir -p /usr/java && mv jdk1.6.0_22 /usr/java/
#vi /etc/profile
export JAVA_HOME=/usr/java/jdk1.6.0_22
export PATH=${JAVA_HOME}/bin:${PATH}
Download hadoop-0.20.205.0.tar.gz and unpack it so the tree ends up directly in /usr/local/hadoop (not in a versioned subdirectory, since every later path assumes /usr/local/hadoop/bin and /usr/local/hadoop/conf):
#tar zxvf hadoop-0.20.205.0.tar.gz -C /usr/local/
#mv /usr/local/hadoop-0.20.205.0 /usr/local/hadoop
#source /etc/profile
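A quick sanity check that the JDK and the Hadoop tree are where the profile expects them (a sketch):
#java -version                        # should report java version "1.6.0_22"
#ls /usr/local/hadoop/bin/hadoop      # the hadoop launcher must exist at this path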
4. Create the hadoop user:
#useradd hadoop
#passwd hadoop
Add the following four lines to the hadoop user's .bashrc or .bash_profile:
#vi /home/hadoop/.bashrc (or vi /home/hadoop/.bash_profile)
export JAVA_HOME=/usr/java/jdk1.6.0_22
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_HOME=/usr/local/hadoop
export PATH=${HADOOP_HOME}/bin:${PATH}
#chown hadoop.hadoop /usr/local/hadoop/ -R
#source /home/hadoop/.bashrc (or source /home/hadoop/.bash_profile)
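To confirm the hadoop user actually picks up these settings, a minimal check (a sketch):
#su - hadoop
$ echo $JAVA_HOME      # expect /usr/java/jdk1.6.0_22
$ which hadoop         # expect /usr/local/hadoop/bin/hadoop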
5. ssh must be installed and sshd must be kept running so that the Hadoop scripts can manage the remote Hadoop daemons.
As the hadoop user, generate a passphrase-less key; if you cannot ssh to localhost without entering a password, also authorize the key locally:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys2
$ chmod 0600 ~/.ssh/authorized_keys2
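The mutual logins below only work without a password once each node's public key is present in the other nodes' ~/.ssh/authorized_keys2. A minimal distribution sketch, run as the hadoop user on the namenode (hostnames taken from the /etc/hosts entries above; repeat from each datanode if you also want password-less access in the other direction):
for host in hdp-datanode66 hdp-datanode67; do
    cat ~/.ssh/id_dsa.pub | \
        ssh $host 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys2 && chmod 0600 ~/.ssh/authorized_keys2'
done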
Log the hosts into one another so the host keys are accepted; each host must also be able to ssh to itself:
ssh hdp-namenode date
ssh hdp-datanode66 date
ssh hdp-datanode67 date
III. Edit the configuration files:
1.hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.6.0_22
2.core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hdp-namenode:9990</value>
    <description>The name of the default file system. Either the literal string "local" or a host:port for DFS.</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hadoop_data/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>
3.hdfs-site.xml
<configuration>
  <property><name>dfs.name.dir</name><value>/hadoop_data/name</value></property>
  <property><name>dfs.data.dir</name><value>/hadoop_data/data</value></property>
  <property><name>dfs.replication</name><value>1</value></property>
  <property><name>dfs.block.size</name><value>67108864</value><description>64MB per block</description></property>
  <property><name>dfs.hosts.exclude</name><value>conf/nn-excluded-list</value></property>
</configuration>
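The local directories referenced above do not create themselves; a small preparation sketch to run on every node before formatting HDFS (paths taken from core-site.xml and hdfs-site.xml, ownership assumed to be the hadoop user):
#mkdir -p /hadoop_data/tmp /hadoop_data/name /hadoop_data/data
#chown -R hadoop:hadoop /hadoop_data
#touch /usr/local/hadoop/conf/nn-excluded-list   # the file named by dfs.hosts.exclude can simply stay empty for now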
4.mapred-site.xml
<configuration>
  <property><name>mapred.job.tracker</name><value>hdp-namenode:9991</value></property>
  <property><name>mapred.local.dir</name><value>/tmp/mapredlocaldir</value></property>
  <property><name>mapred.job.tracker.handler.count</name><value>20</value></property>
  <property><name>mapred.map.tasks</name><value>2</value></property>
  <property><name>mapred.reduce.tasks</name><value>2</value></property>
  <property><name>mapred.tasktracker.map.tasks.maximum</name><value>20</value></property>
  <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>20</value></property>
  <property><name>mapred.child.java.opts</name><value>-Xmx450m</value></property>
  <property><name>mapred.reduce.parallel.copies</name><value>20</value></property>
</configuration>
5.masters
hdp-namenode
6.slaves
hdp-datanode66
hdp-datanode67
7. Sync the configuration files to the datanodes:
rsync_conf.sh
#!/bin/bash
set -x

files="core-site.xml hdfs-site.xml mapred-site.xml hadoop-env.sh masters slaves"
hosts="192.168.7.67 192.168.7.66"
dir="/usr/local/hadoop/conf"

# push each configuration file to every datanode
for host in $hosts
do
    for file in $files
    do
        rsync "$dir/$file" "$host:$dir/"
    done
done
IV. Start the services:
Note: HDFS's usual default port 9000 clashes with php-fpm, which also listens on 9000; that is why fs.default.name above uses port 9990.
As the hadoop user, from /usr/local/hadoop run:
Format a new distributed filesystem:
$ bin/hadoop namenode -format
Start the Hadoop daemons:
$ bin/start-all.sh
Stop the Hadoop daemons:
$ bin/stop-all.sh
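After bin/start-all.sh, a quick way to confirm the daemons are up (a sketch; run jps on every node and ask the namenode for a cluster report):
$ jps                            # namenode: NameNode, SecondaryNameNode, JobTracker; datanodes: DataNode, TaskTracker
$ bin/hadoop dfsadmin -report    # should list both datanodes as live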
V. Adding machines to and removing machines from the Hadoop cluster:
Reference: http://www.cnblogs.com/gpcuster/archive/2011/04/12/2013411.html
#hadoop dfsadmin -safemode leave
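For removing a datanode, the dfs.hosts.exclude file configured above is the usual mechanism; a minimal decommissioning sketch run on the namenode as the hadoop user (the hostname is just an example):
$ echo "hdp-datanode67" >> /usr/local/hadoop/conf/nn-excluded-list
$ bin/hadoop dfsadmin -refreshNodes    # the namenode starts re-replicating that node's blocks
$ bin/hadoop dfsadmin -report          # wait until the node is reported as decommissioned, then stop its daemons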
Perl Hadoop streaming test:
scp -r scripts 192.168.7.66:/usr/local/hadoop/
scp -r scripts 192.168.7.67:/usr/local/hadoop/
perl_test.sh
#!/bin/bash
set -x

# `$current` is expanded and run inside backticks below to timestamp the log messages
current="date +%Y%m%d-%T"

file="impression2011101200.log.gz"
filename=`echo $file|awk -F '.' '{print $1}'`

workdir="/usr/local/hadoop/scripts"

# remove any stale copy of the log in HDFS, then upload the uncompressed log
hadoop dfs -test -e "logs/$filename.log" >/dev/null 2>&1 && hadoop dfs -rm "logs/$filename.log"
gunzip -c "$workdir/$file" | hadoop dfs -put - "logs/$filename.log"

if [[ "$?" -ne "0" ]];then
    echo "`$current` put $filename.log to hdfs failed"
else
    rm "$workdir/$file"
    # drop any previous output directory, then run the streaming job
    hadoop dfs -test -d "results/$filename.log" && hadoop dfs -rmr "results/$filename.log"
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.205.0.jar \
        -input "logs/$filename.log" \
        -mapper "/usr/local/hadoop/scripts/format_impr_log.pl" \
        -reducer "/usr/local/hadoop/scripts/reduse_impr.pl" \
        -output "results/$filename.log"

    if [[ "$?" -eq "0" ]];then
        echo "`$current` hadoop job for $filename.log succeeded"
        hadoop dfs -rm "logs/$filename.log"
        # pull the results back, recompress them locally, then clean up HDFS
        hadoop dfs -get "results/$filename.log/part*" -|gzip >"$workdir/$filename.thin.log.gz"
        hadoop dfs -rmr "results/$filename.log"
        #scp "$thinlogdir/$month/$filename.thin.log.gz" $ip:/usr/local/impression_thin_data/$month/&&rm "$thinlogdir/$month/$filename.thin.log.gz"
        echo "`$current` process $workdir/$filename.log ok"
    else
        echo "`$current` hadoop job for $filename.log failed"
    fi
fi
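Before wiring in the Perl mapper and reducer, the streaming jar itself can be smoke-tested with stock system tools; a throw-away sketch run as the hadoop user (uses /etc/hosts as input and assumes results/test does not exist yet):
hadoop dfs -put /etc/hosts logs/test.log
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.205.0.jar \
    -input logs/test.log \
    -mapper /bin/cat \
    -reducer /usr/bin/wc \
    -output results/test
hadoop dfs -cat results/test/part-*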
HTTP server properties
Property name Default value Description
mapred.job.tracker.http.address 0.0.0.0:50030 The jobtracker’s HTTP server address and port.
mapred.task.tracker.http.address 0.0.0.0:50060 The tasktracker’s HTTP server address and port.
dfs.http.address 0.0.0.0:50070 The namenode’s HTTP server address and port.
dfs.datanode.http.address 0.0.0.0:50075 The datanode’s HTTP server address and port.
dfs.secondary.http.address 0.0.0.0:50090 The secondary namenode’s HTTP server address and port.
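These pages also give a quick health check from the shell, for example (a sketch using the hostnames above; the exact page names assume the stock 0.20 web UI):
curl -s http://hdp-namenode:50070/dfshealth.jsp | grep -i "live"
curl -s -o /dev/null http://hdp-namenode:50030/jobtracker.jsp && echo "jobtracker web UI is up"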