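With Hadoop and HBase already deployed on a single machine, it is finally time to deploy them on a cluster. Having gone through the whole configuration, I find the distributed setup is much like the single-machine one, except that some parameters in the configuration files are changed to suit distributed operation. Of course this is only a simple deployment; running it as a real system, where performance, stability and other factors matter, would not be this simple. Today's goal is just to get a distributed Hadoop/HBase running, without worrying about performance tuning. The cluster is also small, initially one master and two slaves: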
master: 30.0.0.69
node1: 30.0.0.161
node2: 30.0.0.162
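The overall configuration falls into four steps: prerequisites for distributed operation, the Hadoop configuration files, the HBase configuration files, and copying the installation to the other nodes. Before starting, make sure every node has a hadoop user (any other user name works as long as it is the same on all nodes). You can create it as the default user when installing the system, or add it afterwards with adduser. The user must be able to run sudo, so it is best to add it to the admin group (the %admin entry in /etc/sudoers) and, if necessary, edit /etc/sudoers. The details are described at:
http://blog.chinaunix.net/uid-26275986-id-3940725.html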
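I. Prerequisites for distributed operation
By distributed prerequisites I mean the settings that let the cluster nodes talk to each other: the network interface configuration, the hosts file, and passwordless ssh login.
1. Network interface configuration
All my machines run Ubuntu-12.04-desktop. On Ubuntu the network interface can be configured with the ifconfig command or with the network tool in the top-right corner of the desktop, but I prefer editing the configuration file (/etc/network/interfaces):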
auto lo
iface lo inet loopback

#auto eth0
#iface eth0 inet dhcp

auto eth0
iface eth0 inet static

address 30.0.0.69
netmask 255.255.255.0
gateway 30.0.0.254
#dns-nameservers DNS-IP
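The above is the interface configuration file. Two points need attention. First, after editing this file you must run sudo /etc/init.d/networking restart for the change to take effect immediately; otherwise the new settings only apply after a reboot. Second, on Ubuntu you cannot set DNS by editing /etc/resolv.conf directly, because the file is regenerated on every reboot. To set DNS, either add the last line shown above to the interface configuration (this setup needs no DNS, so it is commented out) or edit /etc/resolvconf/resolv.conf.d/base, then restart the networking service.
2. hosts configuration
The /etc/hosts file maps each node's hostname to its IP address. It has to be set up on every node so that the cluster can communicate correctly later: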
127.0.0.1 localhost
#127.0.0.1 hadoop
#127.0.0.1 master

30.0.0.69 master
30.0.0.161 node1
30.0.0.162 node2
# The following lines are desirable for IPv6 capable hosts
#::1 ip6-localhost ip6-loopback
#fe00::0 ip6-localnet
#ff00::0 ip6-mcastprefix
#ff02::1 ip6-allnodes
#ff02::2 ip6-allrouters
I do not use IPv6, so the corresponding addresses are commented out. Once the file is ready it can be copied to the other nodes.
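As a small sketch of keeping the file identical across nodes, the cluster entries can be generated from a single list. The helper name render_hosts and the NODES variable are illustrative only and not part of any tool; the addresses are the ones listed at the top of this article.

```shell
#!/bin/sh
# Cluster members as "IP hostname" pairs, taken from the node list above.
NODES="30.0.0.69 master
30.0.0.161 node1
30.0.0.162 node2"

# render_hosts (hypothetical helper): print the /etc/hosts lines for the cluster.
render_hosts() {
    printf '127.0.0.1 localhost\n'
    printf '%s\n' "$NODES"
}

render_hosts
# To install the result, review the output first and then run something like:
#   render_hosts | sudo tee /etc/hosts
```

Printing first and installing by hand keeps the sketch harmless; overwriting /etc/hosts automatically would also drop any local entries a node may need.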
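3. Passwordless ssh login
Hadoop uses ssh between nodes, so passwordless login is set up to avoid repeated password prompts. It is simple: the node being logged in to only needs to hold the public key of the user on the node that logs in.
Step 1: ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa. This creates the private and public key files in ~/.ssh. (Newer OpenSSH releases disable DSA keys by default; -t rsa works the same way.)
Step 2: import the public key into the target machine's authorization file: cat id_dsa.pub >> authorized_keys
Here that means importing master's public key into ~/.ssh/authorized_keys on node1 and node2. I ran into a problem while testing: node2 could ssh to the other nodes, but they could not ssh back. After some digging I found that the ssh versions differed between nodes; node2 was using the version shipped with the system, which probably lacked the server component and needed to be updated. To keep things simple, I recommend running sudo apt-get install ssh on every node at this step.
Once these steps are done, test ssh between the nodes. The first login still asks for a password, but from the second one on, login is passwordless.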
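The key import in step 2 can also be done from master with ssh-copy-id, which appends the key to the remote authorized_keys file. The sketch below is a dry run by default; push_key and run are hypothetical helper names, and the hadoop user and id_dsa.pub path are assumed to match the setup above.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# push_key (hypothetical helper): append master's public key to each node's
# ~/.ssh/authorized_keys. ssh-copy-id asks for the password once per node.
push_key() {
    for host in "$@"; do
        run ssh-copy-id -i "$HOME/.ssh/id_dsa.pub" "hadoop@$host"
    done
}

push_key node1 node2
```

Afterwards, ssh hadoop@node1 from master should log in without a password prompt.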
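II. Hadoop distributed configuration files
The distributed setup has the same prerequisites as single-machine Hadoop, so the JDK is required. Keep the Hadoop user and the Hadoop/HBase installation directories identical on all nodes; this saves a lot of trouble when copying later. Then download the Hadoop package, unpack it, and edit the configuration files:
1. Set the JDK path in hadoop-env.sh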
# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use. Required.
export JAVA_HOME=/home/hadoop/platform/jdk1.6.0_35

# Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH=

# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=1000

# Extra Java runtime options. Empty by default.
# export HADOOP_OPTS=-server

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"
# export HADOOP_TASKTRACKER_OPTS=
# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
#export HADOOP_CLIENT_OPTS="-Xmx1024m $HADOOP_CLIENT_OPTS"

# Extra ssh options. Empty by default.
# export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HADOOP_CONF_DIR"

# Where log files are stored. $HADOOP_HOME/logs by default.
# export HADOOP_LOG_DIR=${HADOOP_HOME}/logs

# File naming remote slave hosts. $HADOOP_HOME/conf/slaves by default.
# export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves

# host:path where hadoop code should be rsync'd from. Unset by default.
# export HADOOP_MASTER=master:/home/$USER/src/hadoop

# Seconds to sleep between slave commands. Unset by default. This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
# export HADOOP_SLAVE_SLEEP=0.1

# The directory where pid files are stored. /tmp by default.
# export HADOOP_PID_DIR=/var/hadoop/pids

# A string representing this instance of hadoop. $USER by default.
# export HADOOP_IDENT_STRING=$USER

# The scheduling priority for daemon processes. See 'man nice'.
# export HADOOP_NICENESS=10
2. Configure the namenode in core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hdfs/tmp</value>
  </property>

</configuration>
3. Configure HDFS in hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/hdfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/hdfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>32768</value>
  </property>
  <property>
    <!-- Set this no higher than the number of datanodes; each datanode keeps only one replica of a block -->
    <name>dfs.replication</name>
    <value>2</value>
    <final>true</final>
  </property>

</configuration>
Note that the final tag means the setting cannot be dynamically modified or overridden later at run time.
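Following the comment on dfs.replication above, a quick check can compare the configured value with the number of hosts in the slaves file. This is only a sketch: get_prop and check_replication are hypothetical helpers, and get_prop assumes the one-tag-per-line layout of the listing above rather than parsing XML properly.

```shell
#!/bin/sh
# get_prop FILE NAME (hypothetical helper): print the <value> of a Hadoop property.
# Assumes <name> and <value> sit on separate lines, as in the listing above.
get_prop() {
    awk -v name="$2" '
        index($0, "<name>" name "</name>") { found = 1; next }
        found && /<value>/ {
            sub(/.*<value>/, ""); sub(/<\/value>.*/, ""); print; exit
        }
    ' "$1"
}

# check_replication HDFS_SITE SLAVES: compare dfs.replication with the
# number of non-blank lines (DataNode hosts) in the slaves file.
check_replication() {
    repl="$(get_prop "$1" dfs.replication)"
    if [ -z "$repl" ]; then
        echo "dfs.replication not found in $1"
        return 2
    fi
    nodes="$(grep -c '[^[:space:]]' "$2")"
    if [ "$repl" -gt "$nodes" ]; then
        echo "dfs.replication=$repl exceeds $nodes DataNodes"
        return 1
    fi
    echo "dfs.replication=$repl is fine for $nodes DataNodes"
}

# Example (paths are assumptions; adjust them to your install):
#   check_replication "$HADOOP_HOME/conf/hdfs-site.xml" "$HADOOP_HOME/conf/slaves"
```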
4. Configure MapReduce in mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>30.0.0.69:9001</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <!-- Size this to the machine; mine has only 1 GB of memory -->
    <value>-Xmx800m</value>
    <final>true</final>
  </property>

</configuration>
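5. Fill in the masters and slaves files to match your cluster: put the hostname of the master role (master) in the masters file and the hostnames of the datanode role (node1, node2) in the slaves file. (In Hadoop 1.x the masters file actually determines where the SecondaryNameNode runs.)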
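For reference, the two files contain nothing but hostnames, one per line. A minimal sketch that writes them, with write_cluster_files as a hypothetical helper name and the hostnames taken from this article:

```shell
#!/bin/sh
# write_cluster_files CONF_DIR (hypothetical helper): create the masters and
# slaves files with the hostnames used in this article.
write_cluster_files() {
    conf_dir="$1"
    printf '%s\n' master > "$conf_dir/masters"
    printf '%s\n' node1 node2 > "$conf_dir/slaves"
}

# Example (the conf path is an assumption based on the layout used in this article):
#   write_cluster_files /home/hadoop/platform/hadoop-1.0.3/conf
```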
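6. Do all of the above on master, then copy the configured hadoop directory directly to every other node. Make sure the Hadoop user and the Hadoop location are the same on all nodes, and do not forget to set up the hosts file on each machine;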
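The copy in this step could be scripted with rsync over ssh, roughly as below. It is a dry run by default; sync_dir and run are hypothetical helper names, and the installation path is assumed from the HADOOP_HOME value used later in this article.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# sync_dir DIR HOST... (hypothetical helper): copy DIR to the same path on each host.
sync_dir() {
    dir="$1"
    shift
    for host in "$@"; do
        run rsync -a "$dir/" "hadoop@$host:$dir/"
    done
}

sync_dir /home/hadoop/platform/hadoop-1.0.3 node1 node2
```

Because the path is the same on both sides, the same helper would also serve for the HBase directory.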
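7. Run a test:
Start Hadoop on master, then use the jps command to check the running processes. TaskTracker and DataNode should both be running on node1 and node2.
The HDFS status can then be checked through the web interface (web->), where Live Nodes shows 2.
To see the MapReduce job status, use web->
I was puzzled that the Nodes field there showed 0, even though the TaskTrackers had clearly started on the slave nodes. Searching online suggests this is fairly common, and the usual fixes are:
1 - Leave the namenode's safe mode: hadoop dfsadmin -safemode leave;
2 - Reformat HDFS: delete the hdfs directories on each node (the name and data directories set in hdfs-site.xml and the tmp directory set in core-site.xml), then format again;
3 - Check the firewall: sudo ufw status
None of these helped. Then it occurred to me that perhaps it was simply because no MapReduce job was running, so I quickly ran a Hadoop test job.
Sure enough, this time the page showed two nodes. My understanding had been wrong: this field only reflects MapReduce jobs that are currently running.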
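The start-up and checks in this step might look roughly like the sketch below, printed as a dry run by default. The article does not say which test job was run; the bundled pi example and the hadoop-examples-1.0.3.jar file name are my assumptions, as are the helper names run and smoke_test and the default HADOOP_HOME.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Assumed install location, matching the paths used in this article.
HADOOP_HOME="${HADOOP_HOME:-/home/hadoop/platform/hadoop-1.0.3}"

# smoke_test (hypothetical helper): start the daemons and run basic checks on master.
smoke_test() {
    # Start the HDFS and MapReduce daemons on all nodes.
    run "$HADOOP_HOME/bin/start-all.sh"
    # List the Java daemons running on this host.
    run jps
    # Show the DataNodes the NameNode currently knows about.
    run "$HADOOP_HOME/bin/hadoop" dfsadmin -report
    # Submit a small job so the JobTracker page has something running (assumed example).
    run "$HADOOP_HOME/bin/hadoop" jar "$HADOOP_HOME/hadoop-examples-1.0.3.jar" pi 2 10
}

smoke_test
```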
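At this point the distributed Hadoop configuration is basically done; next comes HBase.
III. HBase configuration
Configuring HBase follows the same idea as Hadoop: set up the files on the HMaster first, then, provided the user and paths are identical, simply copy the HBase installation directory to the other nodes:
1. Configure the HBase runtime environment in hbase-env.sh: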
#
#/**
# * Copyright 2007 The Apache Software Foundation
# *
# * Licensed to the Apache Software Foundation (ASF) under one
# * or more contributor license agreements. See the NOTICE file
# * distributed with this work for additional information
# * regarding copyright ownership. The ASF licenses this file
# * to you under the Apache License, Version 2.0 (the
# * "License"); you may not use this file except in compliance
# * with the License. You may obtain a copy of the License at
# *
# * http://
# *
# * Unless required by applicable law or agreed to in writing, software
# * distributed under the License is distributed on an "AS IS" BASIS,
# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# * See the License for the specific language governing permissions and
# * limitations under the License.
# */

# Set environment variables here.

# The java implementation to use. Java 1.6 required.
export JAVA_HOME=/home/hadoop/platform/jdk1.6.0_35
export HBASE_HOME=/home/hadoop/platform/hbase-0.90.0
export HADOOP_HOME=/home/hadoop/platform/hadoop-1.0.3
# Extra Java CLASSPATH elements. Optional.
# export HBASE_CLASSPATH=

# The maximum amount of heap to use, in MB. Default is 1000.
#export HBASE_HEAPSIZE=1000

# Extra Java runtime options.
# Below are what we set by default. May only work with SUN JVM.
# For more on why as well as other possible settings,
# see http://wiki.apache.org/hadoop/PerformanceTuning
export HBASE_OPTS="$HBASE_OPTS -ea -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"

# Uncomment below to enable java garbage collection logging.
# export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log"

# Uncomment and adjust to enable JMX exporting
# See jmxremote.password and jmxremote.access in $JRE_HOME/lib/management to configure remote password access.
# More details at: http://java.sun.com/javase/6/docs/technotes/guides/management/agent.html
#
# export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
# export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10101"
# export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10102"
# export HBASE_THRIFT_OPTS="$HBASE_THRIFT_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10103"
# export HBASE_ZOOKEEPER_OPTS="$HBASE_ZOOKEEPER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10104"

# File naming hosts on which HRegionServers will run. $HBASE_HOME/conf/regionservers by default.
# export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers

# Extra ssh options. Empty by default.
# export HBASE_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HBASE_CONF_DIR"

# Where log files are stored. $HBASE_HOME/logs by default.
export HBASE_LOG_DIR=${HBASE_HOME}/logs

# A string representing this instance of hbase. $USER by default.
# export HBASE_IDENT_STRING=$USER

# The scheduling priority for daemon processes. See 'man nice'.
# export HBASE_NICENESS=10

# The directory where pid files are stored. /tmp by default.
# export HBASE_PID_DIR=/var/hadoop/pids

# Seconds to sleep between slave commands. Unset by default. This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
# export HBASE_SLAVE_SLEEP=0.1

# Tell HBase whether it should manage it's own instance of Zookeeper or not.
export HBASE_MANAGES_ZK=true
2. Point HBase at HDFS in hbase-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
/**
 * Copyright 2010 The Apache Software Foundation
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
-->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.master</name>
    <value>30.0.0.69:60000</value>
  </property>

  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>30.0.0.161,30.0.0.162</value>
  </property>
  <property>
    <name>zookeeper.znode.parent</name>
    <!-- default location -->
    <value>/hbase</value>
  </property>

</configuration>
The main purpose here is to set HBase's root location on HDFS and the location of the HMaster.
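In the listings of this article, hbase.rootdir starts with the same hdfs://master:9000 address that core-site.xml sets in fs.default.name. A rough consistency check is sketched below; get_prop and check_rootdir are hypothetical helpers, and get_prop assumes one tag per line rather than doing real XML parsing.

```shell
#!/bin/sh
# get_prop FILE NAME (hypothetical helper): print the <value> of a property.
# Assumes <name> and <value> sit on separate lines, as in the listings above.
get_prop() {
    awk -v name="$2" '
        index($0, "<name>" name "</name>") { found = 1; next }
        found && /<value>/ {
            sub(/.*<value>/, ""); sub(/<\/value>.*/, ""); print; exit
        }
    ' "$1"
}

# check_rootdir CORE_SITE HBASE_SITE: verify hbase.rootdir lives under fs.default.name.
check_rootdir() {
    fs="$(get_prop "$1" fs.default.name)"
    root="$(get_prop "$2" hbase.rootdir)"
    if [ -z "$fs" ] || [ -z "$root" ]; then
        echo "could not read fs.default.name or hbase.rootdir"
        return 2
    fi
    case "$root" in
        "$fs"/*) echo "ok: $root is on $fs" ;;
        *) echo "mismatch: hbase.rootdir=$root but fs.default.name=$fs"; return 1 ;;
    esac
}

# Example (paths are assumptions; adjust them to your install):
#   check_rootdir "$HADOOP_HOME/conf/core-site.xml" "$HBASE_HOME/conf/hbase-site.xml"
```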
3. Copy the hbase-default.xml file from src/main/resources/ under the HBase directory and edit two entries: hbase.rootdir, fixed to the HDFS directory, and hbase.cluster.distributed:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
/**
 * Copyright 2009 The Apache Software Foundation
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
-->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master:9000/hbase</value>
    <description>The directory shared by region servers and into
    which HBase persists. The URL should be 'fully-qualified'
    to include the filesystem scheme. For example, to specify the
    HDFS directory '/hbase' where the HDFS instance's namenode is
    running at namenode.example.org on port 9000, set this value to:
    hdfs://namenode.example.org:9000/hbase. By default HBase writes
    into /tmp. Change this configuration else all data will be lost
    on machine restart.
    </description>
  </property>
  <property>
    <name>hbase.master.port</name>
    <value>60000</value>
    <description>The port the HBase Master should bind to.</description>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
    false for standalone mode and true for distributed mode. If
    false, startup will run all HBase and ZooKeeper daemons together
    in the one JVM.
    </description>
  </property>
  <property>
    <name>hbase.tmp.dir</name>
    <value>/tmp/hbase-${user.name}</value>
    <description>Temporary directory on the local filesystem.
    Change this setting to point to a location more permanent
    than '/tmp' (The '/tmp' directory is often cleared on
    machine restart).
    </description>
  </property>
  <property>
(remainder of the file omitted)
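4. Like Hadoop's masters and slaves files, the HMaster and HRegionServer host lists also need to be edited on each node; just add the node names;
5. Copy the hbase directory to the same location on every other node. The HBase configuration is now basically complete;
6. Test by running start-hbase.sh.
Create a table with the hbase shell command.
View the HMaster through the web page.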
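A possible shape for this test is sketched below, again as a dry run by default. The table name test, the column family cf and the row contents are placeholders of mine, not the ones used in the original run; run and hbase_smoke_test are hypothetical helper names, and the default HBASE_HOME is taken from hbase-env.sh above.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints what would be done; set DRY_RUN=0 to run it.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Assumed install location, matching hbase-env.sh above.
HBASE_HOME="${HBASE_HOME:-/home/hadoop/platform/hbase-0.90.0}"

# hbase_smoke_test (hypothetical helper): start HBase, then create and scan a table.
hbase_smoke_test() {
    run "$HBASE_HOME/bin/start-hbase.sh"
    # HBase shell commands fed on stdin; table and column family names are placeholders.
    cmds="create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'"
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$cmds"
    else
        printf '%s\n' "$cmds" | "$HBASE_HOME/bin/hbase" shell
    fi
}

hbase_smoke_test
```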
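PS:
When configuring distributed Hadoop/HBase, be sure to get the IPs of the Master and HMaster right and make no mistakes in the configuration files; otherwise the services will not start properly. The overall approach is:
1. On all nodes, configure hosts and ssh and install the JDK (which provides jps); create the same hadoop user and the same file paths;
2. Configure hadoop/hbase on the Master/HMaster;
3. Copy the hadoop/hbase directories directly to each node and test that they work;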