
Category: Cloud Computing

2011-10-28 14:24:20

A walkthrough of installing and using hadoop-lzo:

1. Install the lzo shared library on all nodes.
Download:
download/lzo-2.06.tar.gz
Build and install:
./configure --enable-shared && make && make install
Symlink the .so files so the runtime linker can find them:
ln -s /usr/local/lib/liblzo2.* /usr/lib64/
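Instead of symlinking into /usr/lib64, an alternative sketch (assuming a standard /etc/ld.so.conf.d layout; the file name lzo.conf is arbitrary) is to register /usr/local/lib with the dynamic linker:

```shell
# Add /usr/local/lib to the linker search path and rebuild the cache (as root)
echo '/usr/local/lib' > /etc/ld.so.conf.d/lzo.conf
ldconfig
ldconfig -p | grep lzo2    # liblzo2.so should now be listed
```

Either approach works; this one survives library upgrades without re-creating the links.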

2. Install lzop
lzop is the Linux command-line tool for compressing and decompressing .lzo files.
Download:
download/lzop-1.03.tar.gz
./configure
make
make install
Or simply: yum install lzop
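A quick round-trip check that the install worked (a sketch; the flags mirror gzip's):

```shell
# Compress, decompress, and verify a sample file with lzop
command -v lzop >/dev/null 2>&1 || exit 0   # skip if lzop is not installed yet
tmp=$(mktemp -d)
printf 'hello lzo\n' > "$tmp/sample.log"
lzop -k "$tmp/sample.log"          # -k keeps the original; writes sample.log.lzo
lzop -dc "$tmp/sample.log.lzo"     # -d decompress, -c write to stdout
lzop -t "$tmp/sample.log.lzo"      # -t verifies archive integrity
rm -rf "$tmp"
```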

3. Install apache-ant
tar -xvzf apache-ant-1.8.2-bin.tar.gz
mv apache-ant-1.8.2 /usr/local/

Add to /etc/profile:
export ANT_HOME=/usr/local/apache-ant-1.8.2
export PATH=$PATH:$ANT_HOME/bin
Then: source /etc/profile
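After sourcing /etc/profile, a quick sanity check (a sketch; the path matches the install location above):

```shell
# Confirm that ant's bin directory really ended up on PATH
export ANT_HOME=/usr/local/apache-ant-1.8.2
export PATH=$PATH:$ANT_HOME/bin
case ":$PATH:" in
  *":$ANT_HOME/bin:"*) echo "ant on PATH" ;;
  *)                   echo "ant missing from PATH" ;;
esac
```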

4. Build the hadoop-lzo native library and jar
Download:
/downloads
export CFLAGS=-m64
export CXXFLAGS=-m64
export JAVA_HOME=/usr/java/jdk1.6.0_22
ant compile-native tar
A successful build ends with:
BUILD SUCCESSFUL
Total time: 44 seconds

# Copy the jar file
cp build/hadoop-lzo-0.4.14/hadoop-lzo-0.4.14.jar /usr/local/hadoop/lib
# Copy the native library
tar -cBf - -C build/hadoop-lzo-0.4.14/lib/native . | tar -xBvf - -C /usr/local/hadoop/lib/native
Note: the JDK must also be 64-bit, otherwise 32-bit files are built and copied.
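The tar pipe above copies the whole native tree in one pass while preserving symlinks, which a plain `cp` without -a would not. A toy demonstration of the same idiom:

```shell
# Replicate a directory tree (including a symlink) through a tar pipe
src=$(mktemp -d); dst=$(mktemp -d)
mkdir "$src/native"
echo "payload" > "$src/native/libdemo.so.0.0.0"
ln -s libdemo.so.0.0.0 "$src/native/libdemo.so"
tar -cf - -C "$src" . | tar -xf - -C "$dst"   # -C switches directory first
copied=$(cat "$dst/native/libdemo.so")        # read through the copied symlink
echo "$copied"
rm -rf "$src" "$dst"
```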

After the copy the directory contains:
[root@srv145 native]# ll Linux-amd64-64/
total 1072
-rw-r--r-- 1 hadoop hadoop 104078 Oct 28 10:18 libgplcompression.a
-rw-r--r-- 1 hadoop hadoop   1140 Oct 28 10:18 libgplcompression.la
lrwxrwxrwx 1 hadoop hadoop     26 Oct 28 10:20 libgplcompression.so -> libgplcompression.so.0.0.0
lrwxrwxrwx 1 hadoop hadoop     26 Oct 28 10:20 libgplcompression.so.0 -> libgplcompression.so.0.0.0
-rwxr-xr-x 1 hadoop hadoop  68377 Oct 28 10:18 libgplcompression.so.0.0.0
-rw-rw-r-- 1 hadoop hadoop 317850 Aug 26 07:35 libhadoop.a
-rw-rw-r-- 1 hadoop hadoop    878 Aug 26 07:35 libhadoop.la
-rw-rw-r-- 1 hadoop hadoop 175902 Aug 26 07:35 libhadoop.so
-rw-rw-r-- 1 hadoop hadoop 175902 Aug 26 07:35 libhadoop.so.1
-rw-rw-r-- 1 hadoop hadoop 175902 Aug 26 07:35 libhadoop.so.1.0.0

Add to hadoop-env.sh:
export HADOOP_CLASSPATH=/usr/local/hadoop/lib/hadoop-lzo-0.4.14.jar
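If hadoop-env.sh may already set HADOOP_CLASSPATH, appending is safer than overwriting; a minimal sketch using the jar path from step 4:

```shell
# Append the lzo jar to any existing classpath instead of clobbering it
LZO_JAR=/usr/local/hadoop/lib/hadoop-lzo-0.4.14.jar
if [ -n "$HADOOP_CLASSPATH" ]; then
  export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$LZO_JAR"
else
  export HADOOP_CLASSPATH="$LZO_JAR"
fi
echo "$HADOOP_CLASSPATH"
```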

The hadoop-lzo documentation also puts JAVA_LIBRARY_PATH into this file, but for 0.20.204.0 that has no effect, because hadoop-daemon.sh
first sources hadoop-env.sh
and then runs "$HADOOP_PREFIX"/bin/hadoop.

bin/hadoop contains:
if [ -e "${HADOOP_PREFIX}/lib/libhadoop.a" ]; then
  JAVA_LIBRARY_PATH=${HADOOP_PREFIX}/lib
fi
so JAVA_LIBRARY_PATH is reset to /usr/local/hadoop/lib.
We therefore change it to:
if [ -e "${HADOOP_PREFIX}/lib/libhadoop.a" ]; then
  JAVA_LIBRARY_PATH=${HADOOP_PREFIX}/lib:/usr/local/hadoop/lib/native/Linux-amd64-64
fi

Addendum: on CDH3, adding export HADOOP_CLASSPATH=/usr/local/hadoop/lib/hadoop-lzo-0.4.14.jar is enough; JAVA_LIBRARY_PATH does not need to be set.


5. Edit the Hadoop configuration files
core-site.xml
<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>

<property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

mapred-site.xml
<property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
</property>

<property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
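These property names belong to the 0.20/1.x (and CDH3) line used here. On Hadoop 2.x the map-output compression keys were renamed; to the best of my knowledge the equivalent fragment is:

```xml
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>

<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```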


6. Restart the cluster
stop-all.sh
start-all.sh

7. Test
After uploading an .lzo file, an index must be created for it. There are two ways:
Method 1: single-process indexing
hadoop jar /usr/local/hadoop/lib/hadoop-lzo-0.4.14.jar com.hadoop.compression.lzo.LzoIndexer logs/201108/impression_witspixel2011080116.thin.log.lzo

Method 2: launches a map-reduce job that builds the index block by block, which is faster.
hadoop jar /usr/local/hadoop/lib/hadoop-lzo-0.4.14.jar com.hadoop.compression.lzo.DistributedLzoIndexer logs/201108/impression_witspixel2011080116.thin.log.lzo

8. Using hadoop streaming
Add the parameter -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat;
otherwise the job still runs with a single map.
Now run any job, say wordcount, over the new file. In Java-based M/R jobs, just replace any uses of TextInputFormat by LzoTextInputFormat. In streaming jobs, add "-inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat"
Note that if you forget to index an .lzo file, the job will work but will process the entire file in a single split, which will be less efficient.

Running the job the following way has a problem:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.204.0.jar \
-inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
-input "logs/201108/1.lzo" \
-mapper "/usr/local/hadoop/scripts/guid.pl" \
-reducer "/usr/local/hadoop/scripts/reduse_impr.pl" \
-output "results/1"
Note: run this way, the job computes abnormal values; without -inputformat the results are correct, but there is only one map.
The culprit is com.hadoop.mapred.DeprecatedLzoTextInputFormat.
The project site says:
it adds the ability to work with Hadoop streaming via the com.apache.hadoop.mapred.DeprecatedLzoTextInputFormat class
but specifying that class fails with a class-not-found error.

Looking at the source:

/**
* This class conforms to the old (org.apache.hadoop.mapred.*) hadoop API style
* which is deprecated but still required in places. Streaming, for example,
* does a check that the given input format is a descendant of
* org.apache.hadoop.mapred.InputFormat, which any InputFormat-derived class
* from the new API fails. In order for streaming to work, you must use
* com.hadoop.mapred.DeprecatedLzoTextInputFormat, not
* com.hadoop.mapreduce.LzoTextInputFormat. The classes attempt to be alike in
* every other respect.
*
* Note that to use this input format properly with hadoop-streaming, you should
* also set the property stream.map.input.ignoreKey=true. That will
* replicate the behavior of the default TextInputFormat by stripping off the byte
* offset keys from the input lines that get piped to the mapper process.
*
* See {@link LzoInputFormatCommon} for a description of the boolean property
* lzo.text.input.format.ignore.nonlzo and how it affects the
* behavior of this input format.
*/

So:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
-D stream.map.input.ignoreKey=true \
-inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
-input  "logs/201108/impression_witspixel2011083105.thin.log.lzo"  \
-mapper "/usr/local/hadoop/scripts/guid.pl"  \
-reducer "/usr/local/hadoop/scripts/reduse_impr.pl" \
-output "results/201108"

Note: CDH3 was used here; the Apache release was not tested.
