A walkthrough for setting up hadoop-lzo:
1. Install the LZO shared library on all nodes.
Download:
download/lzo-2.06.tar.gz
Install:
./configure --enable-shared && make && make install
Create symlinks for the .so files:
ln -s /usr/local/lib/liblzo2.* /usr/lib64/
2. Install lzop
lzop is the Linux command-line tool for compressing and decompressing .lzo files.
Download:
download/lzop-1.03.tar.gz
./configure
make
make install
Or simply: yum install lzop
3. Install apache-ant
tar -xvzf apache-ant-1.8.2-bin.tar.gz
mv apache-ant-1.8.2 /usr/local/
vi /etc/profile
export ANT_HOME=/usr/local/apache-ant-1.8.2
export PATH=$PATH:$ANT_HOME/bin
source /etc/profile
4. Build and install the hadoop-lzo native library and jar
Download
/downloads
export CFLAGS=-m64
export CXXFLAGS=-m64
export JAVA_HOME=/usr/java/jdk1.6.0_22
ant compile-native tar
When the build finishes:
BUILD SUCCESSFUL
Total time: 44 seconds
# Copy the jar file
cp build/hadoop-lzo-0.4.14/hadoop-lzo-0.4.14.jar /usr/local/hadoop/lib
# Copy the native library
tar -cBf - -C build/hadoop-lzo-0.4.14/lib/native . | tar -xBvf - -C /usr/local/hadoop/lib/native
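The tar pipe is used here rather than copying files one by one because it moves the whole tree in one pass while preserving the symlinks among the .so files. The same pattern can be tried in a throwaway sandbox (the directory and file names below are made up for the demo):

```shell
# Demo of the tar-pipe copy pattern: the symlink in the source tree
# arrives at the destination still as a symlink.
src=$(mktemp -d); dst=$(mktemp -d)
echo data > "$src/libdemo.so.0.0.0"
ln -s libdemo.so.0.0.0 "$src/libdemo.so"    # symlink, as in the native dir
tar -cBf - -C "$src" . | tar -xBf - -C "$dst"
ls -l "$dst"
```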
Note: the JDK must also be 64-bit; otherwise 32-bit files will be copied.
After copying, the directory contents are:
[root@srv145 native]# ll Linux-amd64-64/
total 1072
-rw-r--r-- 1 hadoop hadoop 104078 Oct 28 10:18 libgplcompression.a
-rw-r--r-- 1 hadoop hadoop 1140 Oct 28 10:18 libgplcompression.la
lrwxrwxrwx 1 hadoop hadoop 26 Oct 28 10:20 libgplcompression.so -> libgplcompression.so.0.0.0
lrwxrwxrwx 1 hadoop hadoop 26 Oct 28 10:20 libgplcompression.so.0 -> libgplcompression.so.0.0.0
-rwxr-xr-x 1 hadoop hadoop 68377 Oct 28 10:18 libgplcompression.so.0.0.0
-rw-rw-r-- 1 hadoop hadoop 317850 Aug 26 07:35 libhadoop.a
-rw-rw-r-- 1 hadoop hadoop 878 Aug 26 07:35 libhadoop.la
-rw-rw-r-- 1 hadoop hadoop 175902 Aug 26 07:35 libhadoop.so
-rw-rw-r-- 1 hadoop hadoop 175902 Aug 26 07:35 libhadoop.so.1
-rw-rw-r-- 1 hadoop hadoop 175902 Aug 26 07:35 libhadoop.so.1.0.0
Add to hadoop-env.sh:
export HADOOP_CLASSPATH=/usr/local/hadoop/lib/hadoop-lzo-0.4.14.jar
The hadoop-lzo documentation also puts JAVA_LIBRARY_PATH into this file, but on version 0.20.204.0 that actually has no effect, because hadoop-daemon.sh
first sources hadoop-env.sh
and then runs "$HADOOP_PREFIX"/bin/hadoop.
The bin/hadoop script contains:
if [ -e "${HADOOP_PREFIX}/lib/libhadoop.a" ]; then
  JAVA_LIBRARY_PATH=${HADOOP_PREFIX}/lib
fi
As a result, JAVA_LIBRARY_PATH is reset to /usr/local/hadoop/lib.
So we change it to:
if [ -e "${HADOOP_PREFIX}/lib/libhadoop.a" ]; then
  JAVA_LIBRARY_PATH=${HADOOP_PREFIX}/lib:/usr/local/hadoop/lib/native/Linux-amd64-64
fi
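The effect of this edit can be replayed outside of Hadoop, using a temp directory standing in for /usr/local/hadoop as HADOOP_PREFIX:

```shell
# Sandbox replay of the patched bin/hadoop logic: when libhadoop.a exists
# under $HADOOP_PREFIX/lib, JAVA_LIBRARY_PATH gains both the lib dir and
# the native Linux-amd64-64 dir (the latter path is the one from step 4).
HADOOP_PREFIX=$(mktemp -d)
mkdir -p "$HADOOP_PREFIX/lib"
touch "$HADOOP_PREFIX/lib/libhadoop.a"
if [ -e "${HADOOP_PREFIX}/lib/libhadoop.a" ]; then
  JAVA_LIBRARY_PATH=${HADOOP_PREFIX}/lib:/usr/local/hadoop/lib/native/Linux-amd64-64
fi
echo "$JAVA_LIBRARY_PATH"
```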
Addendum: on CDH3, adding export HADOOP_CLASSPATH=/usr/local/hadoop/lib/hadoop-lzo-0.4.14.jar is enough; JAVA_LIBRARY_PATH does not need to be set.
5. Modify the Hadoop configuration files
core-site.xml
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>

<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
mapred-site.xml
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>

<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
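A note for anyone on a newer release: in Hadoop 0.21+/2.x these two mapred-site.xml keys were renamed. To the best of my knowledge the equivalents are:

```xml
<!-- Hadoop 2.x names for the two map-output settings above -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```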
6. Restart the cluster
stop-all.sh
start-all.sh
7. Test
After uploading an .lzo file, you need to build an index for it.
There are two ways:
Method 1: build the index in a single process:
hadoop jar /usr/local/hadoop/lib/hadoop-lzo-0.4.14.jar com.hadoop.compression.lzo.LzoIndexer logs/201108/impression_witspixel2011080116.thin.log.lzo
Method 2: launch a MapReduce job that builds the index from the file's blocks, which is faster:
hadoop jar /usr/local/hadoop/lib/hadoop-lzo-0.4.14.jar com.hadoop.compression.lzo.DistributedLzoIndexer logs/201108/impression_witspixel2011080116.thin.log.lzo
8. Using Hadoop streaming
You need to add the parameter -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat,
otherwise the job still runs with a single map.
Now run any job, say wordcount, over the new file. In Java-based M/R jobs, just replace any uses of TextInputFormat by LzoTextInputFormat. In streaming jobs, add "-inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat"
Note that if you forget to index an .lzo file, the job will work but will process the entire file in a single split, which will be less efficient.
The following way of running the job has a problem:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.204.0.jar \
  -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
  -input "logs/201108/1.lzo" \
  -mapper "/usr/local/hadoop/scripts/guid.pl" \
  -reducer "/usr/local/hadoop/scripts/reduse_impr.pl" \
  -output "results/1"
Note: run this way, the job computes abnormal values; without -inputformat the results are correct, but there is only one map.
The problem lies with com.hadoop.mapred.DeprecatedLzoTextInputFormat.
The official site says:
it adds the ability to work with Hadoop streaming via the com.apache.hadoop.mapred.DeprecatedLzoTextInputFormat class
but specifying that class fails with a class-not-found error.
Looking at the source:
/**
 * This class conforms to the old (org.apache.hadoop.mapred.*) hadoop API style
 * which is deprecated but still required in places. Streaming, for example,
 * does a check that the given input format is a descendant of
 * org.apache.hadoop.mapred.InputFormat, which any InputFormat-derived class
 * from the new API fails. In order for streaming to work, you must use
 * com.hadoop.mapred.DeprecatedLzoTextInputFormat, not
 * com.hadoop.mapreduce.LzoTextInputFormat. The classes attempt to be alike in
 * every other respect.
 *
 * Note that to use this input format properly with hadoop-streaming, you should
 * also set the property stream.map.input.ignoreKey=true. That will
 * replicate the behavior of the default TextInputFormat by stripping off the byte
 * offset keys from the input lines that get piped to the mapper process.
 *
 * See {@link LzoInputFormatCommon} for a description of the boolean property
 * lzo.text.input.format.ignore.nonlzo and how it affects the
 * behavior of this input format.
 */
So the working invocation is:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
-D stream.map.input.ignoreKey=true \
-inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
-input "logs/201108/impression_witspixel2011083105.thin.log.lzo" \
-mapper "/usr/local/hadoop/scripts/guid.pl" \
-reducer "/usr/local/hadoop/scripts/reduse_impr.pl" \
-output "results/201108"
Note: this was done on CDH3; the Apache release has not been tested.