Following my earlier post
http://www.blogjava.net/Skynet/archive/2009/09/25/296420.html
and the official documentation:
I now have a CPU-intensive computation to run, for example timestamp conversion.
# Original file (fields are TAB-separated)
> xxxxx@163.com 1019 20110622230010
# First split the file (by line count) into as many pieces as the number of map tasks we want
$ split -l 500000 userid_appid_time.pplog.day.data
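By default split names its pieces xaa, xab, ..., which is what the x* glob in the job's -input path matches. The upload to HDFS is not shown in the original; a minimal sketch, assuming the directory used by the job command below:

$ hadoop fs -mkdir hdfs:///tmp/pplog/userid_appid_time.pplog.day
$ hadoop fs -put x* hdfs:///tmp/pplog/userid_appid_time.pplog.day/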
$ vim run.pm
#!/usr/bin/perl -an
# -n: loop over every input line; -a: autosplit into @F (we re-split on TAB below)
chomp;
@F = split "\t";
# Rewrite 20110622230010 as "2011-06-22 23:00:10" so date(1) can parse it
$F[2] =~ s/(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})/$1-$2-$3 $4:$5:$6/;
# Cache each distinct timestamp so we fork date(1) only once per value
$h{$F[2]} = `date -d "$F[2]" +%s` if not exists $h{$F[2]};
# date's output keeps its trailing newline, which terminates the record
print "$F[0]\t$F[1]\t$h{$F[2]}";
$ Run
hadoop jar ./contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
-input hdfs:///tmp/pplog/userid_appid_time.pplog.day/x* \
-mapper run.pm \
-file /opt/sohudba/20111230/uniqname_pool/run.pm \
-output hdfs:///tmp/lky/streamingx3
$ Output:
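The original output listing was not preserved. Streaming writes its results as part-* files under the -output directory, so they can be inspected with:

$ hadoop fs -cat hdfs:///tmp/lky/streamingx3/part-* | head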
$ Help
hadoop jar ./contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar -info

12/01/09 16:38:00 ERROR streaming.StreamJob: Missing required options: input, output
Usage: $HADOOP_HOME/bin/hadoop jar \
          $HADOOP_HOME/hadoop-streaming.jar [options]
Options:
  -input    <path>     DFS input file(s) for the Map step
  -output   <path>     DFS output directory for the Reduce step
  -mapper   <cmd|JavaClassName>      The streaming command to run
  -combiner <cmd|JavaClassName>      The streaming command to run
  -reducer  <cmd|JavaClassName>      The streaming command to run
  -file     <file>     File/dir to be shipped in the Job jar file
  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
  -outputformat TextOutputFormat(default)|JavaClassName  Optional.
  -partitioner JavaClassName  Optional.
  -numReduceTasks <num>  Optional.
  -inputreader <spec>  Optional.
  -cmdenv   <n>=<v>    Optional. Pass env.var to streaming commands
  -mapdebug <path>     Optional. To run this script when a map task fails
  -reducedebug <path>  Optional. To run this script when a reduce task fails
  -io <identifier>     Optional.
  -verbose

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]


In -input: globbing on <path> is supported and can have multiple -input
Default Map input format: a line is a record in UTF-8
  the key part ends at first TAB, the rest of the line is the value
Custom input format: -inputformat package.MyInputFormat
Map output format, reduce input/output format:
  Format defined by what the mapper command outputs. Line-oriented

The files named in the -file argument[s] end up in the
  working directory when the mapper and reducer are run.
  The location of this working directory is unspecified.

To set the number of reduce tasks (num. of output files):
  -D mapred.reduce.tasks=10
To skip the sort/combine/shuffle/sort/reduce step:
  Use -numReduceTasks 0
  A Task's Map output then becomes a 'side-effect output' rather than a reduce input
  This speeds up processing, This also feels more like "in-place" processing
  because the input filename and the map input order are preserved
  This equivalent -reducer NONE

To speed up the last maps:
  -D mapred.map.tasks.speculative.execution=true
To speed up the last reduces:
  -D mapred.reduce.tasks.speculative.execution=true
To name the job (appears in the JobTracker Web UI):
  -D mapred.job.name='My Job'
To change the local temp directory:
  -D dfs.data.dir=/tmp/dfs
  -D stream.tmpdir=/tmp/streaming
Additional local temp directories with -cluster local:
  -D mapred.local.dir=/tmp/local
  -D mapred.system.dir=/tmp/system
  -D mapred.temp.dir=/tmp/temp
To treat tasks with non-zero exit status as SUCCEDED:
  -D stream.non.zero.exit.is.failure=false
Use a custom hadoopStreaming build along a standard hadoop install:
  $HADOOP_HOME/bin/hadoop jar /path/my-hadoop-streaming.jar [...]\
    [...] -D stream.shipped.hadoopstreaming=/path/my-hadoop-streaming.jar
For more details about jobconf parameters see:
  http://wiki.apache.org/hadoop/JobConfFile
To set an environement variable in a streaming command:
   -cmdenv EXAMPLE_DIR=/home/example/dictionaries/

Shortcut:
   setenv HSTREAMING "$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar"

Example: $HSTREAMING -mapper "/usr/local/bin/perl5 filter.pl"
           -file /local/filter.pl -input "/logs/0604*/*" [...]
  Ships a script, invokes the non-shipped perl interpreter
  Shipped files go to the working directory so filter.pl is found by perl
  Input files are all the daily logs for days in month 2006-04


Streaming Command Failed!
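Since the timestamp conversion is a pure map step with no aggregation, the -numReduceTasks 0 advice above applies directly; a hedged sketch of the same job run map-only (the job name and output path are illustrative, not from the original):

hadoop jar ./contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
 -D mapred.job.name='timestamp-to-epoch' \
 -input hdfs:///tmp/pplog/userid_appid_time.pplog.day/x* \
 -mapper run.pm \
 -file /opt/sohudba/20111230/uniqname_pool/run.pm \
 -numReduceTasks 0 \
 -output hdfs:///tmp/lky/streamingx3-maponly

Each map task's output then lands directly as a side-effect file, preserving the input filename and map input order.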