Following my earlier post
http://www.blogjava.net/Skynet/archive/2009/09/25/296420.html
and the official documentation:
I now have a CPU-intensive computation to run, e.g. timestamp conversion.
# Raw input file (userid, appid, timestamp in YYYYMMDDhhmmss)
> xxxxx@163.com 1019 20110622230010
# First split the file into as many chunks as the number of map tasks we want (by line count)
$ split -l 500000 userid_appid_time.pplog.day.data
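# A quick sanity check on the chunks (split names them xaa, xab, ... by default,
# which is what the x* glob in the -input path below matches):
$ wc -l x*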
$ vim run.pm
#!/usr/bin/perl -an
# Map script: rewrite column 3 from YYYYMMDDhhmmss into "YYYY-MM-DD hh:mm:ss",
# then ask `date` for the Unix epoch, caching results in %h so each distinct
# timestamp forks only once. (`date` output keeps its trailing newline,
# which terminates the output record.)
chomp;
@F = split "\t";
$F[2] =~ s/(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})/$1-$2-$3 $4:$5:$6/;
$h{$F[2]} = `date -d "$F[2]" +%s` if not exists $h{$F[2]};
print "$F[0]\t$F[1]\t$h{$F[2]}";
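Forking `date` once per distinct timestamp is still the hot spot. A pure-Perl sketch of the same mapper (my own variant, not part of the original run.pm) using the core Time::Local module drops the subprocess entirely:

#!/usr/bin/perl -n
# Sketch: same record format, but compute the epoch in-process.
use Time::Local;
chomp;
my @F = split "\t";
# Pull the six components straight out of YYYYMMDDhhmmss.
my ($Y, $m, $d, $H, $M, $S) = $F[2] =~ /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/;
# timelocal() matches `date -d` (local timezone); its month argument is 0-based.
print "$F[0]\t$F[1]\t", timelocal($S, $M, $H, $d, $m - 1, $Y), "\n";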
# Run the job
hadoop jar ./contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
-input hdfs:///tmp/pplog/userid_appid_time.pplog.day/x* \
-mapper run.pm \
-file /opt/sohudba/20111230/uniqname_pool/run.pm \
-output hdfs:///tmp/lky/streamingx3
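Before submitting, the mapper can be smoke-tested locally on a few lines of the raw file:

$ head -3 userid_appid_time.pplog.day.data | perl run.pm

Since no -reducer is given, the job is map-only; per the -info help further down, appending -numReduceTasks 0 skips the sort/shuffle step so the map output is written out directly.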
# Output:
# Help (the full usage text from -info)
hadoop jar ./contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar -info

12/01/09 16:38:00 ERROR streaming.StreamJob: Missing required options: input, output
Usage: $HADOOP_HOME/bin/hadoop jar \
          $HADOOP_HOME/hadoop-streaming.jar [options]
Options:
  -input    <path>                   DFS input file(s) for the Map step
  -output   <path>                   DFS output directory for the Reduce step
  -mapper   <cmd|JavaClassName>      The streaming command to run
  -combiner <cmd|JavaClassName>      The streaming command to run
  -reducer  <cmd|JavaClassName>      The streaming command to run
  -file     <file>                   File/dir to be shipped in the Job jar file
  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
  -outputformat TextOutputFormat(default)|JavaClassName Optional.
  -partitioner JavaClassName         Optional.
  -numReduceTasks <num>              Optional.
  -inputreader <spec>                Optional.
  -cmdenv <n>=<v>                    Optional. Pass env.var to streaming commands
  -mapdebug <path>                   Optional. To run this script when a map task fails
  -reducedebug <path>                Optional. To run this script when a reduce task fails
  -io <identifier>                   Optional.
  -verbose

Generic options supported are
  -conf <configuration file>     specify an application configuration file
  -D <property=value>            use value for given property
  -fs <local|namenode:port>      specify a namenode
  -jt <local|jobtracker:port>    specify a job tracker
  -files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
  -libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
  -archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
  bin/hadoop command [genericOptions] [commandOptions]

In -input: globbing on <path> is supported and can have multiple -input
Default Map input format: a line is a record in UTF-8
  the key part ends at first TAB, the rest of the line is the value
Custom input format: -inputformat package.MyInputFormat
Map output format, reduce input/output format:
  Format defined by what the mapper command outputs. Line-oriented

The files named in the -file argument[s] end up in the
  working directory when the mapper and reducer are run.
  The location of this working directory is unspecified.

To set the number of reduce tasks (num. of output files):
  -D mapred.reduce.tasks=10
To skip the sort/combine/shuffle/sort/reduce step:
  Use -numReduceTasks 0
  A Task's Map output then becomes a 'side-effect output' rather than a reduce input
  This speeds up processing, This also feels more like "in-place" processing
  because the input filename and the map input order are preserved
  This equivalent -reducer NONE

To speed up the last maps:
  -D mapred.map.tasks.speculative.execution=true
To speed up the last reduces:
  -D mapred.reduce.tasks.speculative.execution=true
To name the job (appears in the JobTracker Web UI):
  -D mapred.job.name='My Job'
To change the local temp directory:
  -D dfs.data.dir=/tmp/dfs
  -D stream.tmpdir=/tmp/streaming
Additional local temp directories with -cluster local:
  -D mapred.local.dir=/tmp/local
  -D mapred.system.dir=/tmp/system
  -D mapred.temp.dir=/tmp/temp
To treat tasks with non-zero exit status as SUCCEDED:
  -D stream.non.zero.exit.is.failure=false
Use a custom hadoopStreaming build along a standard hadoop install:
  $HADOOP_HOME/bin/hadoop jar /path/my-hadoop-streaming.jar [...] \
    [...] -D stream.shipped.hadoopstreaming=/path/my-hadoop-streaming.jar
For more details about jobconf parameters see:
  http://wiki.apache.org/hadoop/JobConfFile
To set an environement variable in a streaming command:
  -cmdenv EXAMPLE_DIR=/home/example/dictionaries/

Shortcut:
  setenv HSTREAMING "$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar"

Example: $HSTREAMING -mapper "/usr/local/bin/perl5 filter.pl"
  -file /local/filter.pl -input "/logs/0604*/*" [...]
  Ships a script, invokes the non-shipped perl interpreter
  Shipped files go to the working directory so filter.pl is found by perl
  Input files are all the daily logs for days in month 2006-04

Streaming Command Failed!
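Tying this help back to the job above, the relevant knobs would combine like this (a sketch only; the job name is made up, and the -D generic options must precede the streaming options):

hadoop jar ./contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
 -D mapred.job.name='pplog timestamp convert' \
 -D mapred.map.tasks.speculative.execution=true \
 -input hdfs:///tmp/pplog/userid_appid_time.pplog.day/x* \
 -mapper run.pm \
 -file /opt/sohudba/20111230/uniqname_pool/run.pm \
 -numReduceTasks 0 \
 -output hdfs:///tmp/lky/streamingx3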