Hadoop Streaming - xjc2694 - ChinaUnix Blog

Xiajc - Work Notes (xjc2694.blog.chinaunix.net)

Hadoop Streaming

Category: Cloud Computing

2011-10-27 10:32:48

Reference

http://hadoop.apache.org/common/docs/current/streaming.html

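1. What Hadoop Streaming is

Hadoop Streaming is a utility that ships with the Hadoop distribution. It lets you create and run map/reduce jobs using any executable or script as the mapper or the reducer.

For example, the simplest case: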

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc

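2. How Streaming works

In the example above, both the mapper and the reducer read their input line by line from standard input and write their results to standard output. Streaming creates the map/reduce job, submits it to the cluster, and monitors its progress.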

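When an executable or script is specified as the mapper, each mapper task launches it as a separate process when the mapper is initialized. As the mapper task runs, it converts its input into lines and feeds those lines to the process's standard input. Meanwhile, the mapper collects the process's standard output and converts each line into a key/value pair, which becomes the mapper's output. By default, the part of a line before the first tab is the key and the rest of the line (excluding the tab) is the value. If a line contains no tab, the whole line is the key and the value is null. How the key is determined can be customized, as described later.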
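The default splitting rule described above can be sketched in Python. This is only an illustration of the rule, not Hadoop's actual implementation: the function name `split_key_value` is hypothetical, and Python's `None` stands in for the null value.

```python
def split_key_value(line, separator="\t"):
    """Split one line of mapper output into (key, value) at the first separator."""
    line = line.rstrip("\n")
    key, found, value = line.partition(separator)
    if not found:
        # No separator in the line: the whole line is the key, the value is null.
        return line, None
    # Everything after the first separator, including later tabs, is the value.
    return key, value


print(split_key_value("apple\t3\tred\n"))
print(split_key_value("no tab here\n"))
```

Only the first tab matters: any later tabs stay inside the value.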

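When an executable or script is specified as the reducer, each reducer task launches it as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds those lines to the process's standard input. Meanwhile, the reducer collects the process's standard output and converts each line into a key/value pair, which becomes the reducer's output. By default, the part of a line before the first tab is the key and the rest of the line (excluding the tab) is the value. If a line contains no tab, the whole line is the key and the value is null. How the key is determined can be customized, as described later.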

The above is how the Map/Reduce framework communicates with the streaming mapper and reducer.
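To make the line-based contract concrete, here is a minimal Python sketch of a mapper and a reducer. Word counting is an assumed example task that does not come from the original post, and the function names `map_lines` and `reduce_lines` are hypothetical. The reducer assumes every input line contains a tab and that lines arrive sorted by key, which the framework's sort phase normally provides; here `sorted()` stands in for that phase.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts of consecutive lines that share a key.

    Assumes each line is 'key<TAB>count' and the input is sorted by key.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


# Local simulation: sorted() plays the role of the framework's sort step.
mapped = sorted(map_lines(["a b a", "b a"]))
for out in reduce_lines(mapped):
    print(out)
```

In a real job, each function would be wrapped in a script that reads `sys.stdin` and prints to standard output, and the scripts would be passed with -mapper and -reducer.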

A job can also run the map phase only, with no reduce phase:

Specifying Map-Only Jobs

Often, you may want to process input data using a map function only. To do this, simply set mapred.reduce.tasks to zero. The Map/Reduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.

-D mapred.reduce.tasks=0

To be backward compatible, Hadoop Streaming also supports the "-reducer NONE" option, which is equivalent to "-D mapred.reduce.tasks=0".
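As a sketch of how such a command line could be assembled from a script, the hypothetical Python helper below builds the streaming argument list and adds `mapred.reduce.tasks=0` when no reducer is given. The helper name `build_streaming_args` and its parameters are assumptions made for illustration. It places the -D options before the streaming options, since generic options such as -D must come first.

```python
def build_streaming_args(input_path, output_path, mapper, reducer=None, properties=None):
    """Return the argument list that follows 'hadoop jar hadoop-streaming.jar'."""
    args = []
    # Generic -D options go first, one -D per property.
    for name, value in (properties or {}).items():
        args += ["-D", f"{name}={value}"]
    if reducer is None:
        # Map-only job: no reducer tasks are created.
        args += ["-D", "mapred.reduce.tasks=0"]
    args += ["-input", input_path, "-output", output_path, "-mapper", mapper]
    if reducer is not None:
        args += ["-reducer", reducer]
    return args


print(" ".join(build_streaming_args("myInputDirs", "myOutputDir", "/bin/cat")))
```

The returned list can be appended to the `hadoop jar` command shown earlier, for example when launching the job with `subprocess.run`.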

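Options:

(1) -input: path of the input files

(2) -output: path of the output directory

(3) -mapper: the user's own mapper program, which can be an executable or a script

(4) -reducer: the user's own reducer program, which can be an executable or a script

(5) -file: packages a file into the submitted job. This can be any file the mapper or reducer needs as input, such as a configuration file or a dictionary.

(6) -partitioner: a user-defined partitioner (a Java class)

(7) -combiner: a user-defined combiner (must be implemented in Java)

(8) -D: job properties (previously set with -jobconf), including the ones listed below.

Note: -D options must come first, before the streaming options. To set several properties, repeat -D once per property.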

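1) mapred.map.tasks: number of map tasks

2) mapred.reduce.tasks: number of reduce tasks

3) stream.map.input.field.separator / stream.map.output.field.separator: field separator for map task input/output data. Both default to \t.

4) stream.num.map.output.key.fields: number of fields in a map task output record that make up the key, that is, how many fields are used for sorting. If set to 2, the first and second fields together form the key and take part in sorting. It does not seem possible to make the second field alone the key.

5) stream.reduce.input.field.separator / stream.reduce.output.field.separator: field separator for reduce task input/output data. Both default to \t.

6) stream.num.reduce.output.key.fields: number of fields in a reduce task output record that make up the key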
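The effect of the separator and key-field-count properties can be illustrated with a small Python sketch. It mimics the behavior described above rather than reproducing Hadoop's code. The function name `split_by_key_fields` is hypothetical, and returning `None` as the value when a line has too few separators is an assumption made for this illustration.

```python
def split_by_key_fields(line, num_key_fields=1, separator="\t"):
    """Treat the first num_key_fields fields as the key and the rest as the value."""
    line = line.rstrip("\n")
    # Split at most num_key_fields times so the value keeps any later separators.
    parts = line.split(separator, num_key_fields)
    if len(parts) <= num_key_fields:
        # Too few separators: the whole line is the key (assumed null value).
        return line, None
    key = separator.join(parts[:num_key_fields])
    return key, parts[num_key_fields]


# With two key fields, the first and second fields together form the key.
print(split_by_key_fields("2011\t10\tlogin\tuser1", num_key_fields=2))
```

With `num_key_fields=2`, the key always starts from the first field, which matches the observation that the second field alone cannot be selected as the key this way.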

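-D property=value: Use value for given property.

Here, property can be any property that may be set in the core, hdfs, or mapred configuration files.

The -D option can be used with many hadoop commands, for example dfs. For details, see: http://hadoop.apache.org/common/docs/current/commands_manual.html

For example, to set the replication factor when uploading a file: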

hadoop dfs -D dfs.replication=10 -put 70M logs/2
