A Hadoop Beginner's Classic: The WordCount Program - levy-linux - ChinaUnix Blog

A Hadoop Beginner's Classic: The WordCount Program

Category: HADOOP

2015-07-14 17:12:24

The WordCount program below was tested successfully on Hadoop 1.2.1.


package hadoopdemo.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: for each input line, emit (word, 1) for every whitespace-separated token.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        // Reused across calls to avoid allocating a new Text per token.
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum all the counts emitted for one word and write (word, total).
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: WordCount <input path> <output path>");
            System.exit(2);
        }

        Configuration conf = new Configuration();
        // new Job(conf) matches the Hadoop 1.2.1 API this program was tested on;
        // Hadoop 2.x deprecates it in favor of Job.getInstance(conf).
        Job job = new Job(conf);
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");

        // Types of the final (reduce) output; the map output uses the same types here.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Exit with a non-zero status if the job fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

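Package the code above into a jar and run it with the command below. Because the class is declared in the package hadoopdemo.wordcount, the main class has to be given by its fully qualified name. The output directory (/output here) must not exist before the job runs.
hadoop jar /usr/local/wordcount.jar hadoopdemo.wordcount.WordCount /input /output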
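
To follow the data flow of the job without a cluster, here is a minimal plain-Java sketch of the same map, shuffle, and reduce steps. It does not use any Hadoop classes; the class name LocalWordCount and its methods are made up for illustration, and the in-memory grouping only approximates what Hadoop does between the map and reduce phases.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical, Hadoop-free simulation of the WordCount data flow.
class LocalWordCount {

    // "Map" step: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(tokens.nextToken(), 1));
        }
        return out;
    }

    // "Shuffle" step: group all values by key, with keys kept in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" step: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run all three steps over the given lines and return word -> count.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {cat=1, dog=1, happy=1, the=2}
        System.out.println(run(Arrays.asList("the happy dog", "the cat")));
    }
}
```

In the real job the pairs are partitioned across reducers and sorted by key before reduce is called; the TreeMap here only mimics that ordering on a single machine.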

1. The Map class extends org.apache.hadoop.mapreduce.Mapper. Its four generic type parameters are, in order, the map function's input key type, input value type, output key type, and output value type.

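2. The Reduce class extends org.apache.hadoop.mapreduce.Reducer. Its four generic type parameters have the same meaning as the Mapper's, applied to the reduce function: input key, input value, output key, and output value.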

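3. The map output types must match the reduce input types. In many jobs, including this one, the map output types are also the same as the reduce output types, so the reducer's input and output types coincide (Text and IntWritable here). When the map output types differ from the job's final output types, they have to be declared separately with job.setMapOutputKeyClass and job.setMapOutputValueClass.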

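4. Hadoop determines the input format from the following line:
job.setInputFormatClass(TextInputFormat.class);
TextInputFormat is Hadoop's default input format and extends FileInputFormat. It divides the input into smaller pieces called InputSplits, and each InputSplit is processed by one mapper. The InputFormat also supplies a RecordReader implementation that parses an InputSplit into <key, value> records and passes them to the map function:
key: the byte offset of the start of the line within the file; its type is LongWritable.
value: the content of the line; its type is Text.
Therefore, in this example, the map function's input key/value types are LongWritable and Text.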
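
As a rough illustration of the <key, value> records described above, the following plain-Java sketch turns a block of text into (byte offset, line) pairs. It is not Hadoop's RecordReader: the class name LineOffsets is hypothetical, and the sketch assumes UTF-8 text with '\n' line endings and ignores split boundaries.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the (offset, line) records a line-oriented reader produces.
class LineOffsets {

    // Returns a map from the byte offset of each line start to the line's text.
    // Assumes UTF-8 content whose lines end in '\n'.
    static Map<Long, String> records(String content) {
        Map<Long, String> out = new LinkedHashMap<>();
        if (content.isEmpty()) {
            return out;
        }
        long offset = 0;
        for (String line : content.split("\n")) {
            out.put(offset, line);
            // Advance past the line's bytes plus the one-byte '\n' terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        // prints {0=the happy, 10=happy dog}
        System.out.println(records("the happy\nhappy dog\n"));
    }
}
```

The WordCount mapper ignores the key entirely and only tokenizes the value, which is why the offsets never show up in the job's output.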

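5. Hadoop determines the output format from the following line:
job.setOutputFormatClass(TextOutputFormat.class);
TextOutputFormat is Hadoop's default output format. It writes each record as one line of a text file, with the key and value separated by a tab by default, for example:
the	30
happy	23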
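
The line layout described above can be sketched in plain Java as follows. This only imitates the default "key, tab, value, newline" layout; the class name TextOutputSketch is made up, and no Hadoop classes are involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the one-record-per-line text layout.
class TextOutputSketch {

    // Formats each (key, value) pair as "key<TAB>value" followed by a newline.
    static String format(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            sb.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("the", 30);
        counts.put("happy", 23);
        // Prints two lines: "the<TAB>30" and "happy<TAB>23".
        System.out.print(format(counts));
    }
}
```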
