Category: HADOOP

2015-07-14 17:12:24

The WordCount program below was tested successfully on Hadoop 1.2.1.


    package hadoopdemo.wordcount;

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {

        // Mapper: input is (byte offset, line), output is (word, 1).
        public static class Map extends
                Mapper<LongWritable, Text, Text, IntWritable> {

            // Reused across calls to avoid allocating one object per token.
            private final IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String line = value.toString();
                // Split the line on whitespace and emit (word, 1) per token.
                StringTokenizer token = new StringTokenizer(line);
                while (token.hasMoreTokens()) {
                    word.set(token.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer: input is (word, [1, 1, ...]), output is (word, total).
        public static class Reduce extends
                Reducer<Text, IntWritable, Text, IntWritable> {

            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf);
            job.setJarByClass(WordCount.class);
            job.setJobName("wordcount");

            // Output types of the job; the map output types default to these.
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            // args[0] is the input path, args[1] the output path.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Report job success or failure through the process exit code.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
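Package the code above into a jar and run it with the following command. The main class is given by its fully qualified name because it is declared in the package hadoopdemo.wordcount:
hadoop jar /usr/local/wordcount.jar hadoopdemo.wordcount.WordCount /input /output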

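1. The Map class (WordCount.Map) extends org.apache.hadoop.mapreduce.Mapper. Its four generic type parameters are, in order, the map function's input key type, input value type, output key type, and output value type.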

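2. The Reduce class (WordCount.Reduce) extends org.apache.hadoop.mapreduce.Reducer. Its four generic type parameters have the same meaning as those of the Mapper: input key, input value, output key, and output value.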

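3. The map output types must match the reduce input types. In the common case, as in this example, the map output types are also the same as the reduce output types, so the reduce input and output types coincide. This is why the driver only calls setOutputKeyClass and setOutputValueClass; when the map output types differ from the job output types, they have to be declared separately with setMapOutputKeyClass and setMapOutputValueClass.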
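The type flow described in points 1 to 3 can be illustrated without a cluster. The following is only a minimal in-memory sketch, not Hadoop code: the class LocalWordCount and its methods are hypothetical names, plain Java types stand in for LongWritable, Text and IntWritable, and the grouping step is a simplified stand-in for Hadoop's shuffle.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-alone sketch: imitates map -> shuffle -> reduce in memory.
// It uses no Hadoop classes.
class LocalWordCount {

    // Map phase: (offset, line) in, a list of (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(tokens.nextToken(), 1));
        }
        return out;
    }

    // Reduce phase: (word, [1, 1, ...]) in, the summed count out.
    // The input value type equals the map output value type.
    static int reduce(String word, Iterable<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: group the map output by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
            // Assumption: single-byte characters and '\n' line endings.
            offset += line.length() + 1;
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {dog=2, happy=1, the=2}
        System.out.println(run(Arrays.asList("the happy dog", "the dog")));
    }
}
```

The map output pair (String, Integer) is exactly what reduce consumes, and reduce emits the same pair types again, which mirrors the Text/IntWritable arrangement in the job above.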

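4. Hadoop determines the input format from the following line:
job.setInputFormatClass(TextInputFormat.class);
TextInputFormat is Hadoop's default input format and extends FileInputFormat. It divides the input data into smaller pieces called InputSplits, and each InputSplit is processed by one mapper. The InputFormat also supplies a RecordReader implementation that parses an InputSplit into <key,value> pairs and passes them to the map function:
key: the byte offset of the start of the line within the file, of type LongWritable.
value: the content of the line, of type Text.
Therefore, in this example, the map function's input key/value types are LongWritable and Text.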
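To show what the record reader hands to the map function, here is a minimal sketch in plain Java. It is not the Hadoop implementation: LineOffsets and records are hypothetical names, the input is a byte array rather than an InputSplit, and the sketch assumes '\n' line endings only. The key is taken as the byte position of the line start in the whole input, which is my understanding of how TextInputFormat's reader reports it.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of how a line-oriented reader yields (offset, line) records.
class LineOffsets {

    // Returns byte offset of each line start -> line text, in input order.
    static Map<Long, String> records(byte[] data) {
        Map<Long, String> out = new LinkedHashMap<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                // The line covers bytes [start, i); the newline itself is dropped.
                out.put((long) start,
                        new String(data, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        // Emit a final line that has no trailing newline.
        if (start < data.length) {
            out.put((long) start,
                    new String(data, start, data.length - start, StandardCharsets.UTF_8));
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] input = "the happy\nthe dog\n".getBytes(StandardCharsets.UTF_8);
        // prints {0=the happy, 10=the dog}
        System.out.println(records(input));
    }
}
```

Each entry corresponds to one call of map(LongWritable key, Text value, Context context) in the job above. WordCount ignores the key and only tokenizes the value.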

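5. Hadoop determines the output format from the following line:
job.setOutputFormatClass(TextOutputFormat.class);
TextOutputFormat is Hadoop's default output format. It writes each record as one line of a text file, with the key and the value separated by a tab character by default, for example: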
the 30
happy 23
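The one-record-per-line layout is easy to produce and to read back. The following is a small sketch in plain Java, not Hadoop's TextOutputFormat itself: TabSeparatedOutput, format and parse are hypothetical names, and the tab separator is assumed to be the default.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the "key<TAB>value" line layout of the job output.
class TabSeparatedOutput {

    // Writes one "key\tvalue" line per record.
    static String format(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            sb.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    // Reads the same layout back, e.g. from a downloaded part file.
    static Map<String, Integer> parse(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : text.split("\n")) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            int tab = line.indexOf('\t');
            counts.put(line.substring(0, tab),
                       Integer.parseInt(line.substring(tab + 1)));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("the", 30);
        counts.put("happy", 23);
        // Prints the two records, one per line, key and value separated by a tab.
        System.out.print(format(counts));
    }
}
```

Parsing on the first tab keeps the sketch simple; it assumes keys contain no tab, which holds for WordCount because the tokenizer splits on whitespace.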
