Data source: SogouQ (Sogou search engine query logs)
Statistic: the number of query words in each query (the mapper below assumes the words of a query are joined by '+')
The code is as follows. Mapper:
package Sogou;
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class SogouQueryWordCountClassifyMapper extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    public void map(LongWritable key, Text values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = values.toString();
        /* "ca::" counts the total number of query lines */
        output.collect(new Text("ca::"), one);
        /* classify by the number of query words, which are joined by '+' */
        String[] words = line.split("\\+");
        int length = words.length;
        String outline;
        if (0 == length) {
            outline = "c0::"; /* 0 query words (a line of only '+' characters) */
        } else if (1 == length) {
            outline = "c1::"; /* 1 query word */
        } else if (2 == length) {
            outline = "c2::"; /* 2 query words */
        } else if (3 == length) {
            outline = "c3::"; /* 3 query words */
        } else {
            outline = "c4::"; /* 4 or more query words */
        }
        output.collect(new Text(outline), one); /* map output, summed in the reducer */
    }
}
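To see how the classification behaves, here is a minimal standalone sketch (not part of the job; the class name and sample query strings are made up for illustration) that applies the same split("\\+") logic to a few lines:
package Sogou;
/* Hypothetical demo class, for illustration only */
public class SplitDemo {
    public static void main(String[] args) {
        String[] samples = { "hadoop", "hadoop+mapreduce", "a+b+c+d" };
        for (String line : samples) {
            int length = line.split("\\+").length;
            String outline = (0 == length) ? "c0::"
                    : (1 == length) ? "c1::"
                    : (2 == length) ? "c2::"
                    : (3 == length) ? "c3::" : "c4::";
            // prints: hadoop -> c1::, hadoop+mapreduce -> c2::, a+b+c+d -> c4::
            System.out.println(line + " -> " + outline);
        }
    }
}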
Reducer:
package Sogou;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class SogouQueryWordCountClassifyReducer extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get(); // accumulate the 1s emitted for this key
        }
        output.collect(key, new IntWritable(sum));
    }
}
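Since the reducer only adds integer counts (an associative, commutative operation), it could also serve as a combiner to pre-aggregate map output before the shuffle. This is a suggested tweak, not part of the original job; with the old mapred API it would be one extra line in the driver:
conf.setCombinerClass(Sogou.SogouQueryWordCountClassifyReducer.class);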
Partitioner:
package Sogou;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class PartitionerClass implements Partitioner<Text, IntWritable> {
    /* Route each statistic to its own reducer. Partition numbers must lie in
       [0, numPartitions), so the six keys map to partitions 0-5 (the original
       code returned 1-7, which overflows the valid range with six reducers). */
    public int getPartition(Text key, IntWritable values, int numPartitions) {
        if (numPartitions >= 6) {
            String k = key.toString();
            if (k.startsWith("ca::")) {
                return 0;
            } else if (k.startsWith("c0::")) {
                return 1;
            } else if (k.startsWith("c1::")) {
                return 2;
            } else if (k.startsWith("c2::")) {
                return 3;
            } else if (k.startsWith("c3::")) {
                return 4;
            } else if (k.startsWith("c4::")) {
                return 5;
            } else {
                /* unexpected keys fall back to a hash partition within range */
                return (k.hashCode() & Integer.MAX_VALUE) % numPartitions;
            }
        } else {
            return 0;
        }
    }
    public void configure(JobConf job) {}
}
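To sanity-check the routing, the partitioner can be exercised directly. This is a hypothetical test harness (the class name PartitionerDemo is made up); with six reducers each statistic key lands in its own partition, i.e. its own part-0000N output file:
package Sogou;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
public class PartitionerDemo {
    public static void main(String[] args) {
        PartitionerClass p = new PartitionerClass();
        String[] keys = { "ca::", "c0::", "c1::", "c2::", "c3::", "c4::" };
        for (String k : keys) {
            // expected: ca:: -> 0, c0:: -> 1, ..., c4:: -> 5
            System.out.println(k + " -> partition "
                    + p.getPartition(new Text(k), new IntWritable(1), 6));
        }
    }
}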
Driver:
package Sogou;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class SogouQueryWordCountClassify {
    public static void main(String[] args) {
        JobConf conf = new JobConf(SogouQueryWordCountClassify.class);
        conf.setJobName("SogouQueryWordCountClassify");
        // specify output types
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // specify input and output DIRECTORIES (not files)
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // specify a mapper
        conf.setMapperClass(SogouQueryWordCountClassifyMapper.class);
        // specify a reducer
        conf.setReducerClass(SogouQueryWordCountClassifyReducer.class);
        // specify a partitioner, and enough reduce tasks for its six partitions
        conf.setPartitionerClass(PartitionerClass.class);
        conf.setNumReduceTasks(6);
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
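Assuming the classes are packed into a jar (the jar name and HDFS paths below are placeholders), the job is launched the usual way, with an existing input directory of SogouQ data and a not-yet-existing output directory:
hadoop jar SogouQueryWordCountClassify.jar Sogou.SogouQueryWordCountClassify /sogou/input /sogou/output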
Results:
c1:: 19266013 // queries with 1 query word
c2:: 1621804 // queries with 2 query words
c3:: 364414 // queries with 3 query words
c4:: 174710 // queries with 4 or more query words
ca:: 21426941 // total number of queries
As a sanity check, the per-class counts add up to the total: 19,266,013 + 1,621,804 + 364,414 + 174,710 = 21,426,941, which matches ca::.