
Category: HADOOP

2014-03-04 19:38:12

I. Environment Setup
        The environment used here is hadoop-0.20.2 and hbase-0.90.4; the Hadoop setup and the HBase setup are each covered in separate posts.
        Note that the goal of this article is to run a MapReduce job on Hadoop that analyzes logs and persists the results to HBase, so compiling the program requires the HBase and ZooKeeper jars on Hadoop's classpath. Copy hbase-0.90.4.jar and zookeeper-3.3.2.jar into Hadoop's lib directory:
        #cp /root/hbase-0.90.4/hbase-0.90.4.jar /root/hadoop-0.20.2/lib
        #cp /root/hbase-0.90.4/lib/zookeeper-3.3.2.jar /root/hadoop-0.20.2/lib
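        To confirm that both jars are now where Hadoop will pick them up (same paths as above):
        #ls /root/hadoop-0.20.2/lib | grep -E 'hbase|zookeeper'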

II. Writing the Example
        The content of the log file xxxlog.txt is as follows:
        version-------------time-----------------id-------rt----filter--------id----rt-----filter
        1.0^A2014-03-03 00:00:01^Ad2000^C4^C3040^Bd2001^C7^C0
        1.0^A2014-03-03 00:00:01^Ad3000^C4^C3041^Bd2001^C7^C0
        This file also needs to be uploaded to HDFS, for example: hadoop fs -put xxxlog.txt /tmp/input.
        To persist the results, create the table and column family in HBase, for example: ./hbase shell, then create 'xxxlog', 'dsp_filter'.
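        To make the key design concrete before looking at the job itself, the following standalone sketch (a hypothetical KeyDemo class, not part of the job) shows what the two sample lines are expected to contribute as map output; it assumes the separators are the literal two-character sequences "^A", "^B" and "^C", and it simply takes the dsp list as the last ^A-delimited field of the abbreviated sample lines:
        import java.util.regex.Pattern;
        public class KeyDemo {
            public static void main(String[] args) {
                String[] lines = {
                    "1.0^A2014-03-03 00:00:01^Ad2000^C4^C3040^Bd2001^C7^C0",
                    "1.0^A2014-03-03 00:00:01^Ad3000^C4^C3041^Bd2001^C7^C0"
                };
                for (String line : lines) {
                    // Pattern.quote keeps "^" from being treated as a regex anchor
                    String[] fields = line.split(Pattern.quote("^A"));
                    String time = fields[1];
                    // bucket the timestamp into 30-minute slots: yyyyMMddHHmm with mm = 00 or 30
                    String pvtime = time.substring(0, 4) + time.substring(5, 7) + time.substring(8, 10)
                                  + time.substring(11, 13)
                                  + (Integer.parseInt(time.substring(14, 16)) <= 30 ? "00" : "30");
                    // the dsp list is the last ^A field in these sample lines
                    for (String dspInfo : fields[fields.length - 1].split(Pattern.quote("^B"))) {
                        String[] p = dspInfo.split(Pattern.quote("^C"));   // [dspid, rt, filter]
                        // map output: key = dspid^Afilter^Apvtime, value = 1
                        System.out.println(p[0] + "^A" + p[2] + "^A" + pvtime + "\t1");
                    }
                }
            }
        }
        Since d2001 with filter 0 appears in both sample lines, that key is emitted twice and should end up with sum=2 after the reduce step, while the other two keys get sum=1.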
        To keep things clear and easy to extend, the Mapper, Reducer, and Driver are split into separate files, as follows:
1. Mapper
        #vi xxxLogMaper.java
        import java.io.IOException;
        import java.util.regex.Pattern;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        public class xxxLogMaper
            extends Mapper<Object, Text, Text, IntWritable> {
            // the log uses the literal two-character separators "^A", "^B" and "^C"
            public final static String CONTROL_A        = "^A";
            public final static String CONTROL_B        = "^B";
            public final static String CONTROL_C        = "^C";
            // field positions; DSP_INFO_LIST assumes the dsp list is the 6th ^A field (index 5),
            // adjust it if the real log layout differs from the abbreviated sample above
            public final static int PV_TIME             = 1;
            public final static int DSP_INFO_LIST       = 5;
            public final static int DSP_ID              = 0;
            public final static int DSP_FILTER          = 2;
            public void map(Object key, Text value, Context context) {
                try {
                    System.out.println("\n------------map come on-----------");
                    System.out.println("\nline=-----------" + value.toString());
                    // Pattern.quote keeps "^" from being treated as a regex anchor
                    String[] line = value.toString().split(Pattern.quote(CONTROL_A));
                    System.out.println("\npvtime=-----------" + line[PV_TIME]);
                    String year = line[PV_TIME].substring(0, 4);
                    String month = line[PV_TIME].substring(5, 7);
                    String day = line[PV_TIME].substring(8, 10);
                    String hour = line[PV_TIME].substring(11, 13);
                    // round the minute down to a 30-minute bucket: "00" or "30"
                    String minute = "";
                    int m_tmp = Integer.parseInt(line[PV_TIME].substring(14, 16));
                    if (m_tmp >= 0 && m_tmp <= 30) {
                        minute = "00";
                    } else {
                        minute = "30";
                    }
                    String pvtime = year + month + day + hour + minute;
                    String[] dspInfoList = line[DSP_INFO_LIST].split(Pattern.quote(CONTROL_B));
                    String dspid = "";
                    String dspfilter = "";
                    Text k = new Text();
                    IntWritable v = new IntWritable(1);
                    for (int i = 0; i < dspInfoList.length; i++) {
                        System.out.println("\n------------map-----------");
                        System.out.println("\ndspinfo=" + dspInfoList[i]);
                        String[] dspInfo = dspInfoList[i].split(Pattern.quote(CONTROL_C));
                        dspid = dspInfo[DSP_ID];
                        dspfilter = dspInfo[DSP_FILTER];
                        // key=dspid^Afilter^Apvtime, value=1
                        k.set(dspid + CONTROL_A + dspfilter + CONTROL_A + pvtime);
                        context.write(k, v);
                        System.out.println("\nkey=" + k.toString());
                        System.out.println("\nvalue=" + v.toString());
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }
2. Reducer
        #vi xxxLogReducer.java
        import java.io.IOException;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
        import org.apache.hadoop.hbase.mapreduce.TableReducer;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.HTablePool;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.util.Bytes;
        import org.apache.hadoop.hbase.KeyValue;
        public class xxxLogReducer
            extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
            public final static String COL_FAMILY       = "dsp_filter";
            public final static String COL_NAME         = "sum";
            private final static String ZK_HOST         = "localhost";
            private final static String TABLE_NAME      = "xxxlog";
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
                System.out.println("\n------------reduce come on-----------");
                String k = key.toString();
                IntWritable v = new IntWritable();
                // sum the counts emitted by the mapper for this key
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                System.out.println("\n------------reduce-----------");
                System.out.println("\ncur-key=" + key.toString());
                System.out.println("\ncur-value=" + sum);
                // read any value already stored for this row key so the new count
                // is added to it instead of overwriting it
                Configuration conf = HBaseConfiguration.create();
                conf.set("hbase.zookeeper.quorum", ZK_HOST);
                HTablePool pool = new HTablePool(conf, 3);
                HTable table = (HTable) pool.getTable(TABLE_NAME);
                Get getrow = new Get(k.getBytes());
                Result r = table.get(getrow);
                int m_tmp = 0;
                for (KeyValue kv : r.raw()) {
                    System.out.println("\nraw-KeyValue---" + kv);
                    System.out.println("\nraw-row=>" + Bytes.toString(kv.getRow()));
                    System.out.println("\nraw-family=>" + Bytes.toString(kv.getFamily()));
                    System.out.println("\nraw-qualifier=>" + Bytes.toString(kv.getQualifier()));
                    System.out.println("\nraw-value=>" + Bytes.toString(kv.getValue()));
                    m_tmp += Integer.parseInt(Bytes.toString(kv.getValue()));
                }
                // return the table to the pool
                pool.putTable(table);
                sum = sum + m_tmp;
                v.set(sum);
                System.out.println("\nreal-key=" + key.toString());
                System.out.println("\nreal-value=" + v.toString());
                // write the accumulated sum back as dsp_filter:sum
                Put putrow = new Put(k.getBytes());
                putrow.add(COL_FAMILY.getBytes(), COL_NAME.getBytes(), String.valueOf(v).getBytes());
                try {
                    context.write(new ImmutableBytesWritable(k.getBytes()), putrow);
                } catch (IOException e) {
                    e.printStackTrace();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }
3. Driver
        #vi xxxLogDriver.java

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.util.GenericOptionsParser;
        public class xxxLogDriver {
            public final static String ZK_HOST          = "localhost";
            public final static String TABLE_NAME       = "xxxlog";
            public static void main(String[] args) throws Exception {
                // HBase configuration
                Configuration conf = HBaseConfiguration.create();
                conf.set("hbase.zookeeper.quorum", ZK_HOST);
                String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
                if (otherArgs.length != 2) {
                    System.err.println("Usage: xxxLogDriver <input path> <output path>");
                    System.exit(2);
                }
                Job job = new Job(conf, "xxxLog");
                job.setJarByClass(xxxLogDriver.class);
                FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
                FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
                System.out.println("\n------------driver come on-----------");
                job.setMapperClass(xxxLogMaper.class);
                job.setMapOutputKeyClass(Text.class);
                job.setMapOutputValueClass(IntWritable.class);
                // initTableReducerJob sets the reducer class and TableOutputFormat,
                // so the reduce output goes into the 'xxxlog' HBase table
                TableMapReduceUtil.initTableReducerJob(TABLE_NAME, xxxLogReducer.class, job);
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }

III. Compiling and Running
        Compile the sources in the current directory:
        #javac -classpath /root/hadoop-0.20.2/hadoop-0.20.2-core.jar:/root/hadoop-0.20.2/lib/commons-cli-1.2.jar:/root/hbase-0.90.4/hbase-0.90.4.jar -d ./ xxxLogMaper.java xxxLogReducer.java xxxLogDriver.java
        Note that all three files must be compiled together, otherwise the compiler cannot resolve the classes and fails with errors such as:
        xxxLogDriver.java:22: error: cannot find symbol
        job.setMapperClass(xxxLogMaper.class);
        Package the class files:
        #jar cvf xxxLog.jar *.class
        #rm -rf *.class
        Run the job:
        #hadoop jar xxxLog.jar  xxxLogDriver /tmp/input /tmp/output
        Query the result:
        #./hbase shell
        hbase(main):014:0>scan 'xxxlog'
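        With only the two sample lines loaded, three rows would be expected, roughly of the form below (the exact byte rendering of the ^A separators in the row keys and the cell timestamps will differ):
        d2000^A3040^A201403030000    column=dsp_filter:sum, value=1
        d2001^A0^A201403030000       column=dsp_filter:sum, value=2
        d3000^A3041^A201403030000    column=dsp_filter:sum, value=1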
        

