Hive Optimization Principles - levy-linux - ChinaUnix Blog


Category: HADOOP

2015-07-17 16:23:42

Reposted from: http://blog.sina.com.cn/s/blog_9f48885501017cq8.html

Having used Hive for a while, I have found the original author's advice to be accurate.

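Basic principles:

1: Filter data as early as possible to reduce the data volume at each stage. For partitioned tables, always add a partition filter, and select only the columns you actually need.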

select ... from A
join B
on A.key = B.key
where A.userid > 10
and B.userid < 10
and A.dt = '20120417'
and B.dt = '20120417';

should be rewritten as:

select ... from (select ... from A
    where dt = '20120417'
    and userid > 10
) a
join (select ... from B
    where dt = '20120417'
    and userid < 10
) b
on a.key = b.key;

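2: Keep operations as atomic as possible, and avoid packing complex logic into a single SQL statement.

Intermediate tables can be used to carry out the complex logic step by step: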

drop table if exists tmp_table_1;
create table if not exists tmp_table_1 as
select ...;

drop table if exists tmp_table_2;
create table if not exists tmp_table_2 as
select ...;

drop table if exists result_table;
create table if not exists result_table as
select ...;

-- clean up the intermediate tables once the result is built
drop table if exists tmp_table_1;
drop table if exists tmp_table_2;

3: Try to keep the number of jobs launched by a single SQL statement below 5.
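One way to estimate how many jobs a statement will launch is to inspect its plan before running it. The following is only a sketch: the tables `orders` and `users` and their columns are hypothetical. Hive's `explain` prints the stage dependencies for a query, and the map-reduce stages listed there give a rough idea of the job count.

```sql
-- Hypothetical tables and columns, for illustration only.
-- explain prints the plan (including stage dependencies) without running the query.
explain
select u.city, count(*) as order_cnt
from (select userid from orders where dt = '20120417') o
join (select userid, city from users where dt = '20120417') u
  on o.userid = u.userid
group by u.city;
```

If the plan shows many more stages than expected, that is a hint the statement may be worth splitting with intermediate tables as in principle 2.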

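4: Use mapjoin with caution. As a rule, only tables with fewer than 2,000 rows and smaller than 1 MB should be used in a mapjoin (these limits can be relaxed somewhat after the cluster is expanded). Take care to put the small table on the left side of the join (at present, much of the TCL puts the small table on the right side).

Otherwise, the join will consume large amounts of disk and memory.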
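As an illustration of the point above, a mapjoin can be requested with a query hint. This is a minimal sketch, assuming a hypothetical small dimension table `dim_city` and a hypothetical large table `fact_orders`; neither name comes from the original post. Depending on the Hive version and configuration, the hint may be ignored and the decision left to automatic join conversion (`hive.auto.convert.join`).

```sql
-- Hypothetical tables: dim_city is small, fact_orders is large.
-- The hint asks Hive to load alias c into memory on each mapper.
select /*+ MAPJOIN(c) */ o.orderid, c.city_name
from dim_city c
join fact_orders o
  on o.city_id = c.city_id
where o.dt = '20120417';
```

The small table is written on the left of the join here, in line with the advice above.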

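5: Before writing SQL, understand the characteristics of the data itself. If the query contains join or group by operations, check whether data skew is likely.

If data skew does occur, handle it as follows: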

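set hive.exec.reducers.max=200;
set mapred.reduce.tasks=200; -- increase the number of reducers
set hive.groupby.mapaggr.checkinterval=100000; -- a group by key with more records than this is split up; set the value according to the actual data volume
set hive.groupby.skewindata=true; -- set to true if the skew occurs during group by
set hive.skewjoin.key=100000; -- a join key with more records than this is split up; set the value according to the actual data volume
set hive.optimize.skewjoin=true; -- set to true if the skew occurs during join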
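Before changing these settings, it can help to confirm that the data really is skewed. A rough sketch, reusing the table `A`, join column `key`, and partition value from principle 1 (the query itself is an illustration, not part of the original post): count the rows per key and look at the largest groups.

```sql
-- Illustrative check: list the 20 most frequent join keys in one partition of A.
select key, count(*) as cnt
from A
where dt = '20120417'
group by key
order by cnt desc
limit 20;
```

If a handful of keys account for far more rows than the thresholds set above (100000 in this example), the skew settings are likely to be relevant.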

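6: If a union all has more than 2 branches, or each branch holds a large amount of data, split it into multiple insert into statements. In actual tests, this improved execution time by 50%.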

insert overwrite table tablename partition (dt= ....)
select ... from (
    select ... from A
    union all
    select ... from B
    union all
    select ... from C
) R
where ...;

can be rewritten as:

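-- note: insert into appends to the partition; unlike insert overwrite, it does not clear existing data first
insert into table tablename partition (dt= ....)
select ... from A
where ...;

insert into table tablename partition (dt= ....)
select ... from B
where ...;

insert into table tablename partition (dt= ....)
select ... from C
where ...;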
