HIVE : SORT BY VS ORDER BY VS DISTRIBUTE BY VS CLUSTER BY - speckle - ChinaUnix Blog

HIVE : SORT BY VS ORDER BY VS DISTRIBUTE BY VS CLUSTER BY

Category: Big Data

2015-09-02 09:21:27

Reposted from:

In Apache Hive, there is often confusion over how SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY differ. I have compiled the differences between them in terms of what the final output looks like and how the data in that output is ordered.

SORT BY


Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order.

Ordering : Data is sorted within each of the N reducers, but different reducers can receive overlapping ranges of data.

Outcome : N or more sorted files with overlapping ranges.

Let's look at an example:

		 
SELECT key, value FROM src SORT BY key ASC, value DESC

The query had 2 reducers, and the output of each is:

Reducer 1 :

Reducer 2 :

As we can see, the output of each reducer is ordered, but there is no total ordering, since we end up with multiple outputs, one per reducer.
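The behaviour can be mimicked outside Hive with a minimal Python sketch. The rows, the two-reducer setup and the round-robin assignment of rows to reducers are assumptions made for illustration (without DISTRIBUTE BY, Hive does not promise any particular assignment); only the per-reducer sort mirrors what SORT BY key ASC, value DESC does.

```python
# Hypothetical (key, value) rows; not the data from the original example.
rows = [(3, 6), (0, 5), (0, 3), (9, 1), (2, 5), (0, 4), (1, 1), (0, 3)]
NUM_REDUCERS = 2

def sort_by(rows, num_reducers):
    """Simulate SORT BY key ASC, value DESC: sort inside each reducer only."""
    # Assumption: rows are dealt round-robin, standing in for Hive's
    # unspecified assignment of rows to reducers.
    reducers = [rows[i::num_reducers] for i in range(num_reducers)]
    # key ascending, value descending (both numeric, so negate the value).
    return [sorted(part, key=lambda r: (r[0], -r[1])) for part in reducers]

outputs = sort_by(rows, NUM_REDUCERS)
for i, out in enumerate(outputs, start=1):
    print(f"Reducer {i}: {out}")
```

In this sketch both reducers end up holding rows with key 0, so each list is sorted on its own while their concatenation is not.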

ORDER BY

This is similar to ORDER BY in standard SQL.

In Hive, ORDER BY guarantees a total ordering of the data, but to achieve this all rows must pass through a single reducer, which is normally unacceptable for large inputs. In strict mode (hive.mapred.mode=strict), Hive therefore requires ORDER BY to be accompanied by LIMIT so that the single reducer is not overburdened.
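A minimal Python sketch of why the single reducer is the bottleneck and why a LIMIT eases it. The rows and the `order_by` helper are hypothetical; the sketch only imitates the semantics and says nothing about how Hive implements them.

```python
import heapq

# Hypothetical (key, value) rows.
rows = [(3, 6), (0, 5), (0, 3), (9, 1), (2, 5), (0, 4), (1, 1), (0, 3)]

def order_key(row):
    # ORDER BY key ASC, value DESC for numeric columns.
    return (row[0], -row[1])

def order_by(rows, limit=None):
    """Simulate ORDER BY: every row goes through one reducer."""
    if limit is None:
        # Without LIMIT the single reducer must sort the whole input.
        return sorted(rows, key=order_key)
    # With LIMIT n this sketch only keeps the n smallest rows.
    return heapq.nsmallest(limit, rows, key=order_key)

full = order_by(rows)
top3 = order_by(rows, limit=3)
print(full)
print(top3)
```

The limited call returns the same first three rows as the full sort, but never has to hold a fully sorted copy of the input.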

Ordering : Totally ordered data.

Outcome : A single, fully ordered output.

For example :

		 
SELECT key, value FROM src ORDER BY key ASC, value DESC

Reducer :

DISTRIBUTE BY

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer.

It ensures that the N reducers receive non-overlapping sets of values of the column (rows are assigned by hashing the column, not by range), but it does not sort the output of each reducer. You end up with N or more unsorted files whose key sets do not overlap.

Example (taken directly from the Hive wiki):

We are distributing the following 5 rows by x across 2 reducers:

Reducer 1 got

Reducer 2 got

Note that all rows with the same key x1 are guaranteed to be distributed to the same reducer (reducer 1 in this case), but they are not guaranteed to be clustered in adjacent positions.
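The routing can be imitated with a small Python sketch. The input rows are a hypothetical stand-in, and CRC32 is an assumption used in place of Hive's own hash function, so the reducer a given key lands on here need not match a real cluster.

```python
import zlib

# Hypothetical input: five rows with a single column x.
rows = ["x1", "x2", "x4", "x3", "x1"]
NUM_REDUCERS = 2

def partition(x, num_reducers):
    # Assumption: CRC32 stands in for Hive's own hash function.
    return zlib.crc32(x.encode("utf-8")) % num_reducers

def distribute_by(rows, num_reducers):
    """Simulate DISTRIBUTE BY x: route rows by hash, keep arrival order."""
    reducers = [[] for _ in range(num_reducers)]
    for x in rows:
        reducers[partition(x, num_reducers)].append(x)
    return reducers

outputs = distribute_by(rows, NUM_REDUCERS)
for i, out in enumerate(outputs, start=1):
    print(f"Reducer {i} got {out}")
```

Whatever the hash does, both x1 rows end up in the same list, and nothing inside a list is reordered.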

CLUSTER BY

Cluster By is a short-cut for both Distribute By and Sort By.

CLUSTER BY x ensures that each of the N reducers gets a non-overlapping set of x values, then sorts the rows by x within each reducer.

Ordering : Sorted within each reducer; there is no global ordering across reducers.

Outcome : N or more sorted files with non-overlapping sets of keys.

For the same example as above, if we use Cluster By x, the two reducers will additionally sort their rows on x:

Reducer 1 got

Reducer 2 got

x3

x4
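As a rough Python sketch of the same idea: rows are hash-partitioned on x and each reducer then sorts what it received. The input rows are hypothetical and CRC32 is an assumed stand-in for Hive's hash function, so the exact split may differ from the one a real cluster would produce.

```python
import zlib

# Hypothetical input: five rows with a single column x.
rows = ["x1", "x2", "x4", "x3", "x1"]
NUM_REDUCERS = 2

def cluster_by(rows, num_reducers):
    """Simulate CLUSTER BY x: hash-partition on x, then sort each reducer on x."""
    reducers = [[] for _ in range(num_reducers)]
    for x in rows:
        # Assumption: CRC32 stands in for Hive's own hash function.
        reducers[zlib.crc32(x.encode("utf-8")) % num_reducers].append(x)
    return [sorted(part) for part in reducers]

outputs = cluster_by(rows, NUM_REDUCERS)
for i, out in enumerate(outputs, start=1):
    print(f"Reducer {i} got {out}")
```

Each list is sorted and no key appears in both, but because keys are assigned by hash rather than by range, concatenating the lists is not guaranteed to give a globally sorted result.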

Instead of specifying Cluster By, the user can specify Distribute By and Sort By, so the partition columns and sort columns can be different.
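A minimal Python sketch of partitioning on one column and sorting on another, roughly what DISTRIBUTE BY user_id SORT BY amount DESC would do. The column names, the rows and the modulo partitioner are all assumptions made for illustration.

```python
# Hypothetical (user_id, amount) rows.
rows = [(1, 30), (2, 10), (1, 20), (4, 50), (3, 40), (2, 60)]
NUM_REDUCERS = 2

def distribute_sort(rows, num_reducers):
    """Simulate DISTRIBUTE BY user_id SORT BY amount DESC."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: modulo on the integer key stands in for Hive's hash.
        reducers[row[0] % num_reducers].append(row)
    # The partition column (user_id) and the sort column (amount) differ.
    return [sorted(part, key=lambda r: r[1], reverse=True) for part in reducers]

outputs = distribute_sort(rows, NUM_REDUCERS)
for i, out in enumerate(outputs, start=1):
    print(f"Reducer {i} got {out}")
```

Here the first reducer receives users 2 and 4 and the second receives users 1 and 3; within each reducer rows are ordered by amount, so rows of the same user are co-located but not necessarily adjacent.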

