Category: Hadoop

2013-03-27 14:52:15

Pattern Name

Numerical Summarizations

Category

Summarization Patterns

Intent

Group records together by a key field and calculate a numerical aggregate per group to get a top-level view of the larger data set.

Motivation

Get a top-level view of your data when the full data set is too large to inspect record by record.

Applicability

Numerical summarizations should be used when both of the following are true:

•   You are dealing with numerical data or counting.

•   The data can be grouped by specific fields.

Structure

•   The mapper outputs keys that consist of each field to group by, and values consisting of any pertinent numerical items.

•   If the aggregate function is commutative and associative, that is, you can arbitrarily reorder the values and group the computation arbitrarily, you can use a combiner here.

•   Numerical summarizations can benefit from a custom partitioner that better distributes key/value pairs across the n reduce tasks.

•   The reducer receives the set of numerical values (v1, v2, v3, …, vn) associated with a group-by key and applies the function λ = θ(v1, v2, v3, …, vn). The value of λ is output with the given input key.
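The structure above can be sketched in plain Python, without Hadoop, for the min/max/count case. This is only an illustration under stated assumptions: the input records, the `run_job` driver, and the in-memory dictionary that stands in for shuffle and sort are all hypothetical. Because `merge` is associative and commutative, the same function could serve as both combiner and reducer logic.

```python
from collections import defaultdict

# Hypothetical input records: (group-by key, numerical value)
records = [("alice", 12), ("bob", 7), ("alice", 30), ("bob", 3), ("alice", 5)]

def mapper(record):
    """Emit the group-by key and a (min, max, count) tuple for one record."""
    key, value = record
    yield key, (value, value, 1)

def merge(a, b):
    """Merge two partial (min, max, count) tuples; associative and commutative."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2])

def reducer(key, values):
    """Fold all partial tuples for one key into a single output record."""
    values = iter(values)
    result = next(values)
    for v in values:
        result = merge(result, v)
    return key, result

def run_job(records):
    """Hypothetical driver: map, group by key (shuffle stand-in), then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))

summary = run_job(records)
```

Here `summary` holds one (min, max, count) tuple per key, which mirrors the "single record per reducer input group" described under Consequences.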

Consequences

The output of the job will be a set of part files containing a single record per reducer input group. Each record will consist of the key and all aggregate values.

Known uses

•   Word count

•   Record count

•   Min/Max/Count

•   Average/Median/Standard deviation

Resemblances

SQL:

SELECT MIN(numericalcol1), MAX(numericalcol1), COUNT(*)
FROM table GROUP BY groupcol2;

Pig:

b = GROUP a BY groupcol2;

c = FOREACH b GENERATE group, MIN(a.numericalcol1), MAX(a.numericalcol1), COUNT_STAR(a);

Performance analysis

Aggregations performed by jobs using this pattern typically perform well when the combiner is properly used.
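What "properly used" means can be shown with the average. A combiner may run on any subset of a key's values, so it has to emit something that can be merged again later. The sketch below assumes a (sum, count) pair as that intermediate form; the function names and the two input splits are hypothetical. It also shows why averaging the per-split averages gives the wrong answer.

```python
def combine(partials):
    """Merge (sum, count) pairs; safe to apply to any subset, in any order."""
    total, count = 0, 0
    for s, c in partials:
        total += s
        count += c
    return (total, count)

def average(partials):
    """Reducer step: merge the remaining partials, then divide once."""
    total, count = combine(partials)
    return total / count

# Values for one key, split across two hypothetical map tasks.
split_a = [10, 20, 30]
split_b = [100]

# Each map task's combiner pre-aggregates its own split.
partial_a = combine((v, 1) for v in split_a)
partial_b = combine((v, 1) for v in split_b)

# Correct: merge sums and counts first, divide at the end.
correct = average([partial_a, partial_b])

# Incorrect: an average of averages ignores how many values each split held.
naive = (sum(split_a) / len(split_a) + sum(split_b) / len(split_b)) / 2
```

With these inputs the merged (sum, count) approach matches the average over all four values, while the average of averages does not.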

Developers need to choose an appropriate number of reducers and take into account any data skew that may be present in the reduce groups.
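One rough way to reason about skew is to count how many records each reducer would receive under hash partitioning. The sketch below is an assumption-laden illustration: it uses CRC32 as a stable hash rather than Hadoop's own partitioner, and the key stream with a single dominant key is made up.

```python
import zlib
from collections import Counter

def partition(key, num_reducers):
    """Assign a key to a reducer using a stable hash (CRC32, as an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reducer_load(keys, num_reducers):
    """Count how many records each reducer would receive."""
    load = Counter({r: 0 for r in range(num_reducers)})
    for key in keys:
        load[partition(key, num_reducers)] += 1
    return load

# Hypothetical skewed key stream: one hot key dominates the input.
keys = ["hot"] * 90 + ["k%d" % i for i in range(10)]

num_reducers = 4
load = reducer_load(keys, num_reducers)

# Ratio of the busiest reducer's load to a perfectly even share.
skew = max(load.values()) / (len(keys) / num_reducers)
```

All records for a given key must land on the same reducer, so adding reducers does not help with a single hot key. A combiner, which shrinks the hot key's records before the shuffle, is usually the more effective remedy for this pattern.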

Numerical Summarization Examples

Minimum, Maximum, and count example

Average example

Median and standard deviation

Memory-conscious median and standard deviation
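The idea behind the memory-conscious variant can be sketched as follows: instead of holding every value for a key in memory, keep a map from each distinct value to its number of occurrences, which can also be merged by a combiner. The code below is a minimal plain-Python sketch, not the Hadoop implementation. The function name and sample data are hypothetical, and the use of the sample standard deviation (dividing by n − 1) is an assumption.

```python
import math
from collections import Counter

def median_and_stddev(value_counts):
    """Compute median and sample standard deviation from a {value: count} map."""
    n = sum(value_counts.values())
    ordered = sorted(value_counts.items())

    def value_at(index):
        # Walk the cumulative counts to find the value at a 0-based rank.
        seen = 0
        for value, count in ordered:
            seen += count
            if index < seen:
                return value
        raise IndexError(index)

    if n % 2 == 1:
        median = float(value_at(n // 2))
    else:
        median = (value_at(n // 2 - 1) + value_at(n // 2)) / 2

    mean = sum(v * c for v, c in ordered) / n
    sq_dev = sum(c * (v - mean) ** 2 for v, c in ordered)
    stddev = math.sqrt(sq_dev / (n - 1)) if n > 1 else 0.0
    return median, stddev

# Partial count maps, as two hypothetical combiners might emit for one key.
partial_1 = Counter([1, 2, 2, 3])
partial_2 = Counter([2, 3, 10])

# Merging count maps is associative and commutative, so it is combiner-safe.
merged = partial_1 + partial_2
median, stddev = median_and_stddev(merged)
```

Memory use now grows with the number of distinct values rather than the number of records, which helps most when values repeat often.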

 
