Category: HADOOP
2013-03-27 14:52:15
Pattern Name |
Numerical Summarizations |
Category |
Summarization Patterns |
Intent |
Group records together by a key field and calculate a numerical aggregate per group to get a top-level view of the larger data set. |
Motivation |
Get a top-level view of your data. |
Applicability |
Numerical summarizations should be used when both of the following are true: • You are dealing with numerical data or counting. • The data can be grouped by specific fields. |
Structure |
• The mapper outputs keys that consist of each field to group by, and values consisting of any pertinent numerical items. • If the aggregate function is associative and commutative, that is, you can arbitrarily change the order of the values and group the computation arbitrarily, you can use a combiner here. • Numerical summaries can benefit from a custom partitioner to better distribute key/value pairs across the n reduce tasks. • The reducer receives a set of numerical values (v1, v2, v3, …, vn) associated with a group-by key and performs the function λ = θ(v1, v2, v3, …, vn). The value of λ is output with the given input key. |
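The structure above can be sketched in plain Python, without Hadoop, as a single-process simulation. This is a minimal illustration, not the Hadoop API: the names `mapper`, `merge`, and `run_job`, and the `(user, value)` record layout, are assumptions made for the example. Because taking the minimum, taking the maximum, and adding counts all tolerate arbitrary ordering and grouping, the same `merge` function can play the role of both combiner and reducer.

```python
from collections import defaultdict

def mapper(record):
    """Emit (group-by key, (min, max, count)) for one input record."""
    user, value = record  # hypothetical record layout: (key field, number)
    return user, (value, value, 1)

def merge(a, b):
    """Merge two partial (min, max, count) tuples; order and grouping do not matter."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2])

def run_job(splits):
    """Simulate map -> combine (per input split) -> shuffle -> reduce."""
    shuffled = defaultdict(list)
    for split in splits:
        combined = {}  # combiner state for this one map task
        for record in split:
            key, value = mapper(record)
            combined[key] = merge(combined[key], value) if key in combined else value
        for key, value in combined.items():
            shuffled[key].append(value)  # one record per key per map task
    result = {}
    for key, values in shuffled.items():
        acc = values[0]
        for partial in values[1:]:
            acc = merge(acc, partial)  # reducer applies the same merge
        result[key] = acc
    return result

splits = [[("alice", 3), ("bob", 7), ("alice", 9)], [("alice", 1), ("bob", 2)]]
summary = run_job(splits)
```

Here `summary` maps each key to one `(min, max, count)` tuple, which mirrors the "single record per reducer input group" described under Consequences.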
Consequences |
The output of the job will be a set of part files containing a single record per reducer input group. Each record will consist of the key and all aggregate values. |
Known uses |
• Word count • Record count • Min/Max/Count • Average/Median/Standard deviation |
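Not every use in this list can reuse its reducer as a combiner unchanged. Averaging is the usual example: an average of averages is wrong when groups differ in size. A common workaround, sketched below in plain Python with hypothetical names (`combine`, `average`) and made-up input values, is to carry `(sum, count)` pairs and divide only at the very end.

```python
def combine(pairs):
    """Merge partial (sum, count) pairs; safe in any order or grouping."""
    total, count = 0.0, 0
    for s, c in pairs:
        total += s
        count += c
    return total, count

def average(pairs):
    """Final reduce step: divide only once, after all partials are merged."""
    total, count = combine(pairs)
    return total / count

split_a = [1.0, 2.0, 3.0]  # values seen by one map task
split_b = [10.0]           # values seen by another map task

partial_a = combine((v, 1) for v in split_a)
partial_b = combine((v, 1) for v in split_b)
correct = average([partial_a, partial_b])

# Averaging the per-split averages weights the single value 10.0 too heavily.
naive = (sum(split_a) / len(split_a) + sum(split_b) / len(split_b)) / 2
```

With these inputs the true mean of the four values is 4.0, while the naive average of averages gives 6.0.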
Resemblances |
SQL: SELECT MIN(numericalcol1), MAX(numericalcol1), COUNT(*) FROM table GROUP BY groupcol2;
Pig: b = GROUP a BY groupcol2; c = FOREACH b GENERATE group, MIN(a.numericalcol1), MAX(a.numericalcol1), COUNT_STAR(a); |
Performance analysis |
Aggregations performed by jobs using this pattern typically perform well when the combiner is properly used. Developers need to choose an appropriate number of reducers and take into account any data skew that may be present in the reduce groups. |
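To get a feel for data skew, one rough check is to count how many mapper output records each reduce task would receive under a hash partitioner. The sketch below is plain Python, not Hadoop code: `partition` uses `zlib.crc32` as a deterministic stand-in for a hash partitioner, and the key distribution (one "hot" key holding 90% of the records) is invented for illustration.

```python
import zlib

def partition(key, num_reducers):
    """Hypothetical hash partitioner: map a key to a reduce task index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reducer_loads(keys, num_reducers):
    """Count the records routed to each reduce task."""
    loads = [0] * num_reducers
    for key in keys:
        loads[partition(key, num_reducers)] += 1
    return loads

# Made-up workload: one dominant key plus 100 distinct small keys.
keys = ["hot"] * 900 + [f"user{i}" for i in range(100)]
loads = reducer_loads(keys, 4)

# Ratio of the busiest reducer to the mean load; 1.0 means perfectly even.
skew = max(loads) / (sum(loads) / len(loads))
```

These counts are taken before any combiner runs. For the associative aggregates in this pattern, a combiner collapses the repeated hot key inside each map task, which is one reason it matters so much here.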
Numerical Summarization Examples |
• Minimum, maximum, and count example • Average example • Median and standard deviation • Memory-conscious median and standard deviation |
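The post only lists the example titles, so the following is an assumption about what "memory-conscious" could mean rather than the original example: instead of buffering every value for a key, keep a map from each distinct value to its number of occurrences, then compute the median and standard deviation from those counts. Such count maps also merge by simple addition, which makes them combiner-friendly. The function name `median_and_stddev` and the sample values are hypothetical.

```python
import math
from collections import Counter

def median_and_stddev(counts):
    """counts maps value -> occurrences. Returns (median, sample std dev)."""
    n = sum(counts.values())
    lo_idx, hi_idx = (n - 1) // 2, n // 2  # middle position(s), zero-based
    seen, lo, hi = 0, None, None
    for value in sorted(counts):
        seen += counts[value]
        if lo is None and seen > lo_idx:
            lo = value
        if seen > hi_idx:
            hi = value
            break
    median = (lo + hi) / 2
    mean = sum(v * c for v, c in counts.items()) / n
    if n > 1:
        variance = sum(c * (v - mean) ** 2 for v, c in counts.items()) / (n - 1)
    else:
        variance = 0.0  # a single observation has no spread
    return median, math.sqrt(variance)

# Partial count maps from two map tasks merge by addition.
merged = Counter([1, 2, 2]) + Counter([3, 10])
median, stddev = median_and_stddev(merged)
```

Memory use grows with the number of distinct values per key rather than the number of records, so this helps most when values repeat often.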