
2013-04-10 08:49:28

Pattern Name

Partition Pruning

Category

Input and Output Patterns

Description

Partition pruning configures the way the framework picks input splits and drops files from being loaded into the MapReduce job based on the name of the file.
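
As a quick sketch of dropping files by name, the stock FileInputFormat already accepts a user-supplied PathFilter. The following assumes input files are named like "events-2012-01.txt"; the naming convention and the class name are made up for illustration.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Hypothetical filter: only event files for the year being queried are
// handed to the job; everything else is dropped before any input splits
// are computed. The "events-YYYY-..." file naming is an assumption.
public class Year2012PathFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        String name = path.getName();
        // Accept anything that is not an event file (e.g. directories),
        // plus event files for the requested year only.
        return !name.startsWith("events-") || name.startsWith("events-2012-");
    }
}

In the driver this would be registered with FileInputFormat.setInputPathFilter(job, Year2012PathFilter.class); files the filter rejects never become input splits or map tasks.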

Intent

You have a set of data that is partitioned by a predetermined value, which you can use to dynamically load the data based on what is requested by the application.

Motivation

By partitioning the data on a common value, you can save significant processing time by looking only where the requested data could exist.

The added caveat to this pattern is that it should be handled transparently, so you can run the same MapReduce job over and over again, just over different data sets. This is done simply by changing which data you are querying for, rather than changing the implementation of the job. A great way to do this is to abstract away how the data is stored on the file system and push that knowledge into an input format. The input format knows where to locate and read the data, allowing the number of map tasks generated to change based on the query.

Applicability

 

Structure

The InputFormat is where this pattern comes to life. The getSplits method deserves special attention, because it determines the input splits that will be created, and thus the number of map tasks. While job configuration is typically a set of file paths, here the configuration turns into more of a query than a list of files.

The RecordReader implementation depends on how the data is being stored. If it is file-based input, something like a LineRecordReader can be used to read key/value pairs from a file. If it is an external source, you'll have to build something customized to your needs.
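
To make the structure concrete, here is a minimal, hypothetical sketch of such an input format (not the Redis example listed below): it extends FileInputFormat and prunes files by a date range taken from the job configuration. The "events-2012-01-31.txt"-style file names and the pruning.start.date/pruning.end.date property names are assumptions for illustration.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical input format that prunes partitions by a date range taken
// from the job configuration. It assumes one flat directory of files whose
// names embed the partition date, e.g. "events-2012-01-31.txt"; the layout
// and the property names below are illustrative, not part of the pattern.
public class DatePrunedTextInputFormat extends FileInputFormat<LongWritable, Text> {

    public static final String START_DATE = "pruning.start.date"; // e.g. "2012-01-01"
    public static final String END_DATE   = "pruning.end.date";   // e.g. "2012-12-31"

    private static final Pattern DATE_IN_NAME = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    @Override
    protected List<FileStatus> listStatus(JobContext job) throws IOException {
        // Defaults keep everything if no query was supplied.
        String start = job.getConfiguration().get(START_DATE, "0000-00-00");
        String end   = job.getConfiguration().get(END_DATE, "9999-12-31");

        List<FileStatus> pruned = new ArrayList<FileStatus>();
        for (FileStatus status : super.listStatus(job)) {
            String date = partitionDate(status.getPath().getName());
            // Keep only files whose partition date falls inside the requested
            // range; anything else never becomes an input split or a map task.
            // Lexicographic order of "YYYY-MM-DD" matches chronological order.
            if (date != null && date.compareTo(start) >= 0 && date.compareTo(end) <= 0) {
                pruned.add(status);
            }
        }
        return pruned;
    }

    // Pull "YYYY-MM-DD" out of a file name, or null if there is none.
    private String partitionDate(String fileName) {
        Matcher m = DATE_IN_NAME.matcher(fileName);
        return m.find() ? m.group() : null;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // File-based data, so a plain LineRecordReader is enough here.
        return new LineRecordReader();
    }
}

Because FileInputFormat.getSplits() builds its splits from whatever listStatus() returns, filtering there is enough to change the number of map tasks; getSplits() itself does not need to be overridden here.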

Consequences

Partition pruning changes only the amount of data that is read by the MapReduce job, not the eventual outcome of the analytic. The main reason for partition pruning is to reduce the overall processing time to read in data. This is done by ignoring input that will not produce any output before it even gets to a map task.

Known uses

 

Resemblances

SQL:

CREATE TABLE parted_data
  (foo_date DATE)
PARTITION BY RANGE (foo_date)
(
  PARTITION foo_2012 VALUES LESS THAN (TO_DATE('01/01/2013', 'DD/MM/YYYY')),
  PARTITION foo_2011 VALUES LESS THAN (TO_DATE('01/01/2012', 'DD/MM/YYYY'))
);

SELECT * FROM parted_data WHERE foo_date = TO_DATE('31/01/2012', 'DD/MM/YYYY');
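
Only the foo_2012 partition has to be read to answer that SELECT. The MapReduce equivalent is to put the "WHERE clause" into the job configuration instead of hand-picking input paths. Below is a minimal driver sketch reusing the hypothetical DatePrunedTextInputFormat from the Structure section; it is an illustration, not a prescribed API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PrunedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The "query": only partitions inside this date range will be read.
        conf.set(DatePrunedTextInputFormat.START_DATE, "2012-01-01");
        conf.set(DatePrunedTextInputFormat.END_DATE, "2012-12-31");

        Job job = Job.getInstance(conf, "partition pruning example");
        job.setJarByClass(PrunedJobDriver.class);
        job.setInputFormatClass(DatePrunedTextInputFormat.class);

        // The root of all partitions; the input format prunes beneath it,
        // so the path added here never changes from one query to the next.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Identity, map-only job just to show the input side.
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Only the configured date range changes from run to run, which is exactly the transparency described in the Motivation section.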

Performance analysis

The data in this pattern is loaded into each map task as fast as in any other pattern; only the number of tasks changes based on the query at hand. Utilizing this pattern can provide massive gains by not creating tasks that would never have generated output anyway. Outside of the I/O savings, performance depends on the other patterns being applied in the map and reduce phases of the job.

Examples

Partitioning by last access date to Redis instances

Querying for user reputation by last access date
