
2013-04-10 08:49:28

Pattern Name

Partition Pruning

Category

Input and Output Patterns

Description

Partition pruning configures the way the framework picks input splits and drops files from being loaded into the MapReduce job based on the name of the file.
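
As a quick sketch of dropping files by name, the stock FileInputFormat already accepts a user-supplied PathFilter. The following assumes input files are named like "events-2012-01.txt"; the naming convention and the class name are made up for illustration.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Hypothetical filter: only event files for the year being queried are
// handed to the job; everything else is dropped before any input splits
// are computed. The "events-YYYY-..." file naming is an assumption.
public class Year2012PathFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        String name = path.getName();
        // Accept anything that is not an event file (e.g. directories),
        // plus event files for the requested year only.
        return !name.startsWith("events-") || name.startsWith("events-2012-");
    }
}

In the driver this would be registered with FileInputFormat.setInputPathFilter(job, Year2012PathFilter.class); files the filter rejects never become input splits or map tasks.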

Intent

You have a set of data that is partitioned by a predetermined value, which you can use to dynamically load the data based on what is requested by the application.

Motivation

By partitioning the data on a common value, you can save significant processing time by looking only where the requested data could exist.

The added caveat to this pattern is that it should be handled transparently, so you can run the same MapReduce job over and over again, just over different data sets. This is done simply by changing which data you are querying for, rather than changing the implementation of the job. A great way to do this is to abstract away how the data is stored on the file system and push that knowledge into an input format. The input format knows where to locate and read the data, allowing the number of map tasks generated to change based on the query.

Applicability

 

Structure

The InputFormat is where this pattern comes to life. The getSplits method deserves special attention, because it determines the input splits that will be created, and thus the number of map tasks. While job configuration is typically a set of file paths, here the configuration turns into more of a query than a list of files.

The RecordReader implementation depends on how the data is being stored. If it is file-based input, something like a LineRecordReader can be used to read key/value pairs from a file. If it is an external source, you'll have to build something customized to your needs.
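
To make the structure concrete, here is a minimal, hypothetical sketch of such an input format (not the Redis example listed below): it extends FileInputFormat and prunes files by a date range taken from the job configuration. The "events-2012-01-31.txt"-style file names and the pruning.start.date/pruning.end.date property names are assumptions for illustration.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical input format that prunes partitions by a date range taken
// from the job configuration. It assumes one flat directory of files whose
// names embed the partition date, e.g. "events-2012-01-31.txt"; the layout
// and the property names below are illustrative, not part of the pattern.
public class DatePrunedTextInputFormat extends FileInputFormat<LongWritable, Text> {

    public static final String START_DATE = "pruning.start.date"; // e.g. "2012-01-01"
    public static final String END_DATE   = "pruning.end.date";   // e.g. "2012-12-31"

    private static final Pattern DATE_IN_NAME = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    @Override
    protected List<FileStatus> listStatus(JobContext job) throws IOException {
        // Defaults keep everything if no query was supplied.
        String start = job.getConfiguration().get(START_DATE, "0000-00-00");
        String end   = job.getConfiguration().get(END_DATE, "9999-12-31");

        List<FileStatus> pruned = new ArrayList<FileStatus>();
        for (FileStatus status : super.listStatus(job)) {
            String date = partitionDate(status.getPath().getName());
            // Keep only files whose partition date falls inside the requested
            // range; anything else never becomes an input split or a map task.
            // Lexicographic order of "YYYY-MM-DD" matches chronological order.
            if (date != null && date.compareTo(start) >= 0 && date.compareTo(end) <= 0) {
                pruned.add(status);
            }
        }
        return pruned;
    }

    // Pull "YYYY-MM-DD" out of a file name, or null if there is none.
    private String partitionDate(String fileName) {
        Matcher m = DATE_IN_NAME.matcher(fileName);
        return m.find() ? m.group() : null;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // File-based data, so a plain LineRecordReader is enough here.
        return new LineRecordReader();
    }
}

Because FileInputFormat.getSplits() builds its splits from whatever listStatus() returns, filtering there is enough to change the number of map tasks; getSplits() itself does not need to be overridden here.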

Consequences

Partition pruning changes only the amount of data that is read by the MapReduce job, not the eventual outcome of the analytic. The main reason for partition pruning is to reduce the overall processing time to read in data. This is done by ignoring input that will not produce any output before it even gets to a map task.

Known uses

 

Resemblances

SQL:

CREATE TABLE parted_data
  (foo_date DATE)
PARTITION BY RANGE (foo_date)
(
  PARTITION foo_2012 VALUES LESS THAN (TO_DATE('01/01/2013', 'DD/MM/YYYY')),
  PARTITION foo_2011 VALUES LESS THAN (TO_DATE('01/01/2012', 'DD/MM/YYYY'))
);

SELECT * FROM parted_data WHERE foo_date = TO_DATE('31/01/2012', 'DD/MM/YYYY');
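
Only the foo_2012 partition has to be read to answer that SELECT. The MapReduce equivalent is to put the "WHERE clause" into the job configuration instead of hand-picking input paths. Below is a minimal driver sketch reusing the hypothetical DatePrunedTextInputFormat from the Structure section; it is an illustration, not a prescribed API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PrunedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The "query": only partitions inside this date range will be read.
        conf.set(DatePrunedTextInputFormat.START_DATE, "2012-01-01");
        conf.set(DatePrunedTextInputFormat.END_DATE, "2012-12-31");

        Job job = Job.getInstance(conf, "partition pruning example");
        job.setJarByClass(PrunedJobDriver.class);
        job.setInputFormatClass(DatePrunedTextInputFormat.class);

        // The root of all partitions; the input format prunes beneath it,
        // so the path added here never changes from one query to the next.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Identity, map-only job just to show the input side.
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Only the configured date range changes from run to run, which is exactly the transparency described in the Motivation section.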

Performance analysis

The data in this pattern is loaded into each map task as fast as in any other pattern; only the number of tasks changes based on the query at hand. Utilizing this pattern can provide massive gains by not creating tasks that would never have generated output anyway. Outside of the I/O savings, performance depends on the other patterns being applied in the map and reduce phases of the job.

Examples

Partitioning by last access date to Redis instances

Querying for user reputation by last access date
