Category: HADOOP
2013-04-10 08:48:42
Pattern Name: External Source Input

Category: Input and Output Patterns

Description: The external source input pattern doesn't load data from HDFS, but instead from some system outside of Hadoop, such as an SQL database or a web service.
Intent: You want to load data in parallel from a source that is not part of your MapReduce framework.
Motivation: With this pattern, you can hook the MapReduce framework up to an external source, such as a database or a web service, and pull the data directly into the mappers. The data is loaded in parallel rather than serially. The caveat is that, in order to scale, the source needs well-defined boundaries along which the data can be read in parallel.
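A SQL table with a numeric primary key is a good example of such boundaries: each mapper can be handed a disjoint id range (this is essentially what Hadoop's DBInputFormat does with numeric keys). Below is a minimal, dependency-free sketch of that range computation; the class and method names are illustrative, not part of any Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: split the key range [minId, maxId] into at most
// numSplits disjoint WHERE clauses, one per map task.
public class RangeSplitter {
    public static List<String> splits(long minId, long maxId, int numSplits) {
        List<String> clauses = new ArrayList<>();
        long total = maxId - minId + 1;
        long chunk = (total + numSplits - 1) / numSplits; // ceiling division
        for (long lo = minId; lo <= maxId; lo += chunk) {
            long hi = Math.min(lo + chunk - 1, maxId);
            clauses.add("WHERE id BETWEEN " + lo + " AND " + hi);
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Three disjoint ranges covering ids 1..100.
        for (String c : splits(1, 100, 3)) {
            System.out.println(c);
        }
    }
}
```

Each clause defines one split's share of the table, so the mappers never overlap and together cover every row exactly once.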
Applicability:
Structure:

- The InputFormat creates all the InputSplit objects, which may be based on a custom object. An input split is a chunk of logical input, and its form largely depends on the format of the data being read.
- The InputSplit contains all the knowledge of where the sources are and how much of each source is going to be read. The framework uses this location information to help determine where to assign each map task. A custom InputSplit must also implement the Writable interface, because the framework uses the methods of this interface to transmit the input split information to a TaskTracker. The number of map tasks distributed among the TaskTrackers equals the number of input splits generated by the input format. Each InputSplit is then used to initialize a RecordReader for processing.
- The RecordReader uses the job configuration and the InputSplit information to read key/value pairs. Its implementation depends on the data source being read: it sets up any connections required to read from the external source, such as using JDBC to load from a database or issuing REST calls to access a RESTful service.
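The division of labor above can be mirrored in a dependency-free sketch. The real classes extend Hadoop's InputFormat, InputSplit, and RecordReader; the stand-in classes here (all names illustrative) show which object owns which responsibility, with an in-memory map standing in for the external source:

```java
import java.util.*;

public class ExternalSourceSketch {

    // Stand-in for InputSplit: records where one chunk of the source lives.
    static class HostSplit {
        final String host; // location hint the framework would use for scheduling
        HostSplit(String host) { this.host = host; }
    }

    // Stand-in for InputFormat.getSplits(): one split per external partition,
    // so the number of map tasks equals the number of partitions.
    static List<HostSplit> getSplits(Collection<String> hosts) {
        List<HostSplit> splits = new ArrayList<>();
        for (String h : hosts) splits.add(new HostSplit(h));
        return splits;
    }

    // Stand-in for RecordReader: "connects" to the host named in its split
    // and iterates that host's key/value pairs. A real reader would open a
    // JDBC connection or issue REST calls here instead of reading a map.
    static class HostRecordReader {
        private final Iterator<Map.Entry<String, String>> it;
        private Map.Entry<String, String> current;

        HostRecordReader(HostSplit split, Map<String, Map<String, String>> source) {
            this.it = source.get(split.host).entrySet().iterator();
        }
        boolean nextKeyValue() {
            if (!it.hasNext()) return false;
            current = it.next();
            return true;
        }
        String getCurrentKey()   { return current.getKey(); }
        String getCurrentValue() { return current.getValue(); }
    }

    public static void main(String[] args) {
        // In-memory stand-in for two external partitions (e.g. two DB shards).
        Map<String, Map<String, String>> source = new TreeMap<>();
        source.put("shard-a", Map.of("k1", "v1", "k2", "v2"));
        source.put("shard-b", Map.of("k3", "v3"));

        int records = 0;
        for (HostSplit split : getSplits(source.keySet())) { // one "map task" each
            HostRecordReader reader = new HostRecordReader(split, source);
            while (reader.nextKeyValue()) {
                records++;
            }
        }
        // Two splits were generated, and all three records were read.
        System.out.println("splits=" + source.size() + " records=" + records);
    }
}
```

Note that the split only *describes* the work (which host, how much data); no connection is opened until the RecordReader is initialized on the TaskTracker that received the split.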
Consequences: Data is loaded from the external source into the MapReduce job, and the map phase neither knows nor cares where that data came from.
Known uses:
Resemblances:
Performance analysis: The bottleneck for a MapReduce job implementing this pattern is going to be the external source or the network. The source may not scale well under many parallel connections, and the network infrastructure between the cluster and the source may not be provisioned for the aggregate traffic of many simultaneous mappers.
Examples: Reading from Redis instances
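In the Redis variant of this pattern, each map task is typically assigned one Redis instance and reads that instance's data in full; keys are distributed across instances at write time by hashing. A dependency-free sketch of that key-to-instance assignment follows (host names are illustrative; an actual reader would open a connection with a Redis client such as Jedis inside the RecordReader):

```java
import java.util.*;

public class RedisKeyPartitioner {
    // Each key is deterministically owned by exactly one Redis instance.
    // One InputSplit is created per instance, so a mapper reads everything
    // its instance holds and nothing else.
    static String instanceFor(String key, List<String> instances) {
        int idx = Math.floorMod(key.hashCode(), instances.size());
        return instances.get(idx);
    }

    public static void main(String[] args) {
        List<String> instances =
                List.of("redis-0:6379", "redis-1:6379", "redis-2:6379");

        // Group some sample keys by the instance that owns them.
        Map<String, List<String>> assignment = new TreeMap<>();
        for (String host : instances) assignment.put(host, new ArrayList<>());
        for (String key : List.of("user:1", "user:2", "user:3",
                                  "user:4", "user:5", "user:6")) {
            assignment.get(instanceFor(key, instances)).add(key);
        }
        assignment.forEach((host, keys) -> System.out.println(host + " <- " + keys));
    }
}
```

Because the assignment is deterministic, the same hashing used by the writers tells the input format which instance holds which keys, giving the well-defined read boundaries this pattern requires.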