Category: HADOOP

2013-04-10 08:48:42

Pattern Name

External Source Input

Category

Input and Output Patterns

Description

The external source input pattern doesn’t load data from HDFS, but instead from some system outside of Hadoop, such as an SQL database or a web service.

Intent

You want to load data in parallel from a source that is not part of your MapReduce framework.

Motivation

With this pattern, you can hook the MapReduce framework up to an external source, such as a database or a web service, and pull the data directly into the mappers.

With a MapReduce approach, the data is loaded in parallel rather than serially. The caveat is that, in order to scale, the source needs well-defined boundaries along which the data can be read in parallel.

Applicability

 

Structure

>The InputFormat creates all the InputSplit objects, which may be based on a custom class. An input split is a logical chunk of the input, and its definition depends largely on the format of the data being read.

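The range arithmetic behind split creation can be sketched without the Hadoop classes themselves. The class and method names below are illustrative, not from any real API; an actual implementation would extend org.apache.hadoop.mapreduce.InputFormat and return InputSplit objects, but would compute the per-split boundaries in much the same way:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch: how a custom InputFormat might carve an external
// source (e.g. rows keyed 0..totalRows-1) into contiguous input splits.
public class ExternalSplitPlanner {

    // One logical chunk of the external source: rows [startRow, endRow).
    public static final class Range {
        public final long startRow, endRow;
        public Range(long s, long e) { startRow = s; endRow = e; }
    }

    // Divide totalRows into numSplits near-equal contiguous ranges,
    // spreading the remainder over the first few splits.
    public static List<Range> planSplits(long totalRows, int numSplits) {
        List<Range> splits = new ArrayList<>();
        long base = totalRows / numSplits;
        long extra = totalRows % numSplits;
        long start = 0;
        for (int i = 0; i < numSplits; i++) {
            long size = base + (i < extra ? 1 : 0);
            splits.add(new Range(start, start + size));
            start += size;
        }
        return splits;
    }
}
```

Each Range would become one InputSplit, and therefore one map task.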
>The InputSplit contains all the knowledge of where the sources are and how much of each source is going to be read. The framework uses the location information to help determine where to assign the map task. A custom InputSplit must also implement the Writable interface, because the framework uses the methods of this interface to transmit the input split information to a TaskTracker. The number of map tasks distributed among TaskTrackers equals the number of input splits generated by the input format. The InputSplit is then used to initialize a RecordReader for processing.

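A minimal sketch of the Writable contract a custom InputSplit must satisfy: write and readFields must serialize and deserialize the same fields in the same order. The class below is standalone and its fields are made up for illustration; a real split would implement Hadoop's Writable interface and would also supply getLength() and getLocations():

```java
import java.io.*;

// Sketch of Writable-style serialization for a custom InputSplit, so the
// split's host, start row, and row count survive the trip to a TaskTracker.
public class DbInputSplitSketch {
    private String host = "";   // where the source data lives (for locality)
    private long startRow;      // first row this split reads
    private long numRows;       // how many rows it reads

    public DbInputSplitSketch() { }     // Writable requires a no-arg constructor
    public DbInputSplitSketch(String host, long startRow, long numRows) {
        this.host = host; this.startRow = startRow; this.numRows = numRows;
    }

    // Mirrors Writable.write(DataOutput)
    public void write(DataOutput out) throws IOException {
        out.writeUTF(host);
        out.writeLong(startRow);
        out.writeLong(numRows);
    }

    // Mirrors Writable.readFields(DataInput) -- same field order as write()
    public void readFields(DataInput in) throws IOException {
        host = in.readUTF();
        startRow = in.readLong();
        numRows = in.readLong();
    }

    public String getHost() { return host; }
    public long getStartRow() { return startRow; }
    public long getNumRows() { return numRows; }
}
```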
>The RecordReader uses the provided job configuration and the InputSplit information to read key/value pairs. The implementation of this class depends on the data source being read. It sets up any connections required to read data from the external source, such as using JDBC to load from a database or issuing a REST call to access a RESTful service.

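As a rough illustration of the JDBC case, the split's row range can be turned into the paged query such a RecordReader might issue. The table, columns, and helper name below are assumptions, not from any real API; a real RecordReader would open a java.sql.Connection in initialize() and advance through the ResultSet in nextKeyValue():

```java
// Sketch of the RecordReader side: given a split's row range, build the
// SQL a JDBC-backed reader might run. Table and column names are invented
// for illustration only.
public class DbRecordReaderSketch {
    public static String queryForSplit(String table, long startRow, long numRows) {
        // Stable ORDER BY plus LIMIT/OFFSET keeps each split's page disjoint.
        return "SELECT id, payload FROM " + table
             + " ORDER BY id LIMIT " + numRows + " OFFSET " + startRow;
    }
}
```

Because every split reads a disjoint page, the map tasks can run in parallel without overlapping work.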
 

Consequences

Data is loaded from the external source into the MapReduce job, and the map phase doesn't know or care where that data came from.

Known uses

 

Resemblances

 

Performance analysis

The bottleneck for a MapReduce job implementing this pattern will be the external source or the network. The source may not scale well under many parallel connections, and the network infrastructure between the cluster and the source may not be provisioned for the resulting traffic.

Examples

Reading from Redis instances

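For the Redis example, one common strategy is one input split per Redis instance, so that each map task connects to exactly one instance and reads its data (for example, fetching a hash with a client such as Jedis). The sketch below shows only the split-per-host bookkeeping; the host list format is an assumption for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a split-per-instance strategy for reading from Redis:
// each configured host becomes one input split, and the split's
// "location" is simply that host.
public class RedisSplitSketch {
    // Parse a comma-separated host list (e.g. from the job configuration)
    // into one entry per split, ignoring blanks and surrounding spaces.
    public static List<String> splitsForInstances(String csvHosts) {
        List<String> hosts = new ArrayList<>();
        for (String h : csvHosts.split(",")) {
            String t = h.trim();
            if (!t.isEmpty()) hosts.add(t);
        }
        return hosts;
    }
}
```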