Category: HADOOP
2013-04-10 08:48:42
Pattern Name: External Source Input

Category: Input and Output Patterns

Description: The external source input pattern doesn't load data from HDFS, but instead from some system outside of Hadoop, such as an SQL database or a web service.
Intent: You want to load data in parallel from a source that is not part of your MapReduce framework.
Motivation: With this pattern, you can hook the MapReduce framework up to an external source, such as a database or a web service, and pull the data directly into the mappers. The data is loaded in parallel rather than serially. The caveat is that, in order to scale, the source needs well-defined boundaries along which the data can be read in parallel.
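A SQL table with a numeric primary key is a good example of such boundaries: each mapper can be handed a disjoint id range (this is essentially what Hadoop's DBInputFormat does with numeric keys). Below is a minimal, dependency-free sketch of that range computation; the class and method names are illustrative, not part of any Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: split the key range [minId, maxId] into at most
// numSplits disjoint WHERE clauses, one per map task.
public class RangeSplitter {
    public static List<String> splits(long minId, long maxId, int numSplits) {
        List<String> clauses = new ArrayList<>();
        long total = maxId - minId + 1;
        long chunk = (total + numSplits - 1) / numSplits; // ceiling division
        for (long lo = minId; lo <= maxId; lo += chunk) {
            long hi = Math.min(lo + chunk - 1, maxId);
            clauses.add("WHERE id BETWEEN " + lo + " AND " + hi);
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Three disjoint ranges covering ids 1..100.
        for (String c : splits(1, 100, 3)) {
            System.out.println(c);
        }
    }
}
```

Each clause defines one split's share of the table, so the mappers never overlap and together cover every row exactly once.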
Applicability:
Structure:

- The InputFormat creates all the InputSplit objects, which may be based on a custom object. An input split is a chunk of logical input, and its form largely depends on the format of the data being read.
- The InputSplit contains all the knowledge of where the sources are and how much of each source is going to be read. The framework uses this location information to help determine where to assign each map task. A custom InputSplit must also implement the Writable interface, because the framework uses the methods of this interface to transmit the input split information to a TaskTracker. The number of map tasks distributed among the TaskTrackers equals the number of input splits generated by the input format. Each InputSplit is then used to initialize a RecordReader for processing.
- The RecordReader uses the job configuration and the InputSplit information to read key/value pairs. Its implementation depends on the data source being read: it sets up any connections required to read from the external source, such as using JDBC to load from a database or issuing REST calls to access a RESTful service.
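The division of labor above can be mirrored in a dependency-free sketch. The real classes extend Hadoop's InputFormat, InputSplit, and RecordReader; the stand-in classes here (all names illustrative) show which object owns which responsibility, with an in-memory map standing in for the external source:

```java
import java.util.*;

public class ExternalSourceSketch {

    // Stand-in for InputSplit: records where one chunk of the source lives.
    static class HostSplit {
        final String host; // location hint the framework would use for scheduling
        HostSplit(String host) { this.host = host; }
    }

    // Stand-in for InputFormat.getSplits(): one split per external partition,
    // so the number of map tasks equals the number of partitions.
    static List<HostSplit> getSplits(Collection<String> hosts) {
        List<HostSplit> splits = new ArrayList<>();
        for (String h : hosts) splits.add(new HostSplit(h));
        return splits;
    }

    // Stand-in for RecordReader: "connects" to the host named in its split
    // and iterates that host's key/value pairs. A real reader would open a
    // JDBC connection or issue REST calls here instead of reading a map.
    static class HostRecordReader {
        private final Iterator<Map.Entry<String, String>> it;
        private Map.Entry<String, String> current;

        HostRecordReader(HostSplit split, Map<String, Map<String, String>> source) {
            this.it = source.get(split.host).entrySet().iterator();
        }
        boolean nextKeyValue() {
            if (!it.hasNext()) return false;
            current = it.next();
            return true;
        }
        String getCurrentKey()   { return current.getKey(); }
        String getCurrentValue() { return current.getValue(); }
    }

    public static void main(String[] args) {
        // In-memory stand-in for two external partitions (e.g. two DB shards).
        Map<String, Map<String, String>> source = new TreeMap<>();
        source.put("shard-a", Map.of("k1", "v1", "k2", "v2"));
        source.put("shard-b", Map.of("k3", "v3"));

        int records = 0;
        for (HostSplit split : getSplits(source.keySet())) { // one "map task" each
            HostRecordReader reader = new HostRecordReader(split, source);
            while (reader.nextKeyValue()) {
                records++;
            }
        }
        // Two splits were generated, and all three records were read.
        System.out.println("splits=" + source.size() + " records=" + records);
    }
}
```

Note that the split only *describes* the work (which host, how much data); no connection is opened until the RecordReader is initialized on the TaskTracker that received the split.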
Consequences: Data is loaded from the external source into the MapReduce job, and the map phase neither knows nor cares where that data came from.
Known uses:
Resemblances:
Performance analysis: The bottleneck for a MapReduce job implementing this pattern is going to be the external source or the network. The source may not scale well under many parallel connections, and the network infrastructure between the cluster and the source may not be provisioned for the aggregate traffic of many simultaneous mappers.
Examples: Reading from Redis instances
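In the Redis variant of this pattern, each map task is typically assigned one Redis instance and reads that instance's data in full; keys are distributed across instances at write time by hashing. A dependency-free sketch of that key-to-instance assignment follows (host names are illustrative; an actual reader would open a connection with a Redis client such as Jedis inside the RecordReader):

```java
import java.util.*;

public class RedisKeyPartitioner {
    // Each key is deterministically owned by exactly one Redis instance.
    // One InputSplit is created per instance, so a mapper reads everything
    // its instance holds and nothing else.
    static String instanceFor(String key, List<String> instances) {
        int idx = Math.floorMod(key.hashCode(), instances.size());
        return instances.get(idx);
    }

    public static void main(String[] args) {
        List<String> instances =
                List.of("redis-0:6379", "redis-1:6379", "redis-2:6379");

        // Group some sample keys by the instance that owns them.
        Map<String, List<String>> assignment = new TreeMap<>();
        for (String host : instances) assignment.put(host, new ArrayList<>());
        for (String key : List.of("user:1", "user:2", "user:3",
                                  "user:4", "user:5", "user:6")) {
            assignment.get(instanceFor(key, instances)).add(key);
        }
        assignment.forEach((host, keys) -> System.out.println(host + " <- " + keys));
    }
}
```

Because the assignment is deterministic, the same hashing used by the writers tells the input format which instance holds which keys, giving the well-defined read boundaries this pattern requires.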