Category: HADOOP
2013-03-28 14:26:40
Pattern Name |
Inverted Index Summarizations |
Category |
Summarization Patterns |
Intent |
Generate an index from a data set to allow for faster searches or data enrichment capabilities. |
Motivation |
It is often convenient to index large data sets on keywords, so that searches can trace terms back to records that contain specific values. While building an inverted index does require extra processing up front, taking the time to do so can greatly reduce the amount of time it takes to find something. |
Applicability |
Inverted indexes should be used when quick search query responses are required. The results of such a query can be preprocessed and ingested into a database. |
Structure |
- The mapper outputs the desired fields for the index as the key and the unique identifier as the value.
- The combiner can be omitted if you are just using the identity reducer.
- The partitioner is responsible for determining where values with the same key will eventually be copied by a reducer for final output. It can be customized for more efficient load balancing if the intermediate keys are not evenly distributed.
- The reducer will receive a set of unique record identifiers to map back to the input key.
- The final output is a set of part files that contain a mapping of field value to a set of unique IDs of records containing the associated field value. |
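The structure above can be sketched without a Hadoop cluster. The following is a minimal, in-memory simulation in plain Python, not Hadoop API code: the function names (`map_record`, `partition`, `reduce_group`, `build_inverted_index`), the whitespace tokenizer, and the CRC32-based partitioner are all assumptions made for illustration.

```python
from collections import defaultdict
from zlib import crc32


def map_record(record_id, text):
    """Mapper: emit (index key, record id) for each distinct term in the record."""
    for term in set(text.lower().split()):
        yield term, record_id


def partition(key, num_reducers):
    """Partitioner: a stable hash decides which reducer receives a key."""
    return crc32(key.encode("utf-8")) % num_reducers


def reduce_group(key, record_ids):
    """Reducer: collapse all ids seen for one key into a sorted, unique list."""
    return key, sorted(set(record_ids))


def build_inverted_index(records, num_reducers=2):
    # Shuffle: route every mapper output pair to its reducer's bucket.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for record_id, text in records:
        for key, value in map_record(record_id, text):
            buckets[partition(key, num_reducers)][key].append(value)
    # Each bucket stands in for one reducer and produces one "part file".
    return [
        dict(reduce_group(k, v) for k, v in sorted(bucket.items()))
        for bucket in buckets
    ]


# Hypothetical sample data: (record id, text) pairs.
records = [(1, "hadoop map reduce"), (2, "hadoop index"), (3, "inverted index")]
parts = build_inverted_index(records)
```

Because the partitioner sends every occurrence of a key to the same bucket, each key appears in exactly one part, and merging the parts gives the complete index.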
Consequences |
The output of the job will be a set of part files containing a single record per reducer input group. Each record will consist of the key and all aggregate values. |
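As a rough illustration of that output shape, the sketch below renders one line per reducer input group. The tab between key and value mirrors the default separator of Hadoop's text output; joining the record IDs with spaces and the helper names `format_record` and `write_part` are assumptions for this example.

```python
def format_record(key, values, sep="\t", value_sep=" "):
    """Render one reducer input group as a single output line."""
    return key + sep + value_sep.join(str(v) for v in values)


def write_part(groups):
    """Return the lines of one part file: one line per key, in key order."""
    return [format_record(key, groups[key]) for key in sorted(groups)]


# Hypothetical reducer output: index key -> record ids.
part = write_part({"index": [2, 3], "hadoop": [1, 2]})
```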
Known uses |
|
Resemblances |
|
Performance analysis |
The performance of building an inverted index depends mostly on the computational cost of parsing the content in the mapper, the cardinality of the index keys, and the number of content identifiers per key. |
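The two data-dependent factors, key cardinality and identifiers per key, can be measured on a sample before running the full job. The helper below is a hypothetical sketch (the name `index_stats` and the returned field names are assumptions); a large gap between the maximum and the mean number of IDs per key suggests hot keys that may overload a single reducer.

```python
def index_stats(index):
    """Summarize the data-dependent cost drivers of an inverted index."""
    sizes = [len(ids) for ids in index.values()]
    return {
        "key_cardinality": len(sizes),          # number of distinct index keys
        "total_postings": sum(sizes),           # total (key, id) pairs stored
        "max_ids_per_key": max(sizes, default=0),
        "mean_ids_per_key": sum(sizes) / len(sizes) if sizes else 0.0,
    }


# Hypothetical sample index: "the" is a hot key compared with the others.
stats = index_stats({"the": [1, 2, 3, 4], "hadoop": [1, 2], "pig": [4]})
```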
Inverted Index Examples |
Wikipedia reference inverted index |
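A Wikipedia reference index of this kind might look like the following sketch, which maps each Wikipedia URL to the IDs of the records that link to it. The input shape (pairs of ID and text body), the URL pattern, and the names `map_post` and `build_index` are assumptions for illustration, not details taken from the original example.

```python
import re
from collections import defaultdict

# Assumed URL shape; real data may need a more permissive pattern.
WIKI_URL = re.compile(
    r"https?://[a-z]+\.wikipedia\.org/wiki/[^\s\"'<>]+", re.IGNORECASE
)


def map_post(post_id, body):
    """Mapper: emit (wikipedia url, post id) for each distinct link in the body."""
    for url in set(WIKI_URL.findall(body)):
        yield url, post_id


def build_index(posts):
    """Group post ids by URL, standing in for the shuffle and reduce steps."""
    index = defaultdict(set)
    for post_id, body in posts:
        for url, pid in map_post(post_id, body):
            index[url].add(pid)
    return {url: sorted(ids) for url, ids in index.items()}


# Hypothetical sample posts.
posts = [
    (10, "See http://en.wikipedia.org/wiki/MapReduce for background."),
    (11, "No links here."),
    (12, "Also http://en.wikipedia.org/wiki/MapReduce and "
         "http://en.wikipedia.org/wiki/Inverted_index"),
]
index = build_index(posts)
```

Records without a Wikipedia link emit nothing from the mapper, so they never reach the index.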