2016-10-20 16:55:11
I have been playing with ElasticSearch for a while now, both at work and on my own time. In recent discussions I came across a use case for faceted search and figured it would be a good topic for a blog post. Let's explore by example how to implement faceted searches using both the older facet module and the newer aggregations module.
First things first: what is a faceted search? If you go to Amazon and search for a product such as bluetooth, you are presented with a paginated set of results. The left navigation pane lists many “facets”, or categories, each with a count of how many products are available in that category. You can click on any of those categories and dive even deeper. As you click through these links, the categories themselves change based on the features of the current set of results. This type of drill-down filtering, starting from an initial search query, is what is termed faceted searching, and it can be a powerful tool for engaging your users.
Please ensure you have ElasticSearch installed and running. You will also need the command line utility curl or a similar tool to make HTTP calls. If you prefer, you can also run these queries (minus the curl) in the ElasticSearch browser plugin of your choice. To keep it simple I use curl.
Download the zip TestData, which contains a shell script to load the test data into the books index. Each document represents a book with a title and a category. The category can be finance, programming or fiction. The user searches for the word ‘evil’, which returns the matching results as well as the facets contained in them. Next the user clicks on the category ‘fiction’ and expects to see just those results and a set of facets. More on the last part in a bit. So let's dive straight into an example.
The data set from the loaddata.sh script is…
#!/bin/bash
curl -XPOST localhost:9200/books/book -d '{"title":"the reasons for the crash", "category":"finance"}'
curl -XPOST localhost:9200/books/book -d '{"title":"Personal Finance Success", "category":"finance"}'
curl -XPOST localhost:9200/books/book -d '{"title":"Asia Rising", "category":"finance"}'
curl -XPOST localhost:9200/books/book -d '{"title":"Evil Speculators", "category":"finance"}'
curl -XPOST localhost:9200/books/book -d '{"title":"Evil World of Hedge Funds", "category":"finance"}'
curl -XPOST localhost:9200/books/book -d '{"title":"The Evil Twin", "category":"fiction"}'
curl -XPOST localhost:9200/books/book -d '{"title":"The Evil Empire", "category":"fiction"}'
curl -XPOST localhost:9200/books/book -d '{"title":"Learn Java In 21 Days", "category":"programming"}'
curl -XPOST localhost:9200/books/book -d '{"title":"Learn Scala In 21 Days", "category":"programming"}'
curl -XPOST localhost:9200/books/book -d '{"title":"Learn Ruby In 2 Days", "category":"programming"}'
curl -XPOST localhost:9200/books/book -d '{"title":"The evil world of programming", "category":"programming"}'
curl -XPOST localhost:9200/books/book -d '{"title":"Enterprise Integration Patterns", "category":"programming"}'
curl -XPOST localhost:9200/books/book -d '{"title":"Coding in Asia", "category":"programming"}'
Let's perform a simple search that returns all documents in ElasticSearch containing the word ‘evil’. Let's also request a facet on the results, specifically on the category field. Later on we will use the newer aggregations module.
curl localhost:9200/books/book/_search?pretty -d '
{
  "query": {
    "query_string": { "query": "evil" }
  },
  "facets": {
    "format": {
      "terms": { "field": "category" }
    }
  }
}'
This returns 5 documents, as expected from the test data. I have edited out the document hits and preserved just the facet results. Now look carefully at the facet response…
"terms": [
  { "term": "finance", "count": 2 },
  { "term": "fiction", "count": 2 },
  { "term": "programming", "count": 1 }
]
This means that the query results contain documents that fall into the three categories above. The count tells us how many documents in each category meet the search criteria.
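To make the counting concrete, here is a minimal in-memory sketch in Python of what a terms facet computes. It does not call ElasticSearch. The `BOOKS` list mirrors the test data, while `matches` and `terms_facet` are hypothetical helpers that only approximate full-text matching with a lowercase whitespace split.

```python
from collections import Counter

# In-memory copy of the test data loaded by loaddata.sh: (title, category).
BOOKS = [
    ("the reasons for the crash", "finance"),
    ("Personal Finance Success", "finance"),
    ("Asia Rising", "finance"),
    ("Evil Speculators", "finance"),
    ("Evil World of Hedge Funds", "finance"),
    ("The Evil Twin", "fiction"),
    ("The Evil Empire", "fiction"),
    ("Learn Java In 21 Days", "programming"),
    ("Learn Scala In 21 Days", "programming"),
    ("Learn Ruby In 2 Days", "programming"),
    ("The evil world of programming", "programming"),
    ("Enterprise Integration Patterns", "programming"),
    ("Coding in Asia", "programming"),
]


def matches(title, word):
    """Rough stand-in for a full-text match: case-insensitive whole-word test."""
    return word.lower() in title.lower().split()


def terms_facet(books, word):
    """Count the matching books per category, like a terms facet on `category`."""
    return Counter(category for title, category in books if matches(title, word))


facet = terms_facet(BOOKS, "evil")
for term, count in facet.most_common():
    print(term, count)
```

The per-category counts it prints should line up with the facet response above: the facet is simply a group-by over the documents that matched the query.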
The next thing the user would do in a faceted search is filter the results further by one of the categories above. Let's pick fiction. The next query to ElasticSearch can take one of two forms depending on your needs. The simplest is to modify our earlier query by adding a filter condition that returns only books with a category of fiction.
Query:
curl localhost:9200/books/book/_search?pretty -d '
{
  "query": {
    "query_string": { "query": "evil" }
  },
  "filter": {
    "term": { "category": "fiction" }
  },
  "facets": {
    "category_facet": {
      "terms": { "field": "category" }
    }
  }
}'
Response:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.8465736,
    "hits": [
      {
        "_index": "books",
        "_type": "book",
        "_id": "CRkjqtn_TZmvhRyxm5Du5A",
        "_score": 0.8465736,
        "_source": { "title": "The Evil Twin", "category": "fiction" }
      },
      {
        "_index": "books",
        "_type": "book",
        "_id": "w_KBINHWRJS8v9nQHj1E3g",
        "_score": 0.8465736,
        "_source": { "title": "The Evil Empire", "category": "fiction" }
      }
    ]
  },
  "facets": {
    "category_facet": {
      "_type": "terms",
      "missing": 1,
      "total": 5,
      "other": 0,
      "terms": [
        { "term": "finance", "count": 2 },
        { "term": "fiction", "count": 2 },
        { "term": "programming", "count": 1 }
      ]
    }
  }
}
As expected, the query with its additional filter returns just the fiction books. But take a look at the facet results: even though we got only two fiction documents, they still show all the other matching term facets (finance and programming). That is because a top-level filter narrows the hits but is not applied to the facets, which are computed from the results of the query alone. If your use case needs to show all the facets regardless of how deep the user has clicked in, this query format is fine. But if you have a more complex requirement with multiple facets and want the UI to display only the facets relevant to the current search results, you need a slightly different version of the query that uses a filtered query. Confusing, I know, but read on. A filtered query lets you create a master query composed of both a query and a filter, and the facets are then computed over its combined results. See the query below…
curl -XGET "" -d'
{
  "query": {
    "filtered": {
      "query": {
        "query_string": { "query": "evil" }
      },
      "filter": {
        "term": { "category": "fiction" }
      }
    }
  },
  "facets": {
    "category_facet": {
      "terms": { "field": "category" }
    }
  }
}'
Response:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.8465736,
    "hits": [
      {
        "_index": "books",
        "_type": "book",
        "_id": "CRkjqtn_TZmvhRyxm5Du5A",
        "_score": 0.8465736,
        "_source": { "title": "The Evil Twin", "category": "fiction" }
      },
      {
        "_index": "books",
        "_type": "book",
        "_id": "w_KBINHWRJS8v9nQHj1E3g",
        "_score": 0.8465736,
        "_source": { "title": "The Evil Empire", "category": "fiction" }
      }
    ]
  },
  "facets": {
    "category_facet": {
      "_type": "terms",
      "missing": 0,
      "total": 2,
      "other": 0,
      "terms": [
        { "term": "fiction", "count": 2 }
      ]
    }
  }
}
The hits are the same, but this time you see only the facets relevant to the returned documents, not facets for categories that were filtered out.
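The difference between the two query forms can be sketched in memory. This is only an illustration under stated assumptions, not how ElasticSearch is implemented: `BOOKS` is trimmed to the five test documents that match ‘evil’, and `search_with_top_level_filter` and `search_with_filtered_query` are hypothetical names chosen for this sketch.

```python
from collections import Counter

# The five test documents whose title contains 'evil': (title, category).
BOOKS = [
    ("Evil Speculators", "finance"),
    ("Evil World of Hedge Funds", "finance"),
    ("The Evil Twin", "fiction"),
    ("The Evil Empire", "fiction"),
    ("The evil world of programming", "programming"),
]


def matches(title, word):
    """Rough stand-in for a full-text match: case-insensitive whole-word test."""
    return word.lower() in title.lower().split()


def search_with_top_level_filter(books, word, category):
    """Facets are counted on the query results; the filter only trims the hits."""
    query_hits = [b for b in books if matches(b[0], word)]
    facets = Counter(cat for _, cat in query_hits)
    hits = [b for b in query_hits if b[1] == category]
    return hits, facets


def search_with_filtered_query(books, word, category):
    """Query and filter are combined first; facets are counted on what is left."""
    hits = [b for b in books if matches(b[0], word) and b[1] == category]
    facets = Counter(cat for _, cat in hits)
    return hits, facets


top_hits, top_facets = search_with_top_level_filter(BOOKS, "evil", "fiction")
flt_hits, flt_facets = search_with_filtered_query(BOOKS, "evil", "fiction")
print(dict(top_facets))
print(dict(flt_facets))
```

Both functions return the same two fiction hits. Only the facet counts differ, which mirrors the two responses shown earlier.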
Finally, this example would not be complete without showing the aggregations module introduced in ElasticSearch 1.x, which is recommended over facets. Facets could be deprecated at some point in the future.
Query:
curl -XGET "" -d'
{
  "query": {
    "filtered": {
      "query": {
        "query_string": { "query": "evil" }
      },
      "filter": {
        "term": { "category": "fiction" }
      }
    }
  },
  "aggregations": {
    "category_aggs": {
      "terms": { "field": "category" }
    }
  }
}'
Response:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.8465736,
    "hits": [
      {
        "_index": "books",
        "_type": "book",
        "_id": "CRkjqtn_TZmvhRyxm5Du5A",
        "_score": 0.8465736,
        "_source": { "title": "The Evil Twin", "category": "fiction" }
      },
      {
        "_index": "books",
        "_type": "book",
        "_id": "w_KBINHWRJS8v9nQHj1E3g",
        "_score": 0.8465736,
        "_source": { "title": "The Evil Empire", "category": "fiction" }
      }
    ]
  },
  "aggregations": {
    "category_aggs": {
      "buckets": [
        { "key": "fiction", "doc_count": 2 }
      ]
    }
  }
}
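On the client side, the aggregation buckets usually need to be turned into something a UI can render. The following is a minimal Python sketch, assuming the response shape shown above. `RESPONSE` is a trimmed copy of that response with the hits omitted, and `bucket_counts` is a hypothetical helper, not part of any ElasticSearch client library.

```python
import json

# Trimmed copy of the aggregations response (document hits omitted).
RESPONSE = """
{
  "hits": { "total": 2 },
  "aggregations": {
    "category_aggs": {
      "buckets": [
        { "key": "fiction", "doc_count": 2 }
      ]
    }
  }
}
"""


def bucket_counts(response, agg_name):
    """Return {key: doc_count} for the named terms aggregation."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


counts = bucket_counts(json.loads(RESPONSE), "category_aggs")
for category, doc_count in counts.items():
    # One line per facet link, e.g. for a navigation pane.
    print(f"{category} ({doc_count})")
```

Note that the field names differ from the facet response: aggregations report `key` and `doc_count` inside `buckets`, where the older facets reported `term` and `count` inside `terms`.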