Category: NOSQL

2016-10-01 09:12:19

Original article: http://substantial.com/blog/2013/01/16/building-faceted-search-with-elasticsearch

What is Faceted Search?

Faceted search allows users to explore a collection of information by successively applying filters in whatever order they choose. Examples of faceted search include the search interfaces on LinkedIn, many e-commerce sites such as Newegg and Amazon, and clients of ours, including the recently launched Artifex Press.

A while back, we built a custom faceted search engine running on MongoDB for one of our projects. It was pretty cool, but it relied on MongoDB map-reduce operations which turned out to be too slow once more than a few people were using the system at the same time. We needed to replace our custom engine with something faster. At the same time, we also wanted to improve our application’s textual searching capabilities - we wanted things like stemming and relevance-based sorting to allow us to promote certain types of results over others.

Enter ElasticSearch

ElasticSearch is a distributed open source search server based on Apache Lucene. It provides a RESTful JSON interface for queries, which makes it suitable for use with almost any programming language. It supports relevance scoring, various analyzers, stemming, you name it. It supports both faceting and percolation, the former of which we’re going to look at here.

We chose ElasticSearch over the more familiar Solr for a few reasons. We liked its REST interface, and we liked not having to manage XML files. ElasticSearch is self-contained, so we didn’t need to mess with Tomcat or any other servlet containers.

The final thing that pushed us towards ElasticSearch was the availability of Tire, a great gem which significantly simplified integrating ElasticSearch with our Mongoid-based application models. Tire offers some of the same functionality as the Solr-focused gems, but Tire was more attractive to us since its Mongoid integration is actively maintained. In contrast, a Mongoid integration for Solr exists, but is not maintained.

Some specific definitions which are important as we get into this:

Facet: A group of available filters which can be applied to a set of search results. Facets in a music search engine might include “Format” and “Genre”.

Filter: An option within a facet. For a music site, a “Format” facet might include “MP3”, “FLAC”, and “Vinyl” filters. Filters often include a number to display how many search results would be available if that filter were applied, such as “Vinyl (160)”.
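
To make the two definitions concrete, here is a minimal sketch in plain Python (no ElasticSearch involved) that derives a “Format” facet and its filter counts from a small in-memory catalog. The catalog contents, `facet_counts`, and `render_facet` are hypothetical names made up for illustration.

```python
from collections import Counter

# Hypothetical catalog; titles and formats are illustrative only.
ITEMS = [
    {"title": "Vampire Disco", "format": "MP3"},
    {"title": "Vampire Disco", "format": "FLAC"},
    {"title": "Vampire Disco", "format": "Vinyl"},
    {"title": "Disco Vampires Live", "format": "MP3"},
]


def facet_counts(items, field):
    """Count how many items carry each value of `field` (one filter per value)."""
    return Counter(item[field] for item in items)


def render_facet(name, counts):
    """Render a facet as a heading followed by 'Filter (count)' lines."""
    lines = [name]
    for value, count in counts.most_common():
        lines.append(f" - {value} ({count})")
    return "\n".join(lines)


print(render_facet("Format", facet_counts(ITEMS, "format")))
```

The facet is the whole “Format” group; each line under it is one filter with its count.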

Designing Faceted Search

Faceted search design varies wildly across applications. In some faceted searches, the available facets and filter counts do not update as filters are applied and removed. Instead, facets always show the full set of filters generated from the initial search results. When the user applies multiple filters, they build up a boolean OR query, with a new OR added for each filter. LinkedIn’s faceted search implementation works this way.

Other faceted searches are adaptive, meaning the facets are regenerated each time a filter is applied. Generally speaking, adaptive faceted searches are more computationally intensive, and may be more difficult to implement. But for some data sets and search user experiences, they may offer more effective guidance for your users, and thus more successful searches. Newegg’s “Narrow Results” is an example of adaptive faceted search.

For further reading on the design considerations that go into faceted search, I suggest:

OK, Let’s Build Some Queries

For these examples, let’s say we’re building a faceted search for a music retailer. This retailer sells music in various formats, including FLAC, MP3, and Vinyl. They want users to be able to filter search results by format.

A fairly basic ElasticSearch query object incorporating a “format” facet might look like this:

{
  "query": {
    "query_string": {
       "query": "vampire disco"
     }
  },
  "facets": {
    "format": {
      "terms": {
        "field": "format"
      }
    }
  }
}

When we run this query against our database of music for sale, we end up with a Format facet we’d display roughly like this:

Format
 - MP3 (3)
 - FLAC (2)
 - Vinyl (1)
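
As a rough sketch of how application code might build that query object and turn the facet part of a response into the display above, consider the Python below. The helper names are hypothetical, and `SAMPLE_RESPONSE` is a hand-written stand-in that assumes a terms facet comes back as a list of `term`/`count` entries; check the response format of your ElasticSearch version before relying on it.

```python
import json


def build_facet_query(text, facet_field):
    """Build the basic query object: a query_string query plus one terms facet."""
    return {
        "query": {"query_string": {"query": text}},
        "facets": {facet_field: {"terms": {"field": facet_field}}},
    }


# Assumed response shape (hand-written, not captured from a real server).
SAMPLE_RESPONSE = {
    "facets": {
        "format": {
            "terms": [
                {"term": "MP3", "count": 3},
                {"term": "FLAC", "count": 2},
                {"term": "Vinyl", "count": 1},
            ]
        }
    }
}


def facet_lines(response, facet_name):
    """Turn one facet of the response into 'Filter (count)' display lines."""
    entries = response["facets"][facet_name]["terms"]
    return [f" - {entry['term']} ({entry['count']})" for entry in entries]


print(json.dumps(build_facet_query("vampire disco", "format"), indent=2))
print("Format")
print("\n".join(facet_lines(SAMPLE_RESPONSE, "format")))
```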

The interesting part starts when the user chooses to apply filters. We have two basic options for how we can apply a filter to our query, which will create two different search experiences:

  1. Run the same query, and apply the filter later, creating what I’ll term a “standard” faceted search.
  2. Modify the base query to incorporate the filter, creating an “adaptive” faceted search.

Option 1: “Standard” faceted search

Starting with option #1, if we apply a filter to the same base query, our new query object will look like this:

{
  "query" : {
    "query_string" : {
       "query" : "vampire disco"
     }
  },
  "filter" : {
    "term" : {
      "format" : "FLAC"
     }
  },
  "facets" : {
    "format" : {
      "terms" : {
        "field" : "format"
      }
    }
  }
}

This new query object filters our result set to only contain items which have the format property “FLAC”. When we use a filter this way, ElasticSearch still calculates facets based on our original, unfiltered query. So, our resulting “Format” facet will look exactly the same as it did before we added the filter property!

This could be desirable or undesirable behavior depending on how our search interface is designed, and what kind of user experience we’re trying to create.
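
The behavior is easy to picture with a small in-memory simulation. This is only a sketch of the semantics described above, not how ElasticSearch works internally; `MATCHES` and `standard_search` are hypothetical.

```python
from collections import Counter

# Hypothetical items that already match the "vampire disco" query.
MATCHES = [
    {"id": 1, "format": "MP3"},
    {"id": 2, "format": "MP3"},
    {"id": 3, "format": "MP3"},
    {"id": 4, "format": "FLAC"},
    {"id": 5, "format": "FLAC"},
    {"id": 6, "format": "Vinyl"},
]


def standard_search(matches, selected_format=None):
    """Simulate a "standard" faceted search.

    Facets are counted over everything the query matched; the filter only
    narrows the hits that are returned.
    """
    facets = Counter(item["format"] for item in matches)
    hits = [
        item for item in matches
        if selected_format is None or item["format"] == selected_format
    ]
    return hits, facets


hits, facets = standard_search(MATCHES, selected_format="FLAC")
print(len(hits), dict(facets))
```

The hit list shrinks to the FLAC items while the facet counts stay as they were before the filter.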

To continue this example, let’s say our search interaction is based on checkboxes, and allows multiple filters to be applied simultaneously. Our user next clicks “MP3” in the “Format” facet, and now has both the “MP3” and “FLAC” filters selected. We apply both of those in our query, so our new query looks something like this:

{
  "query" : {
    "query_string" : {
       "query" : "vampire disco"
     }
  },
  "filter" : {
    "terms" : {
      "format" : ["FLAC", "MP3"],
      "execution" : "and"
     }
  },
  "facets" : {
    "format" : {
      "terms" : {
        "field" : "format"
      }
    }
  }
}

If our search metadata allows multiple entries in an item’s format field, this might work fine. If we have a single “vampire disco” item with a format field like FLAC, MP3, then the item will still appear in search results after we apply this new filter.

However, what happens if our search data is set up such that the format field never contains more than one value? In that case, the “MP3” and “FLAC” versions of “Vampire Disco” will exist in the database as separate items, and neither item will be in our search results anymore since neither of them meets our condition of FLAC and MP3. We’ve now allowed the user to drive their search to a state where it’s producing zero results. This is bad, and we should feel bad about it.

One option to fix this is to change the execution strategy for the filter to or rather than and. Doing so would lead us toward a search experience like LinkedIn’s, where each filter applied after the first one widens the result set. Another option might be to change our search interactions to use radio buttons or a drop down instead of checkboxes, so that the user could never select more than one filter at a time.
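
The difference between the two execution strategies can be sketched in plain Python. This only imitates the and/or semantics described above on an in-memory list; `apply_terms_filter` and both sample catalogs are hypothetical.

```python
def apply_terms_filter(items, field, values, execution="or"):
    """Keep items whose `field` matches the selected values.

    execution="and": the item must carry every selected value.
    execution="or":  the item must carry at least one selected value.
    """
    wanted = set(values)

    def matches(item):
        raw = item[field]
        # Treat single-valued and multi-valued fields uniformly.
        present = set(raw) if isinstance(raw, (list, tuple, set)) else {raw}
        if execution == "and":
            return wanted <= present
        return bool(wanted & present)

    return [item for item in items if matches(item)]


# One item per format (single-valued field).
SINGLE_VALUED = [
    {"title": "Vampire Disco", "format": "FLAC"},
    {"title": "Vampire Disco", "format": "MP3"},
]
# One item listing both formats (multi-valued field).
MULTI_VALUED = [{"title": "Vampire Disco", "format": ["FLAC", "MP3"]}]

selected = ["FLAC", "MP3"]
print(len(apply_terms_filter(SINGLE_VALUED, "format", selected, "and")))
print(len(apply_terms_filter(SINGLE_VALUED, "format", selected, "or")))
print(len(apply_terms_filter(MULTI_VALUED, "format", selected, "and")))
```

With single-valued data, "and" leaves nothing while "or" keeps both items; with the multi-valued item, "and" still matches.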

The final option we’ll investigate is to make our facets adaptive, and hide irrelevant facets each time the user applies a filter.

Option 2: “Adaptive” faceted search

Going back to our original query:

{
  "query" : {
    "query_string" : {
       "query" : "vampire disco"
     }
  },
  "facets" : {
    "format" : {
      "terms" : {
        "field" : "format"
      }
    }
  }
}

This time, when our user applies the “FLAC” filter, we’re going to modify our original query. We have two options for modification: either shift to a more complicated bool query, like so:

{
  "query" : {
    "bool" : {
      "must" : [
        {
          "query_string" : {
            "query" : "vampire disco"
          }
        },
        {
          "term" : {
            "format" : "FLAC"
          }
        }
      ]
    }
  },
  "facets" : {
    "format" : {
      "terms" : {
        "field" : "format"
      }
    }
  }
}

Or, we can use a filtered query. ElasticSearch can be a little confusing with its vocabulary here - our very first example involved a query filter, whereas the example below details a filtered query.

{
  "query" : {
    "filtered" : {
      "query" : {
        "query_string" : {
          "query" : "vampire disco"
        }
      },
      "filter" : {
        "term" : {
          "format" : "FLAC"
        }
      }
    }
  },
  "facets" : {
    "format" : {
      "terms" : {
        "field" : "format"
      }
    }
  }
}

Both approaches will cause our facets to be regenerated each time we apply a filter.

Assuming again that we have format as a single-value field in our search data, either of these approaches will prevent the user seeing the “MP3” option after they apply the “FLAC” filter. Preventing the user from selecting mutually exclusive options makes it less likely that they end up with zero search results.
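
Here is a small in-memory sketch of that adaptive behavior, again imitating only the semantics rather than ElasticSearch itself. `MATCHES` and `adaptive_search` are hypothetical names.

```python
from collections import Counter

# Hypothetical items that already match the "vampire disco" query;
# "format" is a single-valued field.
MATCHES = [
    {"id": 1, "format": "MP3"},
    {"id": 2, "format": "MP3"},
    {"id": 3, "format": "MP3"},
    {"id": 4, "format": "FLAC"},
    {"id": 5, "format": "FLAC"},
    {"id": 6, "format": "Vinyl"},
]


def adaptive_search(matches, selected_format=None):
    """Simulate an "adaptive" faceted search.

    The filter is part of the query, so facets are counted over the
    already-filtered hits.
    """
    hits = [
        item for item in matches
        if selected_format is None or item["format"] == selected_format
    ]
    facets = Counter(item["format"] for item in hits)
    return hits, facets


hits, facets = adaptive_search(MATCHES, selected_format="FLAC")
print(len(hits), dict(facets))
```

Once “FLAC” is applied, the regenerated facet contains only “FLAC”, so “MP3” is no longer offered.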

The main difference in these two adaptive approaches is in their effect on result relevance scoring. The first option, where we added an extra condition to create a bool query, will cause any relevance scoring we do on the search results to be recalculated. The latter option, creating a filtered query, does not affect relevance scoring, and so may be significantly faster.

If you’re not relying on ElasticSearch’s relevance scoring to order your search results, the last option is probably what you want, as it will produce the same output as the bool query, just faster.
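
In application code, the choice between the two adaptive forms could be wrapped in one helper. The sketch below only assembles the query objects shown earlier; `adaptive_query` and its `include_filter_in_scoring` flag are hypothetical names, not part of any ElasticSearch client.

```python
import json


def adaptive_query(text, field, value, include_filter_in_scoring=False):
    """Build an adaptive faceted query for one selected filter.

    include_filter_in_scoring=True  -> bool query (filter term is scored too)
    include_filter_in_scoring=False -> filtered query (filter is not scored)
    """
    text_query = {"query_string": {"query": text}}
    term = {"term": {field: value}}
    if include_filter_in_scoring:
        query = {"bool": {"must": [text_query, term]}}
    else:
        query = {"filtered": {"query": text_query, "filter": term}}
    return {
        "query": query,
        "facets": {field: {"terms": {"field": field}}},
    }


print(json.dumps(adaptive_query("vampire disco", "format", "FLAC"), indent=2))
```

Defaulting to the filtered form follows the recommendation above for searches that do not need the filter to influence scoring.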

Conclusions

To take advantage of ElasticSearch’s flexible query DSL, it’s important to understand not only the structure and relationships of your search data, but also the design of your search interface interactions, and the thinking that goes into the search user experience. Search interaction design and user experience decisions have a direct impact on how your ElasticSearch queries need to be structured, which in turn has an impact on how much computing power you’ll need to run ElasticSearch, and on how you’ll structure your application code.

As a final note, we recommend the . ElasticSearch worked well for us out of the box, but there are a number of parameters that need to be tweaked before you can successfully deploy it. In particular, you’ll almost certainly need to increase the number of files your ElasticSearch user is allowed to have open simultaneously - we hit the default limit well before we went to production.
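
As a quick check of the open-file limit, the sketch below reads the limit for the current process using Python’s standard `resource` module (Unix only). It reports the limit of whichever user runs the script, so it would need to be run as the ElasticSearch user to be meaningful; the threshold passed in is an arbitrary illustrative value, not a recommendation.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


def limit_is_at_least(minimum):
    """True if the soft open-file limit is unlimited or at least `minimum`."""
    soft, _ = open_file_limits()
    return soft == resource.RLIM_INFINITY or soft >= minimum


soft, hard = open_file_limits()
print(f"open files: soft={soft} hard={hard}")
# 32000 is an arbitrary illustrative threshold.
print("enough for the illustrative threshold:", limit_is_at_least(32000))
```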
