lucene-dev mailing list archives

From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3079) Faceting module
Date Thu, 23 Jun 2011 13:37:47 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053847#comment-13053847
] 

Shai Erera commented on LUCENE-3079:
------------------------------------

I would like to contribute IBM's faceted search package (I've been wanting to do that for quite
a while). The package supports the following features/capabilities (at a high level):

* Taxonomy index -- manages trees of 'categories'. You can view examples of taxonomies at,
e.g., the Open Directory Project.
** It's a Lucene index managed alongside the content index.
** Builds the taxonomy on-the-fly (i.e. as categories are discovered).
** In general it maps a category hierarchy to ordinals (integers). For example, the category
/date/2011/06/24 will create the following entries in the taxonomy index:
*** /date, ordinal=1
*** /date/2011, ordinal=2
*** /date/2011/06, ordinal=3
*** /date/2011/06/24, ordinal=4
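To make the on-the-fly mapping concrete, here is a minimal self-contained sketch in plain Java. The class name and fields are hypothetical -- this is not the package's actual taxonomy-index API, just an illustration of assigning an ordinal to every path prefix as categories are added:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TaxonomySketch {
    final Map<String, Integer> ordinals = new LinkedHashMap<>();
    private int nextOrdinal = 1;

    /** Returns the ordinal of the full path, adding every missing prefix on the fly. */
    public int addCategory(String path) {
        StringBuilder prefix = new StringBuilder();
        int ord = -1;
        for (String component : path.substring(1).split("/")) {
            prefix.append('/').append(component);
            ord = ordinals.computeIfAbsent(prefix.toString(), p -> nextOrdinal++);
        }
        return ord;
    }

    public static void main(String[] args) {
        TaxonomySketch taxo = new TaxonomySketch();
        taxo.addCategory("/date/2011/06/24");
        // Prints exactly the four entries from the example above.
        taxo.ordinals.forEach((path, ord) -> System.out.println(path + ", ordinal=" + ord));
    }
}
```

The real taxonomy index of course persists this mapping in a Lucene index rather than an in-memory map, but the prefix-by-prefix assignment is the same idea.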

* FacetsDocumentBuilder receives a list of categories that are associated w/ the document
(they can span several dimensions) and:
** Fetches the ordinals of the category components from the taxonomy index (adding them to
it on-the-fly).
** Indexes them in a (compressed) payload for the document (so for the above category example,
4 ordinals will be indexed in the document's payload).
** FDB can be used to augment a Document with other fields for indexing (it adds its own Field
objects).
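The exact payload compression is internal to the package, but a typical scheme for a sorted ordinal list is delta encoding plus variable-length ints; the following is a hypothetical sketch of that scheme, not the package's actual encoder:

```java
import java.io.ByteArrayOutputStream;

public class OrdinalPayload {
    /** Delta-encodes a sorted ordinal list, writing each delta as a VInt (7 bits per byte). */
    public static byte[] encode(int[] sortedOrdinals) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sortedOrdinals) {
            int delta = ord - prev;
            prev = ord;
            while ((delta & ~0x7F) != 0) {        // more than 7 bits remain
                out.write((delta & 0x7F) | 0x80); // low 7 bits, continuation bit set
                delta >>>= 7;
            }
            out.write(delta);                     // last byte, continuation bit clear
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // The four ordinals of the /date/2011/06/24 example compress to four bytes.
        System.out.println(encode(new int[] {1, 2, 3, 4}).length);
    }
}
```

Small deltas dominate when related categories get nearby ordinals, which is what makes this kind of compression effective.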

* FacetsCollector receives a handle to the taxonomy and a list of facet 'roots' to count,
and returns the top-K categories for each requested facet:
** The root can denote any node in the category tree (e.g., 'count all facets under /date/2011')
** Top-K can be returned over the root's immediate children only, or over any nodes in the
root's sub-tree.
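One common way to implement the children-vs-subtree distinction is a parent-ordinal array, which a taxonomy index can hand out cheaply. The sketch below is illustrative only (hypothetical class, with ordinal 0 assumed to be the taxonomy root):

```java
public class RootFilterSketch {
    /** Is ord an immediate child of root? (parents[i] holds the parent ordinal of i) */
    static boolean isChild(int[] parents, int ord, int root) {
        return parents[ord] == root;
    }

    /** Is ord anywhere in root's sub-tree? Walks the parent chain up to ordinal 0. */
    static boolean inSubtree(int[] parents, int ord, int root) {
        for (int cur = ord; ; cur = parents[cur]) {
            if (cur == root) return true;
            if (cur == 0) return false;
        }
    }

    public static void main(String[] args) {
        // parents[] for the /date example: 0 is the root, /date=1, /date/2011=2, ...
        int[] parents = {0, 0, 1, 2, 3};
        System.out.println(isChild(parents, 4, 3));   // /date/2011/06/24 under /date/2011/06
        System.out.println(inSubtree(parents, 4, 1)); // anywhere under /date
    }
}
```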

* Counting algorithm (at a high-level):
** Fetch the payload for every matching document.
** Increment by 1 the count of every ordinal encountered (even for facets that were
not requested by the user).
** After all ordinals are counted, compute the top-K over the ones the user requested.
** Label the result facets
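The counting and top-K steps above can be sketched with a plain count array and a size-bounded min-heap. Again, this is a minimal illustration (hypothetical names), not the package's FacetsAccumulator:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class FacetCountsSketch {
    /** Counting: bump every ordinal that appears in a matching document's payload. */
    static int[] countAll(List<int[]> matchingDocOrdinals, int taxonomySize) {
        int[] counts = new int[taxonomySize];
        for (int[] ords : matchingDocOrdinals) {
            for (int ord : ords) {
                counts[ord]++; // counted even if the user did not request this facet
            }
        }
        return counts;
    }

    /** Top-K among only the requested ordinals, via a size-bounded min-heap. */
    static List<Integer> topK(int[] counts, int[] requestedOrdinals, int k) {
        PriorityQueue<Integer> heap =
            new PriorityQueue<>(Comparator.comparingInt(ord -> counts[ord]));
        for (int ord : requestedOrdinals) {
            heap.offer(ord);
            if (heap.size() > k) heap.poll(); // evict the ordinal with the smallest count
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort((a, b) -> counts[b] - counts[a]); // highest count first
        return result;
    }
}
```

Counting everything and selecting afterwards is what makes the partitioning feature below necessary for huge taxonomies: the count array is sized by the whole taxonomy, not by the request.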

* Miscellaneous features:
** *Sampling* algorithm allows for more efficient facets counting/accumulation, while still
returning exact counts for the top-K facets.
** *Complements* algorithm allows for more efficient facets counting/accumulation, when the
number of results is > 50% of the docs in the index (we keep a total count of facets, count
facets on the docs that did not match the query and subtract).
*** Complements can be used to count facets that do not appear in any of the matching documents
(of this result set). This does not exist in the package though ... yet.
** *Facets partitioning* -- if the taxonomy is huge (i.e. millions of categories), it is better
to partition them at indexing time, so that search time is faster and consumes less memory.
Note that this is required because of the approach of counting all (allocating a count array)
and then keeping only the results of interest.
** *Category enhancements* allow storing 'metadata' with categories in the index, so that
more than simple counting can be implemented:
*** *Weighted facets* (built on top of enhancements) allow associating a weight w/ each category
and using smarter counting techniques at runtime. For example, if facets are generated by an
analytics component, the confidence level can be set as the category's weight. If tags are
indexed as facets (e.g. for generating a tag cloud for the result set), the number of times
the document was tagged with the tag can be set as the tag's weight.
** The fact that facets are indexed in the payloads of documents allows managing very large
taxonomies and indexes without blowing up RAM at runtime (at the cost of some search performance).
However, the payloads can be loaded into RAM (as in FieldCache), in which case runtime becomes
much faster.
*** How facets are stored is abstracted by a CountingList API, so we can definitely
explore other means of storing the facet ordinals. In fact, the CountingList API allows us
to read the ordinals from disk or RAM w/o affecting the rest of the algorithm at all.
** I did not want to dive too deep into the API here, but the runtime API is very extensible
and allows one to use FacetsCollector for the simple cases, or lower-level API to get more
control over the process. You can look at FacetRequest, FacetSearchParams, FacetResult, FacetResultNode,
FacetsCollector, FacetsAccumulator and FacetsAggregator for the more extensive API.
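As a side note, the complements trick described above reduces to simple per-ordinal arithmetic once the total counts have been precomputed. A minimal illustration (hypothetical names, not the package's API):

```java
public class ComplementsSketch {
    /**
     * When a query matches more than half the index, it is cheaper to count the
     * documents that did NOT match and recover the matched counts by subtracting
     * from the precomputed per-ordinal totals.
     */
    static int[] countViaComplements(int[] totalCounts, int[] nonMatchingCounts) {
        int[] matchedCounts = new int[totalCounts.length];
        for (int ord = 0; ord < totalCounts.length; ord++) {
            matchedCounts[ord] = totalCounts[ord] - nonMatchingCounts[ord];
        }
        return matchedCounts;
    }

    public static void main(String[] args) {
        int[] matched = countViaComplements(new int[] {10, 5}, new int[] {3, 1});
        System.out.println(matched[0] + " " + matched[1]); // 7 4
    }
}
```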

* The package comes with example code which shows how to use the different features I've mentioned.
There are also unit tests for ensuring the example code works :).

* The package comes with a very extensive test suite and has been in use by many of our products
for a long time, so I can state that it's very stable.

* Some rough performance numbers:
** Collection of 1M documents, with a few hierarchical facets per document (we call it the 'Amazon'
case).
** Query matches 49.8% of the docs (~500K)
** No sampling / complements were used.
** Facets read from disk
** Takes 0.4 seconds to execute the query and count facets.

It will take me a few days to prepare a patch (lots of code) - will upload it as soon as it's
ready.

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

