lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Toke Eskildsen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-5894) Speed up high-cardinality facets with sparse counters
Date Thu, 21 Aug 2014 18:21:13 GMT

    [ https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105715#comment-14105715
] 

Toke Eskildsen commented on SOLR-5894:
--------------------------------------

Forked Lucene/Solr on GitHub for better project management: https://github.com/tokee/lucene-solr
. Experimental branches for sparse faceting are lucene_solr_4_8_sparse and lucene_solr_4_9_sparse
and will contain the latest code. The patch above for 4.7.1 is a little behind the GitHub
branches, as it does not have speed up for second phase (getting counts for specific terms)
of distributed faceting.

> Speed up high-cardinality facets with sparse counters
> -----------------------------------------------------
>
>                 Key: SOLR-5894
>                 URL: https://issues.apache.org/jira/browse/SOLR-5894
>             Project: Solr
>          Issue Type: Improvement
>          Components: SearchComponents - other
>    Affects Versions: 4.7.1
>            Reporter: Toke Eskildsen
>            Priority: Minor
>              Labels: faceted-search, faceting, memory, performance
>         Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch,
SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894_test.zip,
SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, author_7M_tags_1852_logged_queries_warmed.png,
sparse_2000000docs_fc_cutoff_20140403-145412.png, sparse_5000000docs_20140331-151918_multi.png,
sparse_5000000docs_20140331-151918_single.png, sparse_50510000docs_20140328-152807.png
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in facets and
extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the number of
hits and the number of references from hits to tags. This phase is not the focus here.
> The extraction time scales with the number of unique tags in the search result, but is
also heavily influenced by the total number of unique tags in the facet as every counter,
0 or not, is visited by the extractor (at least for count order). For fields with millions
of unique tag values this means 10s of milliseconds added to the minimum response time (see
https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/ for a test
on a corpus with 7M unique values in the facet).
> The extractor needs to visit every counter due to the current counter structure being
a plain int-array of size #unique_tags. Switching to a sparse structure, where only the tag
counters > 0 are visited, makes the extraction time linear to the number of unique tags
in the result set.
> Unfortunately the number of unique tags in the result set is unknown at collect time,
so it is not possible to reliably select sparse counting vs. full counting up front. Luckily
there exists solutions for sparse sets that has the property of switching to non-sparse-mode
without a switch-penalty, when the sparse-threshold is exceeded (see http://programmingpraxis.com/2012/03/09/sparse-sets/
for an example). This JIRA aims to implement this functionality in Solr.
> Current status: Sparse counting is implemented for field cache faceting, both single-
and multi-value, with and without doc-values. Sort by count only. The patch applies cleanly
to Solr 4.6.1 and should integrate well with everything as all functionality is unchanged.
After patching, the following new parameters are possible:
> * facet.sparse=true enables sparse faceting.
> * facet.sparse.mintags=10000 the minimum amount of unique tags in the given field for
sparse faceting to be active. This is used for auto-selecting whether sparse should be used
or not.
> * facet.sparse.fraction=0.08 the overhead used for the sparse tracker. Setting this too
low means that only very small result sets are handled as sparse. Setting this too high will
result in a large performance penalty if the result set blows the sparse tracker. Values between
0.04 and 0.1 seems to work well.
> * facet.sparse.packed=true use PackecInts for counters instead of int[]. This saves memory,
but performance will differ. Whether performance will be better or worse depends on the corpus.
Experiment with it.
> * facet.sparse.cutoff=0.90 if the estimated number (based on hitcount) of unique tags
in the search result exceeds this fraction of the sparse tracker, do not perform sparse tracking.
The estimate is based on the assumption that references from documents to tags are distributed
randomly.
> * facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and keep in
memory, ready for usage. Clearing and re-using a counter is faster that allocating it fresh
from the heap. Setting the pool size to 0 means than a new sparse counter will be allocated
each time, just as standard Solr faceting works.
> * facet.sparse.stats=true adds a special tag with timing statistics for sparse faceting.
> * facet.sparse.stats.reset=true resets the timing statistics and clears the pool.
> The parameters needs to be given together with standard faceting parameters, such as
facet=true&facet.field=myfield&facet.mincount=1&facet.sort=true. The defaults
should be usable, so simply appending facet.sparse=true to the URL is a good start.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message