lucene-java-user mailing list archives

From Greg Bowyer <>
Subject Re: Strange bug when we enable faceting
Date Thu, 03 Nov 2011 02:25:06 GMT
For reference, the full answer to why this occurs is that when faceting 
is enabled in Solr, the filter cache is used to cache the main query. 
This is done so that, if a user starts performing drill-downs on facets, 
the DocSet for their unadorned "main" query (that is, the raw query 
without filters) is already cached in the filter cache, making it 
potentially available on the next request.

This is done by the FacetComponent as part of query preparation.

This saves the engine from having to recompute the DocSet on the next 
request. Because the cache evicts by recency of use, these queries fall 
out of the filter cache fairly quickly when they are not re-used.
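The eviction behaviour can be sketched with a toy access-ordered map (this is 
an illustration, not Solr's actual FastLRUCache; the keys and values are 
made-up strings):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal access-ordered LRU sketch showing why a cached main-query DocSet
// that is never re-used gets evicted before the frequently re-used filter
// entries. Capacity and entries are illustrative only.
public class LruEvictionDemo {
    static Map<String, String> simulate() {
        final int capacity = 2;
        Map<String, String> cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity; // evict least-recently-used entry
            }
        };
        cache.put("fq:+dtype:(1 OR 2)", "docset-A");     // a real filter entry
        cache.put("q:dat data cartridge", "docset-B");   // cached main query
        cache.get("fq:+dtype:(1 OR 2)");                 // filter re-used on the next request
        cache.put("fq:+rev_type:[1 TO 2]", "docset-C");  // evicts the unused main query
        return cache;
    }

    public static void main(String[] args) {
        System.out.println(simulate().containsKey("q:dat data cartridge")); // prints "false"
    }
}
```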

This is done inside the SolrIndexSearcher code, in the following block 
at around line 1000 inside getDocListC:

----- %< -----
     } else {
       // do it the normal way...
(1)    if ((cmd.getFlags() & GET_DOCSET) != 0) {
         // this currently conflates returning the docset for the base query vs
         // the base query and all filters.
         DocSet qDocSet = getDocListAndSetNC(qr, cmd);
         // cache the docSet matching the query w/o filtering
(2)      if (qDocSet != null && filterCache != null && !qr.isPartialResults())
           filterCache.put(cmd.getQuery(), qDocSet);
       } else {
----- >% -----

The test at (1) passes because the facet component forced the request to 
generate a DocSet, which is recorded in the flags bitmask; since this is 
true for most queries, the caching at (2) occurs.
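For readers unfamiliar with the flag idiom, here is a minimal sketch of the 
bitmask test at (1); the constant value below is invented for illustration, 
not Solr's actual GET_DOCSET flag:

```java
// Toy demonstration of combining and testing command flags with a bitmask.
public class FlagDemo {
    static final int GET_DOCSET = 0x04; // hypothetical bit position

    static boolean needsDocSet(int flags) {
        // same shape as (cmd.getFlags() & GET_DOCSET) != 0 in getDocListC
        return (flags & GET_DOCSET) != 0;
    }

    public static void main(String[] args) {
        int flags = 0;
        flags |= GET_DOCSET; // what the facet component effectively forces
        System.out.println(needsDocSet(flags)); // prints "true"
    }
}
```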

Sorry for asking redundant questions on the mailing list :S

On 02/11/11 17:09, Greg Bowyer wrote:
> Ignore this !!
> I discovered through testing and code review today just what the
> filter cache is used for and why my previous thinking was wrong: I had
> the cache set too large to accommodate all of the other things the
> filter cache stores.
> On 02/11/11 11:17, Greg Bowyer wrote:
>> When I enable faceting in SOLR, for some reason our incoming user queries
>> start being cached in the filter cache, which very quickly causes the
>> instance to run out of memory. We could lower the size of the
>> filtercache, but I feel this would be a band-aid over a far odder problem.
>> I have been investigating the heap dumps that were created on our
>> instances when we ran out of memory; these dumps show (unless YourKit is
>> being dishonest) that the filter cache contains
>> BoostedQuery(BooleanQuery(DisjunctionMaxQuery)) objects, each of
>> which contains term objects that I would not expect to see in the
>> filterCache.
>> A snapshot of the object graph can be seen here.
>> In terms of our index, queries and setup: we have a Solr 3.3 setup with
>> sharding, where some nodes act as aggregators and the rest act as
>> slaves or shards. As per recommendations, the aggregators act as
>> dispatchers for searches, but do not themselves surface any index data.
>> Most of our search queries differ in the search terms but generally have
>> the following form:
>>        path=/aggregator/
>> params={fl=docid,pid,score&start=0&q=dat+data+cartridge&fq=+parent_cids:438&fq=+dtype:(1+OR+2)&rows=20
>>        path=/select
>> params={fl=docid,score&start=0&q=polyethylene+bench+storage&enable=true&isShard=true&wt=javabin&fq=+rev_type:[1+TO+2]&fq=+parent_cids:25000500&fq=+dtype:(1+OR+2)&fsv=true&rows=20&version=2
>> Breaking this down, the fqs defined are against three fields:
>>     * parent_cids - This field contains roughly 1394 terms; there are a
>>                     few permutations for this field, but I would expect
>>                     at most ~10000 fqs against it
>>     * dtype - This field has 2 terms, and we only ever query it as shown
>>               above; it is reserved for some future work and would only
>>               ever have at most 8 terms
>>     * rev_type - Similar to dtype; we only have 3 terms in this field
>> Our filters are generally not user-accessible, and we ensure that
>> clients always provide filter queries in the same order to avoid
>> duplicate fq cache entries (that is, we go to some length to avoid
>> fq=+dtype:(2+OR+1) appearing when we already cache fq=+dtype:(1+OR+2)).
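A client-side canonicalisation like the one described could be sketched as 
follows (a hypothetical helper, not part of Solr or the poster's stack):

```java
import java.util.Arrays;

// Hypothetical client-side helper: sort fq strings into a deterministic
// order so that logically identical filter sets always produce the same
// request, and therefore hit the same filter-cache entries.
public class FqNormalizer {
    static String[] normalize(String... fqs) {
        String[] sorted = fqs.clone();
        Arrays.sort(sorted); // any stable, deterministic order will do
        return sorted;
    }

    public static void main(String[] args) {
        String[] a = normalize("+parent_cids:438", "+dtype:(1 OR 2)");
        String[] b = normalize("+dtype:(1 OR 2)", "+parent_cids:438");
        System.out.println(Arrays.equals(a, b)); // prints "true"
    }
}
```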
>> Our search handler is defined with some basic parameters as follows
>> ---- %<   ----
>> <requestHandler name="search" class="solr.SearchHandler" default="true">
>> <!-- default values for query parameters can be specified, these
>>        will be overridden by parameters in the request
>>       -->
>> <lst name="defaults">
>> <str name="echoParams">explicit</str>
>> <str name="qf">title^1.0 descr^0.5 mft^0.5 brand^0.5</str>
>> <str name="pf">title^3 descr^0.5</str>
>> <str name="boost">product(redir,bid)</str>
>> <str name="ps">4</str>
>> <str name="mm">50%</str>
>> <str name="defType">edismax</str>
>> <int name="rows">20</int>
>> <str name="facet">true</str>
>> <str name="facet.field">price_bucket</str>
>> <str name="facet.price_bucket.sort">count</str>
>> <str name="facet.price_bucket.mincount">1</str>
>> <str name="facet.price_bucket.limit">100</str>
>> <str name="facet.mincount">1</str>
>> </lst>
>> </requestHandler>
>> ---->% ----
>> price_bucket is a field that we compute at index time: it takes a stored
>> field called price and creates a term that reflects the range (or
>> bucket) of prices the given document falls into. I did originally
>> attempt to use facet counts directly but found that the instance failed
>> by running out of memory; at the time it was assumed that our range
>> of prices and the granularity of our "buckets" were creating too many
>> filter queries. For reference, there are 239 unique terms in the
>> price_bucket field.
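The index-time bucketing described above could look roughly like this (the 
bucket width and label format are assumptions for illustration, not the 
poster's actual schema logic):

```java
// Hypothetical sketch of deriving a price_bucket term from a price at
// index time: snap the price down to a bucket boundary and emit a label.
public class PriceBucketer {
    static String bucket(double price, double width) {
        long lo = (long) (Math.floor(price / width) * width);
        return lo + "-" + (lo + (long) width);
    }

    public static void main(String[] args) {
        System.out.println(bucket(27.99, 10.0)); // prints "20-30"
    }
}
```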
>> At present our installation, indexing practices and queries are very
>> vanilla; we are doing nothing esoteric.
>> This is a fairly undesirable issue, as it means that our filter cache
>> fills rapidly with cache items that are unlikely to ever be
>> needed again.
>> Does anyone have any ideas on what could be causing this?
>> -- Greg Bowyer
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:

