lucene-java-user mailing list archives

From Greg Bowyer <gbow...@fastmail.co.uk>
Subject Strange bug when we enable faceting
Date Wed, 02 Nov 2011 18:17:58 GMT
When I enable faceting in Solr, our incoming user queries start being 
cached in the filter cache, which very quickly leads the instance to run 
out of memory. We could lower the size of the filterCache, but I feel 
this would be a band-aid over a far odder problem.

I have been investigating the heap dumps that were created on our 
instances when we ran out of memory. These dumps show (unless YourKit is 
being dishonest) that the filter cache contains 
BoostedQueries(BooleanQueries(DisjunctionMaxQueries)) objects, each of 
which contains term objects that I would not expect to see in the 
filterCache.

A snapshot of the object graph can be seen here:
http://gbowyer.freeshell.org/filter-cache2.html

In terms of our index, queries and setup: we have a Solr 3.3 setup with 
sharding. Some nodes act as aggregators, with the rest acting as slaves 
or shards. As per recommendations, the aggregators act as dispatchers 
for searches, but do not themselves surface any index data.

Most of our search queries differ on the search terms but generally have 
the following form:

     path=/aggregator/ 
params={fl=docid,pid,score&start=0&q=dat+data+cartridge&fq=+parent_cids:438&fq=+dtype:(1+OR+2)&rows=20

     path=/select 
params={fl=docid,score&start=0&q=polyethylene+bench+storage&enable=true&isShard=true&wt=javabin&fq=+rev_type:[1+TO+2]&fq=+parent_cids:25000500&fq=+dtype:(1+OR+2)&fsv=true&rows=20&version=2

Breaking this down, the fqs defined are against three fields:

     * parent_cids - This field contains roughly 1394 terms. There are a
                     few permutations for this field, but I would expect
                     at most ~10000 fqs for it.

     * dtype - This field has 2 terms, and we only ever query it as shown
               above. It is reserved for some future work and would only
               ever have at most 8 terms.

     * rev_type - Similar to dtype; we only have 3 terms in this field.
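
One way to check these expectations against real traffic is to tally 
the distinct fq values that actually appear in the request log. The 
following is only a rough sketch in Python, assuming log lines shaped 
like the two samples above; the helper names fq_values and tally_fqs 
are made up for illustration and are not part of Solr.

```python
import re
from collections import Counter
from urllib.parse import parse_qs

# Assumption: each log line carries "params={...}" as in the samples above.
# The closing brace may be missing when a line is truncated, so capture up
# to the first "}" or whitespace character.
PARAMS_RE = re.compile(r"params=\{([^}\s]*)")


def fq_values(log_line):
    """Return the decoded fq values found in one request log line."""
    match = PARAMS_RE.search(log_line)
    if not match:
        return []
    # parse_qs decodes "+" to a space and collects repeated keys in a list.
    params = parse_qs(match.group(1))
    return [value.strip() for value in params.get("fq", [])]


def tally_fqs(lines):
    """Count how often each distinct fq value occurs across the log lines."""
    counts = Counter()
    for line in lines:
        counts.update(fq_values(line))
    return counts
```

If the number of distinct keys in the resulting Counter stays near the 
figures above while the filterCache still grows, the extra entries are 
presumably coming from somewhere other than the fq parameters.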

None of our filters are generally user accessible, and we ensure that 
clients always provide filter queries in the same order to avoid 
duplicate fqs (that is, we go to some length to avoid things like 
fq=+dtype:(2+OR+1) appearing, since we already cache fq=+dtype:(1+OR+2)).
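
The kind of client-side normalisation described above could look 
roughly like the sketch below. It is an illustration, not our actual 
client code: canonical_fq and canonical_fqs are hypothetical names, and 
the sketch assumes the fq has already been URL-decoded and is either a 
plain clause or a single flat "field:(a OR b)" group. Nested groups 
are not handled.

```python
import re

# Matches a single flat OR group such as "+dtype:(2 OR 1)".
OR_GROUP_RE = re.compile(r"^\+?(\w+):\((.+)\)$")


def canonical_fq(fq):
    """Return fq with the terms of a flat OR group sorted and de-duplicated."""
    fq = fq.strip()
    match = OR_GROUP_RE.match(fq)
    if not match:
        # Not a flat OR group: leave the clause untouched.
        return fq
    field, body = match.groups()
    # String ordering is enough here; it only has to be consistent.
    terms = sorted(set(body.split(" OR ")))
    return f"+{field}:({' OR '.join(terms)})"


def canonical_fqs(fqs):
    """Normalise each fq and return the list in a stable order."""
    return sorted(canonical_fq(fq) for fq in fqs)
```

With this, "+dtype:(2 OR 1)" and "+dtype:(1 OR 2)" both normalise to the 
same string, so only one form should ever be sent.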

Our search handler is defined with some basic parameters as follows:

---- %< ----
<requestHandler name="search" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters can be specified, these
       will be overridden by parameters in the request
    -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="qf">title^1.0 descr^0.5 mft^0.5 brand^0.5</str>
    <str name="pf">title^3 descr^0.5</str>
    <str name="boost">product(redir,bid)</str>
    <str name="ps">4</str>
    <str name="mm">50%</str>
    <str name="defType">edismax</str>
    <int name="rows">20</int>
    <str name="facet">true</str>
    <str name="facet.field">price_bucket</str>
    <str name="facet.price_bucket.sort">count</str>
    <str name="facet.price_bucket.mincount">1</str>
    <str name="facet.price_bucket.limit">100</str>
    <str name="facet.mincount">1</str>
  </lst>
</requestHandler>
---- >% ----

price_bucket is a field that we derive at index time: it takes a field 
we store called price and creates a term that reflects the range (or 
bucket) of prices that the given document falls into. I did originally 
attempt to use facet counts directly, but found that the instance failed 
due to running out of memory; at the time it was assumed that our range 
of prices and the granularity of our "buckets" were creating too many 
filter queries. For reference, there are 239 unique terms in the 
price_bucket field.
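
To make the bucketing idea concrete, here is a minimal sketch of how a 
price might be turned into a bucket term at index time. The fixed 
bucket width, the "lower-upper" term format and the function name 
price_bucket are all assumptions made for illustration; the real 
bucket boundaries are not shown here.

```python
def price_bucket(price, width=10):
    """Map a price to a term naming the fixed-width range it falls into.

    Assumption: buckets are half-open ranges [lower, lower + width) of
    equal width. The actual bucketing scheme may differ.
    """
    if price < 0:
        raise ValueError("price must be non-negative")
    lower = int(price // width) * width
    return f"{lower}-{lower + width}"
```

For example, with the default width a price of 24.99 would be indexed 
under the term "20-30".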

At present our installation, indexing practices and queries are very 
vanilla; we are doing nothing esoteric beyond the out-of-the-box setup.

This is a fairly undesirable issue, as it means that our filter cache 
fills rapidly with cache items that are unlikely to ever be required 
again.

Does anyone have any ideas on what could be causing this?

-- Greg Bowyer
