lucene-java-user mailing list archives

From markharw00d <>
Subject Re: scalability recommendations for large performance-intensive indexes
Date Wed, 08 Feb 2006 22:04:00 GMT
Hi Vince, sounds like the same issue I highlighted recently on the 
java-dev list.

See here:

The problem lies in the underlying cost of reading TermDocs for very
common terms (a problem for both queries and filters).

For your issue, given that the problem field has only 2 values, you can
comfortably cache 2 BitSets (one per value) and use them as filters.
This would require roughly (35M/8)*2 bytes, or about 9 MB of RAM.
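
A rough sketch of that idea in plain Java, in case it helps. It uses
java.util.BitSet directly and does not touch the Lucene API;
TwoValueFilterCache, mark, filter and estimatedBytes are names made up
for illustration, and how the bits get populated (e.g. one pass over the
index at warm-up) is assumed rather than shown.

```java
import java.util.BitSet;

/** Hypothetical stand-in for a per-value filter cache; not a Lucene class. */
final class TwoValueFilterCache {
    // One BitSet per field value, one bit per document id.
    private final BitSet[] bitsByValue;

    TwoValueFilterCache(int maxDoc) {
        bitsByValue = new BitSet[] { new BitSet(maxDoc), new BitSet(maxDoc) };
    }

    /** Record once, at warm-up, which of the two values (0 or 1) a document carries. */
    void mark(int docId, int value) {
        bitsByValue[value].set(docId);
    }

    /** Keep only the query hits whose document carries the given value. */
    BitSet filter(BitSet queryHits, int value) {
        BitSet result = (BitSet) queryHits.clone(); // leave the caller's hits untouched
        result.and(bitsByValue[value]);             // in-memory AND against the cached bits
        return result;
    }

    /** Approximate size: one bit per document for each cached set. */
    static long estimatedBytes(long maxDoc, int sets) {
        return (maxDoc + 7) / 8 * sets;
    }
}
```

With maxDoc = 35,000,000 and 2 sets, estimatedBytes gives 8,750,000
bytes, which is where the "about 9 meg" figure comes from.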


Vince Taluskie wrote:

>hello All,
>I'm looking for some advice on how to improve scalability - we have a fairly
>large lucene index of 35M documents, max 1k document size (most much
>smaller) and 14 fields.   We combine descriptive text together into a
>"contents" field and search on that and have been very pleased with handling
>almost 100 queries/sec at about 8-12ms for the average search.
>Prior to that we had a common attribute for which about 50% of the docs had
>one value and the rest had the other value and the boolean query slowed
>response times very significantly.  We handled this by breaking up our
>indexes so that the index only contained one attribute or the other and
>eliminated the need for the boolean - this was a 7-8x improvement.
>Now we're back to wanting to add another attribute to the documents for
>which most of the docs will have one value and much fewer will have the
>other and although it sounds so simple - my limited testing with an 85/15
>ratio is showing another big hit on performance with the boolean.    A two
>term boolean search without the attribute is about 7-8ms, adding the
>attribute to the boolean search increases the elapsed time to 4x and 2x of
>original for the 85% and 15% frequencies respectively.
>I had some hope that a QueryFilter would really help out but it turns out to
>be much much slower:  the 85% term ends up taking a whopping 336ms and the
>15% term ends up around 65ms which is 40x and 8x slower than the original
>8ms query speed without the additional attribute.
>I have to ask if there's not a better way to handle the addition of a
>common attribute with a few possible values across the index.  Any other
>recommended approaches?
>Thanks in advance,
