lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Burton-West, Tom" <>
Subject RE: Sharding Techniques
Date Thu, 12 May 2011 18:48:33 GMT
Hi Samar,

Have you looked at top or iostat or other monitoring utilities to see if you are cpu bound
vs I/O bound?

With 225 term queries, it's possible that you are I/O bound.

I suspect you need to think about seek time and caching. For each unique field:term combination
lucene has to look up the postings for that term in the index.  Additionally for any phrase,
lucene has to additionally look up the positions data for each term in the phrase. (In our
case phrase searches are very expensive as our positions (*prx) index is about 8 times as
large as our frq index) So for 225 terms including some number of phrases, that is a lot of
disk seeks.  To the extent that the terms are close together in the index and various buffer
caches contain adjacent terms, you might not actually have 225 seeks, but I suspect there
will still be a lot.    

Although Lucene implements a number of caches (and you should take a look at your cache hit
ratios), Lucene depends on the OS disk cache to cache postings data for individual terms.
Most unix/linux OS's use free memory for disk caching.  How much memory is available on the
machine after the JVM gets it allocation?

Have you considered running cache warming queries of your most frequent terms/phrases so that
the data is in the OS disk cache?  


>> When queries (without two fields mentioned above) have a lot of
>>words/phrases search time is high. E.g I took a query with around 80 unique
>>terms (not words) in 5 fields. These terms occur repeatedly and become total
>>225 terms (non-unique). This particular query took 4.2 seconds. the 15
>>indexes used for this query were of total size 5 G.
>>Are 225 terms (80 unique) is a very big number?

-----Original Message-----

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message