lucene-java-user mailing list archives

From Tom Burton-West
Subject Re: Can you use reduced sized test indexes to predict performance gains for a larger index?
Date Mon, 15 Feb 2010 19:17:09 GMT

Hi Chris,

In our experience with large indexes (about 200-300GB), we found that most of
our bottlenecks involved disk I/O.  When our experimental indexes were too
small, much of the index could fit in cache, so our test results did not
apply to our larger indexes.  On the other hand, once we started building
our test indexes so that they were significantly larger than the amount of
memory available for OS disk caching, we saw results that extrapolated to
the large index.
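
A rough way to check this condition before running a benchmark is to compare
the on-disk size of the test index with the memory left over for the OS disk
cache. The following is only a minimal sketch: the class name, the method
names, and the "factor" threshold are hypothetical, and the cache size has to
be supplied by hand (for example, from what your OS reports as free or cached
memory). No specific ratio is implied by the numbers above; pick one that
suits your hardware.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper, not part of Lucene.
final class IndexCacheCheck {

    /** Sums the sizes of the regular files directly inside the index directory. */
    static long indexSizeBytes(Path indexDir) throws IOException {
        try (Stream<Path> files = Files.list(indexDir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    /**
     * True if the index is at least `factor` times larger than the memory
     * available for OS disk caching. `factor` is an assumed tuning knob.
     */
    static boolean exceedsCache(long indexBytes, long cacheBytes, double factor) {
        return indexBytes >= (long) (cacheBytes * factor);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: IndexCacheCheck <indexDir> <cacheBytes>");
            return;
        }
        long size = indexSizeBytes(Paths.get(args[0]));
        long cacheBytes = Long.parseLong(args[1]);
        // 2.0 is an arbitrary illustrative threshold.
        System.out.println("index bytes: " + size
                + ", larger than cache: " + exceedsCache(size, cacheBytes, 2.0));
    }
}
```

If the check comes back false, test results from that index are likely to be
dominated by cache hits rather than disk I/O.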

Tom Burton-West

ryguasu wrote:
> I'd like to try some experiments to see if I can improve search
> performance by changing analysis (e.g. adding/removing word bigrams or
> commongrams), or by changing how I map my source records into Lucene
> documents. The problem is that my index currently is about 1TB in size
> and takes about 2-3 weeks to build, so if I have to rebuild the entire
> index in order to test each potential improvement, then I'm going to
> be waiting around a lot.
>
> One option is to test potential performance improvements by building
> indexes not for the full dataset, but rather for, say, a 1% sample of
> the full dataset. (That is, I'll just index 1% of the source records.)
> I would build one small control index, and then n small test indexes,
> one for each intervention I wish to try. The hope would be that, if an
> indexing intervention significantly improves performance for the small
> indexes, then it would also significantly improve performance of the
> full dataset. (Similarly, you'd hope that if an intervention *didn't*
> significantly improve performance on the small indexes, then it would
> *not* significantly improve performance of the full dataset.) This
> would allow me to quickly accept and reject interventions (at least
> provisionally), and only try applying the most obviously promising
> ones to the full dataset.
>
> Any thoughts on how naive this is? Does it sound more like a way to
> save time, or like a way to waste time misleading myself?
>
> Cheers,
> Chris
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
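
On the sampling step described in the quoted message: if the control index and
each test index are built from a sample, it helps for all of them to contain
exactly the same records, so that differences come from the intervention and
not from the sample. One way to get that is to choose records by hashing a
stable identifier. This is a minimal sketch only; it assumes each source
record has a stable string ID, which the thread does not state, and the class
and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper, not part of Lucene.
final class RecordSampler {

    /**
     * Returns true for roughly `percent` percent of record IDs.
     * The decision depends only on the ID, so every index build
     * selects the same records.
     */
    static boolean inSample(String recordId, int percent) {
        CRC32 crc = new CRC32();
        crc.update(recordId.getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative long, so the remainder is in 0..99.
        return crc.getValue() % 100 < percent;
    }
}
```

An indexing loop would then skip any record for which
inSample(id, 1) is false. Note that the percentage still has to be large
enough that the resulting index exceeds the memory available for OS disk
caching, for the reason given above.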
