lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From tsuraan <>
Subject Re: Batch searching
Date Wed, 22 Jul 2009 19:23:47 GMT
> Out of curiosity, what is the size of your corpus?  How much and how
> quickly do you expect it to grow?

in terms of lucene documents, we tend to have in the 10M-100M range.
Currently we use merging to make larger indices from smaller ones, so
a single index can have a lot of documents in it, but merging takes a
long time so I'm trying to test out just using a ton of tiny indices
to see if the search penalty from doing that is worth the time savings
from not having to build and optimize large indices.

> I'm just trying to make sure that we are all on the same page here ^^
> I can see the benefits of doing what you are describing with a very
> large corpus that is expected to grow at quick rate, but if that's not
> really your use case, then perhaps it might be worth investigating if a
> simpler solution would serve you just as well.

The indices also grow pretty quickly.  We have some cases where we get
nearly a million new documents per day.  I haven't looked at those
machines for quite a while, but I guess they'd probably have well over
a hundred million documents, and still are growing.  We also don't
have a lot of simultaneous searches yet, but that's changing, so I'm
getting concerned about how well that's being handled.  We expect that
we will soon be dealing with tens to hundreds of searches being
executed simultaneously.

> In the example you provided, you are only talking about searching
> against 1M documents, which I can guarantee will search with VERY good
> performance in a single properly setup lucene index.
> Now if we are talking more on the order of... 100M or more documents you
> may be onto something.

But in any case, there isn't currently any framework for making
multiple searches simultaneously use an index in a coordinated
fashion.  I was pretty much just planning on adding it to my tests if
such a thing existed.  Since it doesn't, I guess I'll stuck to
searching in parallel and hoping that the Linux VFS layer is smart
enough to keep things fast until I have time to try putting something
together myself.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message