lucene-java-user mailing list archives

From Yonik Seeley <yo...@lucidimagination.com>
Subject Re: best practice: 1.4 billions documents
Date Sun, 21 Nov 2010 23:48:03 GMT
On Sun, Nov 21, 2010 at 6:33 PM, Luca Rondanini
<luca.rondanini@gmail.com> wrote:
> Hi everybody,
>
> I really need some good advice! I need to index something like 1.4
> billion documents in Lucene. I have experience with Lucene, but I've never
> worked with such a large number of documents. Also, this is just the number
> of docs at "start-up": it is going to grow, and fast.
>
> I don't have to tell you that I need the system to be fast and to support
> real-time updates to the documents.
>
> The first solution that came to my mind was to use ParallelMultiSearcher,
> splitting the index into many sub-indexes (how many docs per index?
> 100,000?), but I don't have experience with it and I don't know how well it
> will scale as the number of documents grows!
>
> A more solid solution seems to be to build some kind of integration with
> Hadoop. But I didn't find much about Lucene and Hadoop integration.
>
> Any ideas? Which direction should I go (pure Lucene or Hadoop)?

There seems to be a common misconception about Hadoop regarding search.
Map-reduce as implemented in Hadoop is really for batch-oriented jobs
only (or those types of jobs where you don't need a quick response
time).  It's definitely not for serving normal queries (unless you have
unusual requirements).
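
To make the contrast concrete: an interactive query over a split index is
typically answered by sending it to every sub-index at the same time and
merging the per-index results, with no batch job in the path. Below is a
minimal sketch of that scatter-gather pattern in plain Java. Shard and Hit
are hypothetical stand-ins invented for illustration; they are not Lucene
or Hadoop classes, and a real deployment would put an actual index searcher
behind each Shard.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

public class ScatterGatherSketch {

    /** A scored hit. Illustrative only; not a Lucene type. */
    public static final class Hit {
        public final String docId;
        public final float score;

        public Hit(String docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Hypothetical stand-in for one sub-index (shard). */
    public interface Shard {
        List<Hit> search(String query, int topK) throws Exception;
    }

    /** Query every shard in parallel, then merge the per-shard top-K lists. */
    public static List<Hit> search(List<Shard> shards, String query, int topK,
                                   ExecutorService pool) throws Exception {
        // Scatter: one task per shard, each asking for that shard's best topK hits.
        List<Callable<List<Hit>>> tasks = new ArrayList<>();
        for (Shard shard : shards) {
            tasks.add(() -> shard.search(query, topK));
        }

        // Gather: invokeAll blocks until every shard has answered (or failed).
        List<Hit> merged = new ArrayList<>();
        for (Future<List<Hit>> future : pool.invokeAll(tasks)) {
            merged.addAll(future.get());
        }

        // Merge: sort by descending score and keep the global topK.
        merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(merged.subList(0, Math.min(topK, merged.size())));
    }
}
```

Response time here is bounded by the slowest shard plus a small merge,
which is why this shape suits interactive queries and a batch framework
does not. The sketch assumes scores are comparable across shards, which
needs care in practice.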

-Yonik
http://www.lucidimagination.com


