lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "David Fertig" <dfer...@cymfony.com>
Subject RE: best practice: 1.4 billions documents
Date Mon, 22 Nov 2010 15:53:50 GMT
>> We have a couple billion docs in our archives as well...Breaking them up by day worked
well for us

We do not have 2 billion segments in one index  We have roughly 5-10 million documents per
index. We are currently using a miltisearcher but unresolved lucene issues in this will force
us to move to a multireader.

As far as the parallel searcher goes, read back on the thread with subject "Search returning
documents matching a NOT range".
There is an acknowledged/proven bug with a small unit test, but there is some disagreement
about the internal reasons it fails. I have been unable to generate further discussion or
a resolution. This was supposed to be added as a bug to the JIRA for the 3.3 release, but
has not been.  I am not which class Solr uses, but if it uses MultiSearcher, it will have
the same bug.

-----Original Message-----
From: Luca Rondanini [mailto:luca.rondanini@gmail.com] 
Sent: Monday, November 22, 2010 1:47 AM
To: java-user@lucene.apache.org
Subject: Re: best practice: 1.4 billions documents

Hi David, thanks for your answer. it really helped a lot! so, you have an
index with more than 2 billions segments. this is pretty much the answer I
was searching for: lucene alone is able to manage such a big index.

which kind of problems do you have with the parallel searchers? I'm going to
build my index in the next couple of weeks if you want we can confront our
data

thanks again
Luca


On Sun, Nov 21, 2010 at 6:22 PM, David Fertig <dfertig@cymfony.com> wrote:

> Actually I've been bitten by an still-unresolved issue with the parallel
> searchers and recommend a MultiReader instead.
> We have a couple billion docs in our archives as well.  Breaking them up by
> day worked well for us, but you'll need to do something.
>
> -----Original Message-----
> From: Luca Rondanini [mailto:luca.rondanini@gmail.com]
> Sent: Sunday, November 21, 2010 8:13 PM
> To: java-user@lucene.apache.org; yonik@lucidimagination.com
> Subject: Re: best practice: 1.4 billions documents
>
> thank you both!
>
> Johannes, katta seems interesting but I will need to solve the problems of
> "hot" updates to the index
>
> Yonik, I see your point - so your suggestion would be to build an
> architecture based on ParallelMultiSearcher?
>
>
> On Sun, Nov 21, 2010 at 3:48 PM, Yonik Seeley <yonik@lucidimagination.com
> >wrote:
>
> > On Sun, Nov 21, 2010 at 6:33 PM, Luca Rondanini
> > <luca.rondanini@gmail.com> wrote:
> > > Hi everybody,
> > >
> > > I really need some good advice! I need to index in lucene something
> like
> > 1.4
> > > billions documents. I had experience in lucene but I've never worked
> with
> > > such a big number of documents. Also this is just the number of docs at
> > > "start-up": they are going to grow and fast.
> > >
> > > I don't have to tell you that I need the system to be fast and to
> support
> > > real time updates to the documents
> > >
> > > The first solution that came to my mind was to use
> ParallelMultiSearcher,
> > > splitting the index into many "sub-index" (how many docs per index?
> > > 100,000?) but I don't have experience with it and I don't know how well
> > will
> > > scale while the number of documents grows!
> > >
> > > A more solid solution seems to build some kind of integration with
> > hadoop.
> > > But I didn't find match about lucene and hadoop integration.
> > >
> > > Any idea? Which direction should I go (pure lucene or hadoop)?
> >
> > There seems to be a common misconception about hadoop regarding search.
> > Map-reduce as implemented in hadoop is really for batch oriented jobs
> > only (or those types of jobs where you don't need a quick response
> > time).  It's definitely not for normal queries (unless you have
> > unusual requirements).
> >
> > -Yonik
> > http://www.lucidimagination.com
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
Mime
View raw message