lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: best practice: 1.4 billions documents
Date Mon, 22 Nov 2010 17:05:50 GMT
A local multithreaded search can be done in another way, even for a single index, but not using
the implementation of (Parallel)MultiSearcher. This could be a new class directly extending
IndexSearcher, which could even search e.g. different segments in parallel (searching a
MultiReader is no longer different from searching a DirectoryReader, because the segments of a
single Lucene index are internally also handled like a MultiReader).

The problem with MultiSearcher is how query rewriting is handled: correct scoring requires
every subsearcher to execute the same rewritten query.
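
For illustration only, here is a rough sketch of that idea against the Lucene 3.0-era API.
The class name and the merge logic are mine, not Lucene's, and the per-segment searchers
shown here still compute idf from single-segment statistics; a real implementation would
have to build the query's Weight once from whole-index statistics, which is exactly the
rewriting/scoring point above.

import java.util.*;
import java.util.concurrent.*;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

/** Hypothetical sketch: search each segment in parallel, merge the top hits. */
public class SegmentParallelSearch {
  public static ScoreDoc[] search(IndexReader topReader, final Query query,
                                  final int n, ExecutorService pool)
      throws Exception {
    // Assumes topReader is a directory reader whose sub-readers are segments.
    IndexReader[] segs = topReader.getSequentialSubReaders();
    List<Future<TopDocs>> jobs = new ArrayList<Future<TopDocs>>();
    int[] docBase = new int[segs.length];
    int base = 0;
    for (int i = 0; i < segs.length; i++) {
      docBase[i] = base;
      base += segs[i].maxDoc();
      final IndexReader seg = segs[i];
      jobs.add(pool.submit(new Callable<TopDocs>() {
        public TopDocs call() throws Exception {
          // CAVEAT: idf comes from this segment alone; see note above.
          return new IndexSearcher(seg).search(query, n);
        }
      }));
    }
    // Merge per-segment hits, rebasing segment-local doc ids to global ids.
    List<ScoreDoc> all = new ArrayList<ScoreDoc>();
    for (int i = 0; i < jobs.size(); i++) {
      for (ScoreDoc sd : jobs.get(i).get().scoreDocs) {
        all.add(new ScoreDoc(sd.doc + docBase[i], sd.score));
      }
    }
    Collections.sort(all, new Comparator<ScoreDoc>() {
      public int compare(ScoreDoc a, ScoreDoc b) {
        return Float.compare(b.score, a.score); // best score first
      }
    });
    return all.subList(0, Math.min(n, all.size())).toArray(new ScoreDoc[0]);
  }
}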

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: David Fertig [mailto:dfertig@cymfony.com]
> Sent: Monday, November 22, 2010 5:57 PM
> To: java-user@lucene.apache.org
> Subject: RE: best practice: 1.4 billions documents
> 
> > it has only problems.
> Perhaps these known problems should be added to the API docs, so users who
> are encouraged to start clean with the 3.x API don't build bad applications from
> scratch?
> 
> Parallel searching is extremely powerful and should not be abandoned.
> 
> 
> 
> -----Original Message-----
> From: Uwe Schindler [mailto:uwe@thetaphi.de]
> Sent: Monday, November 22, 2010 11:19 AM
> To: java-user@lucene.apache.org
> Subject: RE: best practice: 1.4 billions documents
> 
> There is no reason to use MultiSearcher instead of the much more consistent and
> effective MultiReader! We (Robert and I) are already planning to deprecate it.
> MultiSearcher itself has had no benefit over a simple IndexSearcher on top of a
> MultiReader since Lucene 2.9; it has only problems.
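> 
> For illustration, a minimal sketch of that setup (the directory paths are
> placeholders):
> 
> import java.io.File;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.MultiReader;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.store.FSDirectory;
> 
> public class MultiReaderSetup {
>   public static IndexSearcher open() throws Exception {
>     // Several physical indexes, one logical view: the query is rewritten
>     // and scored once against the combined index.
>     IndexReader r1 = IndexReader.open(FSDirectory.open(new File("/data/index1")));
>     IndexReader r2 = IndexReader.open(FSDirectory.open(new File("/data/index2")));
>     return new IndexSearcher(new MultiReader(new IndexReader[] { r1, r2 }));
>   }
> }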
> 
> The only use cases for a real MultiSearcher are the subclasses for "remote search"
> or (perhaps) multi-threaded search, but I would not recommend the latter
> (instead, leave the additional CPUs in your machine free for other users doing
> searches in parallel). Multithreading a single search should not be done, as it
> slows down multiple users accessing the same index at the same time. Spend
> the additional CPU power on other things like warming searchers, indexing
> additional documents, or filling the FieldCache in parallel.
> 
> Uwe
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 
> 
> > -----Original Message-----
> > From: David Fertig [mailto:dfertig@cymfony.com]
> > Sent: Monday, November 22, 2010 4:54 PM
> > To: java-user@lucene.apache.org
> > Subject: RE: best practice: 1.4 billions documents
> >
> > >> We have a couple billion docs in our archives as well...Breaking
> > >> them up by day worked well for us
> >
> > We do not have 2 billion segments in one index. We have roughly 5-10
> > million documents per index. We are currently using a MultiSearcher,
> > but unresolved Lucene issues with it will force us to move to a MultiReader.
> >
> > As far as the parallel searcher goes, read back on the thread with
> > subject "Search returning documents matching a NOT range".
> > There is an acknowledged/proven bug with a small unit test, but there
> > is some disagreement about the internal reasons it fails. I have been
> > unable to generate further discussion or a resolution. This was
> > supposed to be added as a bug to the JIRA for the 3.3 release, but has
> > not been. I am not sure which class Solr uses, but if it uses MultiSearcher,
> > it will have the same bug.
> >
> > -----Original Message-----
> > From: Luca Rondanini [mailto:luca.rondanini@gmail.com]
> > Sent: Monday, November 22, 2010 1:47 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: best practice: 1.4 billions documents
> >
> > Hi David, thanks for your answer. It really helped a lot! So, you have
> > an index with more than 2 billion segments. This is pretty much the
> > answer I was searching for: Lucene alone is able to manage such a big index.
> >
> > What kind of problems do you have with the parallel searchers? I'm
> > going to build my index in the next couple of weeks; if you want, we can
> > compare our data.
> >
> > thanks again
> > Luca
> >
> >
> > On Sun, Nov 21, 2010 at 6:22 PM, David Fertig <dfertig@cymfony.com>
> wrote:
> >
> > > Actually I've been bitten by a still-unresolved issue with the
> > > parallel searchers and recommend a MultiReader instead.
> > > We have a couple billion docs in our archives as well.  Breaking
> > > them up by day worked well for us, but you'll need to do something.
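> > >
> > > Roughly like this, for illustration (the per-day directory layout here is
> > > made up):
> > >
> > > import java.io.File;
> > > import java.util.ArrayList;
> > > import java.util.List;
> > > import org.apache.lucene.index.IndexReader;
> > > import org.apache.lucene.index.MultiReader;
> > > import org.apache.lucene.search.IndexSearcher;
> > > import org.apache.lucene.store.FSDirectory;
> > >
> > > public class DailyIndexes {
> > >   public static IndexSearcher open(File root) throws Exception {
> > >     // One sub-index per day under root (e.g. /archive/2010-11-21);
> > >     // expose them all as a single logical reader.
> > >     List<IndexReader> readers = new ArrayList<IndexReader>();
> > >     for (File day : root.listFiles()) {
> > >       if (day.isDirectory()) {
> > >         readers.add(IndexReader.open(FSDirectory.open(day)));
> > >       }
> > >     }
> > >     return new IndexSearcher(
> > >         new MultiReader(readers.toArray(new IndexReader[readers.size()])));
> > >   }
> > > }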
> > >
> > > -----Original Message-----
> > > From: Luca Rondanini [mailto:luca.rondanini@gmail.com]
> > > Sent: Sunday, November 21, 2010 8:13 PM
> > > To: java-user@lucene.apache.org; yonik@lucidimagination.com
> > > Subject: Re: best practice: 1.4 billions documents
> > >
> > > thank you both!
> > >
> > > Johannes, katta seems interesting, but I will need to solve the
> > > problem of "hot" updates to the index.
> > >
> > > Yonik, I see your point - so your suggestion would be to build an
> > > architecture based on ParallelMultiSearcher?
> > >
> > >
> > > On Sun, Nov 21, 2010 at 3:48 PM, Yonik Seeley
> > > <yonik@lucidimagination.com
> > > >wrote:
> > >
> > > > On Sun, Nov 21, 2010 at 6:33 PM, Luca Rondanini
> > > > <luca.rondanini@gmail.com> wrote:
> > > > > Hi everybody,
> > > > >
> > > > > I really need some good advice! I need to index in Lucene something
> > > > > like 1.4 billion documents. I have experience with Lucene, but I've
> > > > > never worked with such a big number of documents. Also, this is just
> > > > > the number of docs at "start-up": they are going to grow, and fast.
> > > > >
> > > > > I don't have to tell you that I need the system to be fast and to
> > > > > support real-time updates to the documents.
> > > > >
> > > > > The first solution that came to my mind was to use
> > > > > ParallelMultiSearcher, splitting the index into many "sub-indexes"
> > > > > (how many docs per index? 100,000?), but I don't have experience
> > > > > with it and I don't know how well it will scale as the number of
> > > > > documents grows!
> > > > >
> > > > > A more solid solution seems to be to build some kind of integration
> > > > > with Hadoop. But I didn't find much about Lucene and Hadoop
> > > > > integration.
> > > > >
> > > > > Any idea? Which direction should I go (pure Lucene or Hadoop)?
> > > >
> > > > There seems to be a common misconception about Hadoop regarding search.
> > > > Map-reduce as implemented in Hadoop is really for batch-oriented
> > > > jobs only (or those types of jobs where you don't need a quick
> > > > response time). It's definitely not for normal queries (unless
> > > > you have unusual requirements).
> > > >
> > > > -Yonik
> > > > http://www.lucidimagination.com
> > > >
> 
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

