lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: best practice: 1.4 billions documents
Date Thu, 25 Nov 2010 07:58:27 GMT
ParallelMultiSearcher, as a subclass of MultiSearcher, has the same problems. These are not crashes;
rather, some queries do not return correctly scored results. This especially affects
all MultiTermQueries (TermRange, Fuzzy, NumericRange, Wildcard, Prefix) if they
are used in a negative fashion (using MUST_NOT, or "-" in QueryParser). For all of those
queries except Fuzzy, you are safe if you use CONSTANT_SCORE_REWRITE_METHOD (set via setRewriteMethod).
The same problems apply to span queries. For *all* Fuzzy queries (negative or not), the scores
are simply wrong, so scoring is broken with (Parallel)MultiSearcher; wrong results, however, are
only returned with negative clauses!

A new class, ParallelIndexSearcher, could help with that by parallelizing across multiple segments;
this is still in the planning phase. The difference from ParallelMultiSearcher would be that it
takes a "single" IndexReader (e.g. a MultiReader in your case) and parallelizes per segment or
per bunch of segments.
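
To make the per-segment idea concrete, here is a rough sketch in plain Java. It is not Lucene code and
not the planned ParallelIndexSearcher: the class SegmentParallelSearch, the Hit type and the method
searchTopN are hypothetical names invented for illustration. It assumes the query has already been
rewritten once against the whole reader, so every segment scores the same query, and each inner list
simply stands in for the hits one segment produced.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: none of these types are Lucene classes.
class SegmentParallelSearch {

    /** A scored hit; docId is assumed to be unique across all segments. */
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Highest score first; ties are broken by the lower docId. */
    static final Comparator<Hit> BY_SCORE_DESC = (a, b) -> {
        int c = Float.compare(b.score, a.score);
        return c != 0 ? c : Integer.compare(a.docId, b.docId);
    };

    /** Sorts a copy of the hits and keeps the n best. */
    static List<Hit> topN(List<Hit> hits, int n) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(BY_SCORE_DESC);
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    /**
     * Collects the top n hits of every segment on its own pool thread,
     * then merges the per-segment lists into one global top n.
     */
    static List<Hit> searchTopN(List<List<Hit>> segments, int n, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<Hit>>> pending = new ArrayList<>();
            for (List<Hit> segment : segments) {
                pending.add(pool.submit(() -> topN(segment, n)));
            }
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> f : pending) {
                merged.addAll(f.get()); // waits for that segment's task
            }
            return topN(merged, n);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The point of the sketch is only that the merge step sees scores computed for one and the same query,
which is what a searcher on top of a single reader would give you. How a real implementation would
group segments into bunches or size the thread pool is left open here.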

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Ganesh [mailto:emailgane@yahoo.co.in]
> Sent: Thursday, November 25, 2010 6:44 AM
> To: java-user@lucene.apache.org
> Subject: Re: best practice: 1.4 billions documents
> 
> Since there was a debate about using multisearcher, what about using
> ParallelMultiSearcher?
> 
> I have indexes with 60 million documents, sometimes growing to 100
> million. I shard the DB by week and use ParallelMultiSearcher to search across
> the shards. All data is on a single system. So far I haven't faced any issues. I used
> Lucene 2.9 and recently upgraded to 3.0.2.
> 
> Do I need to switch to MultiReader?
> 
> Regards
> Ganesh
> 
> 
> ----- Original Message -----
> From: "Luca Rondanini" <luca.rondanini@gmail.com>
> To: <java-user@lucene.apache.org>
> Sent: Monday, November 22, 2010 11:29 PM
> Subject: Re: best practice: 1.4 billions documents
> 
> 
> > Thank you all, I really got some good hints!
> >
> > Of course I will distribute my index over many machines: storing everything on
> > one computer is just crazy; 1.4B docs is going to be an index of almost 2T
> > (in my case).
> >
> > The best solution for me at the moment (from your suggestions) seems to be to
> > identify a criterion that forces a request (search/update) to access only a
> > subset of the index. Multi or Parallel Searchers... I'll decide later.
> >
> > Solr is a really good option and I've already planned on "stealing" parts of
> > its code, but I have the time and resources to try to build my own platform,
> > especially since my data needs heavy processing.
> >
> > I'll keep you posted
> > Luca
> >
> >
> >
> >
> >
> >
> > On Mon, Nov 22, 2010 at 8:54 AM, eks dev <eksdev@yahoo.co.uk> wrote:
> >
> >> Am I the only one who thinks this is not the way to go? MultiReader (or
> >> MultiSearcher) is not going to fix your problems. Having 1.4B documents on
> >> one machine is a big number, no matter how you partition them (unless you
> >> have some really expensive hardware at your disposal). Did I miss the point
> >> somewhere with this recommendation, "use MultiReader and you are good for
> >> 1.4B documents"?
> >>
> >> IMO, you must distribute your index across many machines.
> >>
> >> Your best chance is to look at Solr Cloud and Solr replication (the Solr wiki is
> >> your friend). Of course, you can do it yourself, but building a distributed
> >> setup with what you call "real time updates" is a huge pain.
> >>
> >> Alternatively, google for Lucene or Solr on Cassandra (it has some very nice
> >> properties regarding update latency and architectural simplicity). I do not know
> >> if this is in production anywhere.
> >>
> >> Good luck,
> >> e.
> >>
> >>
> >>
> >> On Mon, Nov 22, 2010 at 5:18 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> >>
> >> > There is no reason to use MultiSearcher instead of the much more consistent
> >> > and effective MultiReader! We (Robert and I) are already planning to
> >> > deprecate it. Since Lucene 2.9, MultiSearcher itself has no benefit over a simple
> >> > IndexSearcher on top of a MultiReader; it has only problems.
> >> >
> >> > The only use cases for real MultiSearchers are the subclasses for "remote
> >> > search" or (perhaps) multi-threaded search, but I would not recommend the
> >> > latter (instead, leave the additional CPUs in your machine free for other
> >> > users doing searches in parallel). Multithreading a single search should not
> >> > be done, as it slows down multiple users accessing the same index at the
> >> > same time. Spend the additional CPU power on other things like warming
> >> > searchers, indexing additional documents, or filling FieldCache in parallel.
> >> >
> >> > Uwe
> >> >
> >> > -----
> >> > Uwe Schindler
> >> > H.-H.-Meier-Allee 63, D-28213 Bremen
> >> > http://www.thetaphi.de
> >> > eMail: uwe@thetaphi.de
> >> >
> >> >
> >> > > -----Original Message-----
> >> > > From: David Fertig [mailto:dfertig@cymfony.com]
> >> > > Sent: Monday, November 22, 2010 4:54 PM
> >> > > To: java-user@lucene.apache.org
> >> > > Subject: RE: best practice: 1.4 billions documents
> >> > >
> >> > > >> We have a couple billion docs in our archives as well...Breaking them
> >> > > >> up by day worked well for us
> >> > >
> >> > > We do not have 2 billion segments in one index. We have roughly 5-10 million
> >> > > documents per index. We are currently using a MultiSearcher, but unresolved
> >> > > Lucene issues in it will force us to move to a MultiReader.
> >> > >
> >> > > As far as the parallel searcher goes, read back on the thread with the subject
> >> > > "Search returning documents matching a NOT range".
> >> > > There is an acknowledged/proven bug with a small unit test, but there is some
> >> > > disagreement about the internal reasons it fails. I have been unable to
> >> > > generate further discussion or a resolution. This was supposed to be added as a
> >> > > bug to the JIRA for the 3.3 release, but has not been. I am not sure which class Solr
> >> > > uses, but if it uses MultiSearcher, it will have the same bug.
> >> > >
> >> > > -----Original Message-----
> >> > > From: Luca Rondanini [mailto:luca.rondanini@gmail.com]
> >> > > Sent: Monday, November 22, 2010 1:47 AM
> >> > > To: java-user@lucene.apache.org
> >> > > Subject: Re: best practice: 1.4 billions documents
> >> > >
> >> > > Hi David, thanks for your answer, it really helped a lot! So, you have an index
> >> > > with more than 2 billion segments. This is pretty much the answer I was
> >> > > searching for: Lucene alone is able to manage such a big index.
> >> > >
> >> > > What kind of problems do you have with the parallel searchers? I'm going to
> >> > > build my index in the next couple of weeks; if you want, we can compare our
> >> > > data.
> >> > >
> >> > > thanks again
> >> > > Luca
> >> > >
> >> > >
> >> > > On Sun, Nov 21, 2010 at 6:22 PM, David Fertig <dfertig@cymfony.com> wrote:
> >> > >
> >> > > > Actually I've been bitten by a still-unresolved issue with the
> >> > > > parallel searchers and recommend a MultiReader instead.
> >> > > > We have a couple billion docs in our archives as well. Breaking them
> >> > > > up by day worked well for us, but you'll need to do something.
> >> > > >
> >> > > > -----Original Message-----
> >> > > > From: Luca Rondanini [mailto:luca.rondanini@gmail.com]
> >> > > > Sent: Sunday, November 21, 2010 8:13 PM
> >> > > > To: java-user@lucene.apache.org; yonik@lucidimagination.com
> >> > > > Subject: Re: best practice: 1.4 billions documents
> >> > > >
> >> > > > thank you both!
> >> > > >
> >> > > > Johannes, katta seems interesting but I will need to solve the
> >> > > > problems of "hot" updates to the index
> >> > > >
> >> > > > Yonik, I see your point - so your suggestion would be to build an
> >> > > > architecture based on ParallelMultiSearcher?
> >> > > >
> >> > > >
> >> > > > On Sun, Nov 21, 2010 at 3:48 PM, Yonik Seeley <yonik@lucidimagination.com> wrote:
> >> > > >
> >> > > > > On Sun, Nov 21, 2010 at 6:33 PM, Luca Rondanini
> >> > > > > <luca.rondanini@gmail.com> wrote:
> >> > > > > > Hi everybody,
> >> > > > > >
> >> > > > > > I really need some good advice! I need to index in Lucene something like 1.4
> >> > > > > > billion documents. I have experience with Lucene but I've never worked with
> >> > > > > > such a big number of documents. Also, this is just the number of docs at
> >> > > > > > "start-up": they are going to grow, and fast.
> >> > > > > >
> >> > > > > > I don't have to tell you that I need the system to be fast and to support
> >> > > > > > real time updates to the documents.
> >> > > > > >
> >> > > > > > The first solution that came to my mind was to use ParallelMultiSearcher,
> >> > > > > > splitting the index into many "sub-indexes" (how many docs per index?
> >> > > > > > 100,000?), but I don't have experience with it and I don't know how well it
> >> > > > > > will scale as the number of documents grows!
> >> > > > > >
> >> > > > > > A more solid solution seems to be building some kind of integration with Hadoop.
> >> > > > > > But I didn't find much about Lucene and Hadoop integration.
> >> > > > > >
> >> > > > > > Any idea? Which direction should I go (pure Lucene or Hadoop)?
> >> > > > >
> >> > > > > There seems to be a common misconception about Hadoop regarding search.
> >> > > > > Map-reduce as implemented in Hadoop is really for batch-oriented
> >> > > > > jobs only (or those types of jobs where you don't need a quick
> >> > > > > response time). It's definitely not for normal queries (unless you
> >> > > > > have unusual requirements).
> >> > > > >
> >> > > > > -Yonik
> >> > > > > http://www.lucidimagination.com
> >> > > > >
> >> > > > >
> >> --------------------------------------------------------------------
> >> > > > > - To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> >> > > > >
> >> > > > >
> >> > > >
> >> >
> >> >
> >> >
> >> >
> >>
> >


