lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Does Lucene Supports Billions of data
Date Fri, 02 May 2008 00:35:04 GMT
Right.  And the typical answer to that is:

- If your terms are roughly equally distributed in all N indices (e.g. random doc->index/shard
assignment), the relevance score will roughly match.

- If you have business rules for doc->index/shard distribution, then your relevance scores
will not be comparable.

Otis 

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Toke Eskildsen <te@statsbiblioteket.dk>
> To: java-user@lucene.apache.org
> Sent: Friday, May 2, 2008 12:13:04 AM
> Subject: Re: Does Lucene Supports Billions of data
> 
> From: John Wang 
> [...]
> > sub index 1: 1 billion docs
> > sub index 2: 1 billion docs
> > sub index 3: 1 billion docs
> > 
> > federating search to these subindexes, you represent an index of 3 
> > billiondocs, and all internal doc ids are of type int.
> 
> That falls under Daniel's "...unless you wrap your own framework around it". The 
> problem with the solution you're describing is that it's not functionally 
> equivalent to a single index of 3 billion docs.
> 
> If you just create 3 independent indexes and merge the top hits from all 3, the 
> ranking of the documents will be messed up. You'll need to make sure that the 
> scores from the different indexes can be compared. That's tricky when the score 
> depends on the frequency of the terms in the whole corpus.
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message