lucene-java-user mailing list archives

From mark harwood <markharw...@yahoo.co.uk>
Subject Re: Does Lucene Supports Billions of data
Date Fri, 02 May 2008 08:36:14 GMT
>> If your terms are roughly equally distributed in all N indices (e.g. random doc->index/shard
>> assignment), the relevance score will roughly match.


Agreed. I did some formal benchmarking of local IDF vs global IDF relevance ranking recently.
I measured the movement of the top ranked document in a single index's results (global IDF)
vs the same document's position in results merged from 2 remote indexes with randomized doc->shard
assignment (a local IDF scheme). This distance was measured for a large number of real-world
queries.
Results were very promising - the distributed ranking scheme very rarely differed from that
of the single large index.
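
For readers who want to try this kind of comparison, here is a minimal sketch in Python. It is not the benchmark described above. It assumes a synthetic in-memory corpus, a simplified score (sqrt(tf) * idf^2 summed over matching query terms, with no length norms), and the classic Lucene-style IDF of 1 + ln(numDocs / (docFreq + 1)). All function names (search, federated_search, random_shards, top_doc_displacement) are hypothetical helpers for illustration, not Lucene APIs.

```python
import math
import random
from collections import Counter


def idf(doc_freq, num_docs):
    # Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def search(docs, query):
    """Score every doc using statistics from `docs` only; best hit first."""
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))
    hits = []
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        # Simplified score (an assumption): sqrt(tf) * idf^2 per matching term.
        s = sum(math.sqrt(tf[t]) * idf(doc_freq[t], len(docs)) ** 2
                for t in query if tf[t] > 0)
        if s > 0:
            hits.append((s, doc_id))
    return sorted(hits, key=lambda h: (-h[0], h[1]))


def federated_search(shards, query):
    """Local IDF scheme: each shard scores on its own, then hits are merged by score."""
    merged = [hit for shard in shards for hit in search(shard, query)]
    return sorted(merged, key=lambda h: (-h[0], h[1]))


def random_shards(docs, n_shards, rng):
    """Assign each document to a shard uniformly at random."""
    shards = [{} for _ in range(n_shards)]
    for doc_id, terms in docs.items():
        shards[rng.randrange(n_shards)][doc_id] = terms
    return shards


def top_doc_displacement(docs, shards, query):
    """Position in the merged results of the single index's top document (0 = unchanged)."""
    single = search(docs, query)
    if not single:
        return None  # the query matched nothing
    top_id = single[0][1]
    merged_ids = [doc_id for _, doc_id in federated_search(shards, query)]
    return merged_ids.index(top_id)


if __name__ == "__main__":
    rng = random.Random(42)
    vocab = ["term%d" % i for i in range(200)]  # synthetic vocabulary
    docs = {i: rng.choices(vocab, k=30) for i in range(2000)}
    shards = random_shards(docs, 2, rng)
    queries = [rng.sample(vocab, 2) for _ in range(100)]
    moves = [top_doc_displacement(docs, shards, q) for q in queries]
    moves = [m for m in moves if m is not None]
    print("mean displacement of the top document:", sum(moves) / len(moves))
```

A displacement of 0 means the merged (local IDF) results put the same document first as the single index (global IDF) does. Real queries and a real term distribution may behave differently from this uniform synthetic vocabulary.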

----- Original Message ----
From: Otis Gospodnetic <otis_gospodnetic@yahoo.com>
To: java-user@lucene.apache.org
Sent: Friday, 2 May, 2008 1:35:04 AM
Subject: Re: Does Lucene Supports Billions of data

Right.  And the typical answer to that is:

- If your terms are roughly equally distributed in all N indices (e.g. random doc->index/shard
assignment), the relevance score will roughly match.

- If you have business rules for doc->index/shard distribution, then your relevance scores
will not be comparable.
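
A small sketch in Python of why the two cases differ, under stated assumptions: a made-up corpus of 200 documents in two categories, a term ("index") that is common in one category and rare in the other, and the classic Lucene-style IDF of 1 + ln(numDocs / (docFreq + 1)). Sharding by id modulo 2 stands in for random assignment so the result is reproducible. The helper names (shard_by, idf_per_shard, make_corpus) are hypothetical and not part of Lucene.

```python
import math


def idf(doc_freq, num_docs):
    # Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def shard_by(docs, shard_of):
    """Group documents into shards keyed by shard_of(doc)."""
    shards = {}
    for doc in docs:
        shards.setdefault(shard_of(doc), []).append(doc)
    return shards


def idf_per_shard(shards, term):
    """IDF of `term` computed from each shard's own statistics."""
    result = {}
    for name, docs in shards.items():
        doc_freq = sum(1 for doc in docs if term in doc["terms"])
        result[name] = idf(doc_freq, len(docs))
    return result


def make_corpus():
    """Made-up data: 'index' occurs in 90 of 100 tech docs and 1 of 100 sports docs."""
    docs = []
    for i in range(100):
        terms = {"index"} if i < 90 else {"other"}
        docs.append({"id": i, "category": "tech", "terms": terms})
    for i in range(100, 200):
        terms = {"index"} if i == 100 else {"other"}
        docs.append({"id": i, "category": "sports", "terms": terms})
    return docs


if __name__ == "__main__":
    docs = make_corpus()
    # Even spread (stand-in for random assignment): per-shard IDFs nearly equal.
    print(idf_per_shard(shard_by(docs, lambda d: d["id"] % 2), "index"))
    # Business rule (one shard per category): per-shard IDFs far apart.
    print(idf_per_shard(shard_by(docs, lambda d: d["category"]), "index"))
```

With the even spread, both shards weight the term almost identically, so merged scores line up. With one shard per category, the same term match is weighted several times more heavily in the shard where the term is rare, which is what makes the merged scores hard to compare.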

Otis 

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Toke Eskildsen <te@statsbiblioteket.dk>
> To: java-user@lucene.apache.org
> Sent: Friday, May 2, 2008 12:13:04 AM
> Subject: Re: Does Lucene Supports Billions of data
> 
> From: John Wang 
> [...]
> > sub index 1: 1 billion docs
> > sub index 2: 1 billion docs
> > sub index 3: 1 billion docs
> > 
> > federating search to these subindexes, you represent an index of 3 
> > billion docs, and all internal doc ids are of type int.
> 
> That falls under Daniel's "...unless you wrap your own framework around it". The 
> problem with the solution you're describing is that it's not functionally 
> equivalent to a single index of 3 billion docs.
> 
> If you just create 3 independent indexes and merge the top hits from all 3, the 
> ranking of the documents will be messed up. You'll need to make sure that the 
> scores from the different indexes can be compared. That's tricky when the score 
> depends on the frequency of the terms in the whole corpus.
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 


