lucene-dev mailing list archives

From Vic Bancroft <bancr...@america.net>
Subject Re: Include BM25 in Lucene?
Date Thu, 19 Oct 2006 12:27:34 GMT
Chuck Williams wrote:

>Vic Bancroft wrote on 10/17/2006 02:44 AM:
>
>>In some of my group's usage of lucene over large document collections,
>>we have split the documents across several machines.  This has led to
>>a concern about whether the inverse document frequency was appropriate,
>>since the score seems to be dependent on the partitioning of documents
>>over indexing hosts.  We have not formulated an experiment to
>>determine if it seriously affects our results, though it has been
>>discussed.
>
>What version of Lucene are you using?  
>
The current systems are based on 1.9.1, though I suspect we should clean 
up the deprecation warnings and move to 2.0.0.

>Are you using ParallelMultiSearcher to manage the distributed indexes or have you
>implemented your own mechanism?  
>
We had started with the ParallelMultiSearcher, but did not see 
adequate scalability with high numbers of concurrent requests.  The 
bottleneck was on the reduce side, folding results back together.  The 
first-cut mechanism we implemented allows for a configurable 
distribution of front end processors and is extremely efficient, at the 
cost of (over)simplification.

Perhaps it is time to investigate the hadoop path . . .

>There was a bug a couple years ago, in the 1.4.3 version as I recall, where 
>ParallelMultiSearcher was not computing df's appropriately, but that has been
>fixed for a long time now.  The df's are the sum of the df's from each 
>distributed index and thus are independent of the partitioning.
>
Interesting.  We randomly spray the documents across the leaf node 
indexers and rely on the tendency of large numbers of documents to smooth 
out the probability distributions.  Hence my interest in participating 
in an effort to implement and evaluate the impact of using a different 
method, such as BM25 or perhaps even some DFR approach [1].
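
To make the partitioning question concrete, here is a rough sketch in 
Java.  It is not Lucene code: the class and method names are made up, 
the shard counts are invented for illustration, and the idf form 
(ln(numDocs / (docFreq + 1)) + 1) is the one I believe the default 
Similarity uses.  It compares an idf built from df's summed over all 
shards with the idf each shard would compute on its own.

```java
// Hypothetical illustration only; none of these names come from Lucene.
final class DfAggregation {

    // idf in the form I believe DefaultSimilarity uses:
    // ln(numDocs / (docFreq + 1)) + 1
    static double idf(long docFreq, long numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    static long sum(long[] values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    // Global idf: sum the per-shard df's and doc counts first, then
    // apply the formula once.  The result depends only on the totals.
    static double globalIdf(long[] shardDocFreq, long[] shardNumDocs) {
        return idf(sum(shardDocFreq), sum(shardNumDocs));
    }

    public static void main(String[] args) {
        // Invented numbers: 300 documents, 30 of which contain the term,
        // split over three shards in two different ways.
        long[] numDocs = {100, 100, 100};
        long[] evenDf = {10, 10, 10};   // term spread evenly
        long[] skewedDf = {28, 1, 1};   // term concentrated on one shard

        System.out.println("global idf, even split:   " + globalIdf(evenDf, numDocs));
        System.out.println("global idf, skewed split: " + globalIdf(skewedDf, numDocs));

        // Per-shard idf, as each leaf would compute it in isolation.
        for (int i = 0; i < numDocs.length; i++) {
            System.out.println("shard " + i + " local idf, skewed split: "
                    + idf(skewedDf[i], numDocs[i]));
        }
    }
}
```

The two global values agree because only the totals enter the formula, 
while the local values in the skewed split differ from shard to shard.  
Random spraying over many documents should push each shard toward the 
even case, which is the smoothing we are relying on.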

more,
l8r,
v

-- 
"The future is here. It's just not evenly distributed yet."
 -- William Gibson, quoted by Whitfield Diffie

[1] http://ir.dcs.gla.ac.uk/terrier/

