lucene-java-user mailing list archives

From "Jack Krupansky" <>
Subject Re: How to properly correlate relevance in a search across multiple collections
Date Sat, 06 Sep 2014 18:10:27 GMT
An observation: df (document frequency), and the IDF derived from it, is a 
key driver of the whole relevancy framework on which stock Lucene is based. 
There is no question about its significant value. But... that means that we 
can't blindly "correlate" relevancy between "collections", in large part 
because the document scores are so heavily driven by df, which is specific 
to the corpus of each collection.
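
To make the corpus dependence concrete, here is a minimal sketch that assumes the classic TF-IDF form idf = 1 + ln(numDocs / (docFreq + 1)). The class name and the statistics are made up for illustration and are not Lucene API; the point is only that the same term gets a very different weight in two collections.

```java
// Sketch only: IdfPerCorpus is a hypothetical name, not a Lucene class.
final class IdfPerCorpus {
    // Assumed classic form: idf = 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // Hypothetical statistics for the same term in two collections.
        double idfInA = idf(9, 1_000_000);      // term is rare in collection A
        double idfInB = idf(49_999, 100_000);   // term is common in collection B
        System.out.println("idf in A: " + idfInA);
        System.out.println("idf in B: " + idfInB);
    }
}
```

A document matching that term in collection A would outscore an otherwise identical document in collection B purely because of the surrounding corpus.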

My modest proposal: As valuable as df-based relevancy is, offer an 
easy-to-use "switch" to drop back to a pure tf-based relevancy score 
(primarily tf, though it can include other factors, as long as they are 
limited to the contents of the document itself) to sidestep these 
corpus-dependent scores. In other words, 
the score of the document could depend on only the contents of the document 
itself, not the corpus. Yes, that's a major loss of relevance, but the 
benefits for operations in a multi-corpus, distributed world can be 

Yes, you can do this yourself by just plugging in your own custom 
"similarity" class, but it should be offered as a much easier-to-use 
"switch" for Lucene itself (and Solr too!).

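As a rough illustration of what a pure tf-based score means, the sketch below scores a document from its own terms only. It is plain Java, not a Lucene Similarity; the class name, the sqrt(tf) weighting, and the length normalization are assumptions borrowed from classic TF-IDF conventions. In Lucene the equivalent would be a custom similarity whose idf factor is constant, and the class to extend depends on the Lucene version.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: TfOnlyScore is a hypothetical name, not a Lucene class.
final class TfOnlyScore {
    // Sum of sqrt(tf) over the query terms, scaled by 1/sqrt(document length).
    // Nothing here depends on df or on the number of documents in the corpus.
    static double score(List<String> docTerms, List<String> queryTerms) {
        if (docTerms.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (String q : queryTerms) {
            long tf = docTerms.stream().filter(q::equals).count();
            sum += Math.sqrt((double) tf);
        }
        return sum / Math.sqrt((double) docTerms.size());
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("lucene", "search", "lucene", "index");
        System.out.println(score(doc, Arrays.asList("lucene")));
    }
}
```

Because the score is a function of the document and the query alone, the same document gets the same score no matter which collection it is indexed in, so hits from different collections can be merged by score directly.
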
The alternative is to have some mechanism to define and work with a 
"super-corpus" or "super-collection" that integrates the df for multiple 
corpora, but... df is calculated or updated for the overall corpus, so a 
cross-corpus df would require recalculating df for all terms in the index 
whenever the multi-corpus structure changes, which can work in some cases, 
but not for things like distributed searches in Solr. That might be a 
superior solution, but might not be as easy or as performant as a simple 
non-df similarity approach.
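
The "super-corpus" idea can be sketched as summing per-collection statistics before computing idf. This is a toy model, not how Lucene or Solr store statistics: GlobalDf, CollectionStats, and the classic idf form 1 + ln(numDocs / (docFreq + 1)) are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Sketch only: GlobalDf and CollectionStats are hypothetical names.
final class GlobalDf {
    // Per-collection statistics: document count plus df for each term.
    static final class CollectionStats {
        final long numDocs;
        final Map<String, Long> docFreq;

        CollectionStats(long numDocs, Map<String, Long> docFreq) {
            this.numDocs = numDocs;
            this.docFreq = docFreq;
        }
    }

    static long totalDocs(List<CollectionStats> collections) {
        long total = 0;
        for (CollectionStats c : collections) {
            total += c.numDocs;
        }
        return total;
    }

    static long totalDf(List<CollectionStats> collections, String term) {
        long total = 0;
        for (CollectionStats c : collections) {
            total += c.docFreq.getOrDefault(term, 0L);
        }
        return total;
    }

    // One idf value shared by every collection in the super-corpus.
    static double globalIdf(List<CollectionStats> collections, String term) {
        return 1.0 + Math.log((double) totalDocs(collections)
                / (double) (totalDf(collections, term) + 1));
    }

    public static void main(String[] args) {
        List<CollectionStats> all = Arrays.asList(
                new CollectionStats(1000, Collections.singletonMap("lucene", 9L)),
                new CollectionStats(1000, Collections.singletonMap("lucene", 989L)));
        System.out.println(globalIdf(all, "lucene"));
    }
}
```

The sums have to be recomputed whenever a collection is added, removed, or updated, which is the recalculation cost described above.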

It might also be nice for apps to offer users pure-tf scoring if it provides 
faster search results, and then the user could click on a "refine results" 
button to re-do the search with the more expensive cross-corpus df-based 


-- Jack Krupansky

-----Original Message----- 
From: Baldwin, David
Sent: Friday, September 5, 2014 8:05 PM
Subject: How to properly correlate relevance in a search across multiple 
collections

I have a project with multiple collections (could be dozens at times) where 
a single result set needs to be generated by applying the same search 
criteria to each collection directory and then combining all the 
sub-searches into a single result set with correlated relevance.

Does anyone have good experience with this, and could they share some 
tidbits or info I may not have run across yet?

