lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Antony Bowesman <>
Subject Re: Combining score from two or more hits
Date Fri, 23 Mar 2007 08:01:57 GMT
Chris Hostetter wrote:
> if you are using a HitCollector, there any re-evaluation is going to
> happen in your code using whatever mechanism you want -- once your collect
> method is called on a docid, Lucene is done with that docid and no longer
> cares about it ... it's only whatever storage you may be maintaining of
> high scoring docs thta needs to know that you've decided the score has
> changed.
> your big problem is going to be that you basically need to maintain a list
> of *every* doc collected, if you don't know what the score of any of them
> are until you've processed all the rest ... since docs are collected in
> increasing order of docid, you might be able to make some optimizations
> based on how big of a gap you've got between the doc you are currently
> collecting and the last doc you've collected if you know that you're
> always going to add docs that "relate" to eachother in sequential bundles
> -- but this would be some very custom code depending on your use case.

I only ever need to return a couple of ID fields per doc hit, so I load them 
with FieldCache when I start a new searcher.  These IDs refer to unique objects 
elsewhere, but there can be one or more instances of the same Id in the index 
due to the way I've structured Documents.  A Document = an attachment in the 
other system attached to the other system's object which can have 1...n 
attachments.  My problem is I need to return only unique external Ids with some 
kind of combined score up to the requested maxHits from the client.

Getting the unique Ids is no problem, but as you say I either have to store all 
hits and then sort them by score at the end once I know all unique docs, or do 
some clever stuff with some type of PriorityQueue that allows me to re-jig 
scores that already exist in the sorted queue.

One idea your comments raise is the relationship of docids to the group of 
Documents added for the higher level object.  All the Documents for the external 
object are added with a single writer at index time.  Assuming that the 
Documents for a single external Id will either all exist or none, then will the 
doc ids always be sequential for ever for that external Id or will they 
'reorganise' themselves?


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message