lucene-java-user mailing list archives

From Shai Erera <>
Subject Re: Problem with search
Date Wed, 14 Apr 2010 08:55:09 GMT
I don't know if that proposal is the most efficient one, but you can try it.
In general, what you're looking for is a GROUP BY Bill-Id feature that then
selects the most recent Version, right? Since you don't need all the Versions
of the same Bill, you can hold only the most recent Version-Id.
What you can do is write a Collector which, for each received document, checks
its Bill-Id and Version-Id. It keeps a Map Bill-Id -> Version-Id and for
every incoming doc checks the map:
1) If the Bill-Id hasn't been seen yet, it stores the Bill-Id and Version-Id
in the map.
2) If it has been seen, it compares the Version-Id of the incoming doc to the
one in the map and replaces the stored one if the incoming doc is more recent.
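
The two steps above can be sketched in plain Java. This is only an illustration of the map logic, not Lucene's Collector API: the class name LatestVersionTracker, its Entry type and the collect(docId, billId, versionId) signature are hypothetical, and it assumes Version-Ids are numeric with a larger value meaning a more recent version.

```java
import java.util.HashMap;
import java.util.Map;

/** Tracks, per Bill-Id, the document with the highest Version-Id seen so far. */
class LatestVersionTracker {

    /** The current best document for one bill (hypothetical helper type). */
    static final class Entry {
        final int docId;
        final long versionId;

        Entry(int docId, long versionId) {
            this.docId = docId;
            this.versionId = versionId;
        }
    }

    private final Map<String, Entry> latestByBill = new HashMap<>();

    /** Intended to be called once per matching document. */
    void collect(int docId, String billId, long versionId) {
        Entry current = latestByBill.get(billId);
        // Step 1: bill not seen yet, so store it.
        // Step 2: bill seen, so replace only if the incoming version is newer
        // (assumption: larger Version-Id == more recent).
        if (current == null || versionId > current.versionId) {
            latestByBill.put(billId, new Entry(docId, versionId));
        }
    }

    /** Bill-Id -> best document found so far. */
    Map<String, Entry> latest() {
        return latestByBill;
    }
}
```

In a real Collector, the collect method would look up the Bill-Id and Version-Id for the current doc and then apply the same comparison.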

By storing the Bill-Id and Version-Id in the FieldCache you can make that
Collector work very fast. Also, you can apply some optimizations to the
process, e.g. not checking the map if the document has no chance of being
selected for the top-K requested docs (e.g. because of a low score).
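
A rough sketch of that idea, again in plain Java rather than against the Lucene API: the two arrays indexed by doc id stand in for the per-document values the FieldCache would supply, and minScore is a hypothetical cut-off for "no chance of making the top-K". The class and method names are made up for illustration, and Version-Ids are assumed to be ints where larger means newer.

```java
import java.util.HashMap;
import java.util.Map;

/** Keeps the newest doc per bill, reading ids from arrays indexed by doc id. */
class CachedLatestVersionCollector {
    private final String[] billIds;   // billIds[docId], stand-in for cached field values
    private final int[] versionIds;   // versionIds[docId], stand-in for cached field values
    private final float minScore;     // hypothetical threshold, an assumption
    private final Map<String, Integer> bestDocByBill = new HashMap<>();

    CachedLatestVersionCollector(String[] billIds, int[] versionIds, float minScore) {
        this.billIds = billIds;
        this.versionIds = versionIds;
        this.minScore = minScore;
    }

    void collect(int docId, float score) {
        // Optimization: skip the map lookup for docs scoring below the cut-off.
        if (score < minScore) {
            return;
        }
        String billId = billIds[docId];
        Integer best = bestDocByBill.get(billId);
        // Array lookups replace loading stored fields for every hit.
        if (best == null || versionIds[docId] > versionIds[best.intValue()]) {
            bestDocByBill.put(billId, docId);
        }
    }

    /** Bill-Id -> doc id of the newest version that passed the cut-off. */
    Map<String, Integer> bestDocByBill() {
        return bestDocByBill;
    }
}
```

One thing to watch with the score cut-off: if the newest version of a bill scores below it, an older version of that bill is kept instead, so the threshold trades accuracy for speed.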

I've outlined a general approach; other, perhaps more efficient, ones may
exist.

Another alternative is to run your search collecting the top N*K docs, where
N is a multiplier you apply to K. After the search is done, you filter out
the unneeded docs with an "old" Version-Id. If you choose N smartly, you'll
do it just once, without having to re-run the query because the filter
removed too many docs.
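
The post-filtering step could look roughly like this. It is a sketch under assumptions: BillHit and TopKLatestFilter are hypothetical names, the input list is assumed to hold the top N*K hits already sorted best first, and each (Bill-Id, Version-Id) pair is assumed to occur at most once.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One search hit with the fields needed for filtering (hypothetical type). */
class BillHit {
    final int docId;
    final String billId;
    final int versionId;

    BillHit(int docId, String billId, int versionId) {
        this.docId = docId;
        this.billId = billId;
        this.versionId = versionId;
    }
}

class TopKLatestFilter {
    /** Keeps at most k hits, in rank order, dropping older versions of a bill. */
    static List<BillHit> filter(List<BillHit> rankedHits, int k) {
        // Pass 1: find the newest Version-Id per bill among the collected hits.
        Map<String, Integer> maxVersion = new HashMap<>();
        for (BillHit h : rankedHits) {
            Integer seen = maxVersion.get(h.billId);
            if (seen == null || h.versionId > seen) {
                maxVersion.put(h.billId, h.versionId);
            }
        }
        // Pass 2: walk in rank order and keep only the newest version of each bill.
        List<BillHit> out = new ArrayList<>();
        for (BillHit h : rankedHits) {
            if (out.size() == k) {
                break;
            }
            if (h.versionId == maxVersion.get(h.billId)) {
                out.add(h);
            }
        }
        return out;
    }
}
```

If the returned list has fewer than k entries while the search produced a full N*K hits, that is the case where N was too small and the query would need to be re-run with a larger multiplier. Note also that "newest" here means newest among the collected hits, not necessarily newest in the whole index.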

Hope this helps,

On Tue, Apr 13, 2010 at 11:59 PM, Sirish Vadala <> wrote:

> Hello All,
> I am kind of new to Lucene, and having problem filtering search results.
> Background:
> My Indexed documents have multiple bills and each bill has multiple
> versions.
> Each version of the same bill has a different bill Version Id, but the same
> bill Id. In the most likely case, the text in different versions varies only
> slightly. The text for all these versions is indexed.
> Problem:
> Let's say, for a particular search term, if it is present in one version of
> the bill, in most cases it is present in all other versions too. So the
> users have come up with a requirement stating that they would like to see
> only the latest bill version for the same bill having this search term.
> So when I perform a search for a particular word, I might get different
> versions of the same bill, but have to display only the latest record for
> that bill. I did some research and understood that filters could be used to
> implement this kind of requirement, however I am not sure how to proceed.
> Any hints on how to implement this would be highly appreciated.
> Thanks.
