lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Removing search results that fall within a time range
Date Wed, 24 May 2006 01:49:17 GMT

A pretty big variable here in trying to find a "clever" solution to your
problem is: how many results do you want?

Do you need all of them for some sort of downstream processing, or are you
only interested in the first M? ... how big is M?

Assuming M is something manageable, I would try writing a HitCollector
that maintains a bounded, sorted list of (doc,date) pairs (sorted
on the date).

when you collect a new match X, you scan the list looking for any item
I such that X.date-N <= I.date && I.date <= X.date+N (N being the number
of seconds from your question) .. for all things you find that meet that
criteria (they should all be in a clump since the list is sorted) remove
all but the one with the lowest date, and then either replace that one
with X if X's date is lower still, or throw away X if it's not the lowest
date (ie: collecting B after A and C in your A B C example below)

One thing to watch out for: make sure your bounded, sorted list is
bounded by M+1, and then throw away the last item when you are done ..
if you limit it to M items you might fill up and start ignoring items
outside of the range of the list, and then the last doc you collect might
be like "B" and cause two items to be removed, leaving you with one less
result than you wanted.
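
A rough sketch of that bookkeeping in plain Java, in case it helps. The
class name TimeWindowDeduper and its methods are made up for illustration
and are not part of Lucene; the idea is that your HitCollector's
collect(doc, score) would look up the date for doc however you like and
hand it to this helper. It assumes dates are plain longs in seconds, that
"bounded" means dropping the entry with the highest date on overflow, and
that two docs with the exact same date count as duplicates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical helper (not a Lucene class): keeps at most M+1 (date, doc) pairs sorted by date. */
class TimeWindowDeduper {
    private final int maxResults;      // M: how many results you want
    private final long windowSeconds;  // N: docs this close together collapse into one
    private final TreeMap<Long, Integer> byDate = new TreeMap<>(); // date -> doc

    TimeWindowDeduper(int maxResults, long windowSeconds) {
        this.maxResults = maxResults;
        this.windowSeconds = windowSeconds;
    }

    /** Call once per match, e.g. from inside HitCollector.collect(doc, score). */
    void collect(int doc, long date) {
        // The "clump": every kept item I with X.date-N <= I.date <= X.date+N.
        NavigableMap<Long, Integer> clump =
            byDate.subMap(date - windowSeconds, true, date + windowSeconds, true);
        if (clump.isEmpty()) {
            byDate.put(date, doc);
        } else {
            long lowestDate = clump.firstKey();
            int lowestDoc = clump.firstEntry().getValue();
            clump.clear(); // the view is backed by byDate, so this removes the whole clump
            if (date < lowestDate) {
                byDate.put(date, doc);             // X is the earliest: it replaces the clump
            } else {
                byDate.put(lowestDate, lowestDoc); // X is thrown away
            }
        }
        // Bound the list at M+1 by dropping the entry with the highest date.
        if (byDate.size() > maxResults + 1) {
            byDate.pollLastEntry();
        }
    }

    /** When you are done: the first M docs in date order (the spare item is dropped). */
    List<Integer> results() {
        List<Integer> docs = new ArrayList<>();
        for (Integer doc : byDate.values()) {
            if (docs.size() == maxResults) break;
            docs.add(doc);
        }
        return docs;
    }
}
```

With N = 45, collecting A (12:00:00), then C (12:01:00), then B (12:00:30)
leaves only A, since B's clump contains both A and C. Note this follows
the description above literally, so a long chain whose middle items show
up late may not collapse completely (a doc near an already-removed C can
still get in); if that matters you would need to remember the full time
span each kept item covers.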


: Date: Tue, 23 May 2006 17:38:04 -0400
: From: Benjamin Stein <ben@shadowtv.com>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Removing search results that fall within a time range
:
: I have a requirement to only return one result for all documents whose
: timestamps fall within N seconds of one another. (where timestamp is a
: field and N is an integer).
:
: For example, Document A is timestamped "12:00:00" and Document B has
: timestamp "12:00:30", Document B should be discarded.  On the other
: hand, if Document B has timestamp "12:01:00" then I should return both
: (assuming 30 < N < 59 seconds).
:
: Similarly, if Documents A, B, and C have timestamps "12:00:00",
: "12:00:30", and "12:01:00" respectively, only Document A should be
: returned (because B is close to A, and C is close to B).
:
: If it helps to simplify things, we can assume results are sorted by
: time.  Also, I can apply logic at index time or at search time.
:
: Any suggestions?  This is a pretty tough concept to search the archives
: for...
:
: --Ben
:
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
: For additional commands, e-mail: java-user-help@lucene.apache.org
:



-Hoss

