lucene-java-user mailing list archives

From "Benjamin Stein" <...@shadowtv.com>
Subject RE: Removing search results that fall within a time range
Date Tue, 23 May 2006 23:27:53 GMT
 

> -----Original Message-----
> From: karl wettin [mailto:kalle@snigel.net] 
> Sent: Tuesday, May 23, 2006 6:44 PM
> To: java-user@lucene.apache.org
> Subject: Re: Removing search results that fall within a time range
> 
> On Tue, 2006-05-23 at 17:38 -0400, Benjamin Stein wrote:
> > I have a requirement to only return one result for all documents
> > whose timestamps fall within N seconds of one another. (where
> > timestamp is a field and N is an integer).
> > 
> > For example, Document A is timestamped "12:00:00" and Document B
> > has timestamp "12:00:30", Document B should be discarded.  On the
> > other hand, if Document B has timestamp "12:01:00" then I should
> > return both (assuming 30 < N < 59 seconds).
> > 
> > Similarly, if Documents A, B, and C have timestamps "12:00:00", 
> > "12:00:30", and "12:01:00" respectively, only Document A should be 
> > returned (because B is close to A, and C is close to B).
> > 
> > If it helps to simplify things, we can assume results are sorted by 
> > time.  Also, I can apply logic at index time or at search time.
> > 
> > Any suggestions?  This is a pretty tough concept to search the 
> > archives for...
> 

> How big is the corpus and how many hits do you estimate a 
> search can result in? Can you just take the penalty from 
> iterating the hits?
> 

The corpus is very big: approximately 300,000,000 documents and
growing.  I would estimate a potentially huge number of hits per search.

We currently do iterate through the hits and process them as you
suggest, but that requires some impressive kludges to work :)  I was
just wondering whether there is a clever way to push this logic into
the index/search process.
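
For reference, the collapsing rule itself can be sketched in a few lines
of plain Java.  This is only an illustration of the requirement, not our
production code: it assumes the hits are already sorted by time, that
timestamps are available as seconds, and that "within N seconds" means a
gap of at most N.  The class and method names (TimeCollapse, collapse)
are made up for the example and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

public class TimeCollapse {
    /**
     * Returns the positions of the hits to keep.
     * Assumption: timestamps are in seconds and sorted ascending.
     * Assumption: "within N seconds" means a gap of at most n.
     */
    public static List<Integer> collapse(long[] timestamps, long n) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < timestamps.length; i++) {
            // Compare with the immediately preceding hit, whether or not it
            // was kept, so that chains (A near B, B near C) collapse to A.
            if (i == 0 || timestamps[i] - timestamps[i - 1] > n) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long n = 45; // any value with 30 < N < 59 behaves the same here
        // 12:00:00 = 43200 s, 12:00:30 = 43230 s, 12:01:00 = 43260 s
        System.out.println(collapse(new long[] {43200, 43230}, n));        // [0]
        System.out.println(collapse(new long[] {43200, 43260}, n));        // [0, 1]
        System.out.println(collapse(new long[] {43200, 43230, 43260}, n)); // [0]
    }
}
```

The pass is linear in the number of hits, so the cost is dominated by
having to touch every hit in the first place.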

My other plan was to create a class that implements the Searchable
interface.  This class would just forward all search requests to a
private IndexSearcher data member and post-process the results before
returning them.  I would then pass an array of these customized
searchers to a ParallelMultiSearcher.  Given enough parallel
processing, this might work in a reasonable timeframe.
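
Roughly, the wrapper idea looks like the sketch below.  To keep it
self-contained it does not use Lucene at all: HitSource is a
hypothetical one-method stand-in for a searcher (it is NOT the real
Searchable interface, which has many more methods), and hits are reduced
to sorted timestamps in seconds.  All names here are assumptions made
for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CollapsingSearcherSketch {
    /** Hypothetical stand-in for a searcher; not a Lucene interface. */
    public interface HitSource {
        /** Returns hit timestamps in seconds, sorted ascending. */
        List<Long> search(String query);
    }

    /** Forwards each query to a private delegate, then post-processes. */
    public static final class CollapsingSource implements HitSource {
        private final HitSource delegate;
        private final long windowSeconds;

        public CollapsingSource(HitSource delegate, long windowSeconds) {
            this.delegate = delegate;
            this.windowSeconds = windowSeconds;
        }

        @Override
        public List<Long> search(String query) {
            return collapse(delegate.search(query), windowSeconds);
        }
    }

    /** Drops every hit that is within n seconds of the hit before it. */
    static List<Long> collapse(List<Long> sorted, long n) {
        List<Long> kept = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            if (i == 0 || sorted.get(i) - sorted.get(i - 1) > n) {
                kept.add(sorted.get(i));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Fake delegate returning fixed timestamps, for illustration only.
        HitSource raw = query -> Arrays.asList(43200L, 43230L, 43260L, 43400L);
        HitSource collapsed = new CollapsingSource(raw, 45);
        System.out.println(collapsed.search("anything")); // [43200, 43400]
    }
}
```

One thing this sketch does not address: documents in different
sub-indexes that fall within N seconds of each other would each survive
their own searcher's pass, so the merged results might still need one
more collapsing pass.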


