lucene-java-user mailing list archives

From "Dion Almaer" <d...@almaer.com>
Subject RE: Dates and others
Date Mon, 01 Dec 2003 22:13:24 GMT
 

> -----Original Message-----
> From: Doug Cutting [mailto:cutting@lucene.com] 
> Sent: Monday, December 01, 2003 1:11 PM
> To: Lucene Users List
> Subject: Re: Dates and others
> 
> Dion Almaer wrote:
> > The only real item that I still want to tweak more is getting recent
> > results higher in the list.
> > 
> > I was wondering if something like this could work (or if there is a 
> > better solution)
> > 
> > At index time, I have the date of the content.  I could do some math 
> > where the higher the date (based on the time_t version or whatever) 
> > the more of a setBoost(metric). Or, for every month in the past,
> > create a larger negative number to setBoost()... or something like
> > that.
> > 
> > Would something like this make sense?
> 
> The problem with this approach is that eventually you'll 
> exhaust the range of the boost.  So this will only work if 
> you re-index things from scratch periodically, with a boost 
> of something like 1/days-ago.
> 
> If you're adding documents to the index in date order, then 
> you could use a HitCollector which adjusts scores according 
> to the document number, since document numbers increase as 
> you add to the index.
> 
> If you're not adding things in date order, then you can, when 
> you open the index, build an array mapping document numbers 
> to integer dates.  Then your hit collector can use this to 
> either boost or sort hits by date.
> 
> Or you could add a "month" or "week" field to documents, then 
> add it as a clause to your queries with a boost.  Then 
> documents matching the most recent week(s) and/or month(s) 
> would get the boost.
> 
> Doug

Interesting.  I implemented an approach that boosts based on the number of
months in the past, and after tweaking the boost amounts, it seems to do the
job. I do a fresh reindex every night (since the indexing process takes no
time at all... unlike our old search solution!)
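
A minimal sketch of a months-in-the-past boost, in plain Java with no Lucene
dependency. The decay formula, the DECAY_PER_MONTH constant, and the class and
method names are all assumptions for illustration, not the values actually
used; the resulting float is the kind of number one would hand to setBoost()
at index time. It is a monthly variant of the 1/days-ago idea, and it relies
on the nightly reindex to keep ages current.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

final class RecencyBoost {
    // Hypothetical tuning knob: how much each month of age dampens the boost.
    static final float DECAY_PER_MONTH = 0.1f;

    /** Whole months between the content date and the indexing date, never negative. */
    static long monthsAgo(LocalDate contentDate, LocalDate indexDate) {
        return Math.max(0, ChronoUnit.MONTHS.between(contentDate, indexDate));
    }

    /** Boost in (0, 1]: 1.0 for content from the current month, shrinking with age. */
    static float boostFor(long monthsAgo) {
        return 1.0f / (1.0f + DECAY_PER_MONTH * monthsAgo);
    }

    public static void main(String[] args) {
        LocalDate indexDate = LocalDate.of(2003, 12, 1);
        LocalDate fresh = LocalDate.of(2003, 11, 20);
        LocalDate stale = LocalDate.of(2003, 2, 1);
        // Newer content gets the larger boost.
        System.out.println("fresh: " + boostFor(monthsAgo(fresh, indexDate)));
        System.out.println("stale: " + boostFor(monthsAgo(stale, indexDate)));
    }
}
```

Because the boost is computed from the age on the day of indexing, it stays
within a fixed range only as long as the whole index is rebuilt regularly.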

I read content for the index from different sources. Sometimes the source
gives me documents loosely in date order, but not all of them. So, it seems
that one of the other approaches should be taken (adding a month/week field
etc).  I should look more into the HitCollector and see how it can help me.
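
A rough sketch of the "array mapping document numbers to integer dates" idea,
written in plain Java so that it runs without Lucene. The collect(doc, score)
method only mirrors the shape of a hit-collector callback; the class name, the
days-since-epoch date encoding, and the 30-day damping formula are assumptions
for illustration. In a real index the array would be filled once when the
index is opened.

```java
import java.util.ArrayList;
import java.util.List;

final class DateAwareCollector {
    /** One collected hit: document number plus its date-adjusted score. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int[] docDays;  // document number -> date as days since some epoch
    private final int today;      // "now" in the same unit
    private final List<Hit> hits = new ArrayList<>();

    DateAwareCollector(int[] docDays, int today) {
        this.docDays = docDays;
        this.today = today;
    }

    /** Called once per matching document; damps the raw score by the document's age. */
    void collect(int doc, float score) {
        int ageDays = Math.max(0, today - docDays[doc]);
        // Assumed formula: each 30 days of age adds 1 to the divisor.
        float adjusted = score / (1.0f + ageDays / 30.0f);
        hits.add(new Hit(doc, adjusted));
    }

    /** Hits ordered by adjusted score, best first. */
    List<Hit> topHits() {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> Float.compare(b.score, a.score));
        return sorted;
    }

    public static void main(String[] args) {
        int[] docDays = {100, 400, 390};  // documents were not added in date order
        DateAwareCollector collector = new DateAwareCollector(docDays, 400);
        collector.collect(0, 1.0f);  // best raw score, but 300 days old
        collector.collect(1, 0.8f);  // from today
        collector.collect(2, 0.9f);  // 10 days old
        for (Hit h : collector.topHits()) {
            System.out.println("doc " + h.doc + " adjusted score " + h.score);
        }
    }
}
```

Since the lookup is keyed by document number, it does not matter that the
sources deliver documents only loosely in date order.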

The other issue I have is that I would like to prioritize the title field.
At the moment I am lazy and add the title to the body (contents = title +
body), which seems to be OK... however, sometimes something that mentions the
search term in the title should appear higher up in the pecking order.

I am using the QueryParser (subclassed to disallow wildcards etc) to do the
dirty work for me. Should I get away from this and manage the queries myself
(and run a Multi against the title field as well as the contents)?
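
One possible middle ground, sketched in plain Java: keep the QueryParser but
rewrite the user's input so that every term is searched in both fields, with
the title clause carrying a boost via the `field:term^boost` query syntax.
The class name, the boost value, and the crude term cleaning (lowercase,
strip everything but letters and digits) are assumptions for illustration,
and it presumes the title is indexed as its own `title` field alongside
`contents`.

```java
import java.util.Locale;

final class TitleBoostQuery {
    /**
     * Expands "foo bar" into
     * "title:foo^4 contents:foo title:bar^4 contents:bar" (for titleBoost = 4).
     */
    static String build(String userInput, int titleBoost) {
        StringBuilder sb = new StringBuilder();
        for (String term : userInput.trim().split("\\s+")) {
            // Drop anything that is not a letter or digit so wildcards and other
            // query syntax cannot leak through; lowercasing also keeps words such
            // as AND / OR from being read as operators.
            String clean = term.replaceAll("[^\\p{L}\\p{N}]", "").toLowerCase(Locale.ROOT);
            if (clean.isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("title:").append(clean).append('^').append(titleBoost)
              .append(" contents:").append(clean);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The resulting string would then be handed to the query parser.
        System.out.println(build("lucene boost", 4));
    }
}
```

A document that matches a term only in its body still matches through the
`contents` clause; one that also matches in the title picks up the extra,
boosted clause and should score higher.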

Thanks for the great feedback,

Dion



