lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: An IDF variation with penalty for very rare terms
Date Wed, 13 Apr 2011 19:36:09 GMT
On Wed, Apr 13, 2011 at 01:01:09AM +0400, Earwin Burrfoot wrote:
> Excuse me for going somewhat off-topic, but has anybody ever seen/used -subj- ?
> Something that looks like http://dl.dropbox.com/u/920413/IDFplusplus.png
> Traditional log(N/x) tail, but when nearing zero freq, instead of
> going to +inf you do a nice round bump (with controlled
> height/location/sharpness) and drop down to -inf (or zero).
 
I haven't used that technique, nor can I quote academic literature blessing
it.  Nevertheless, what you're doing makes sense to me.
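
For concreteness, here is one way such a curve could be built.  This is only
a sketch and not the formula behind the linked image: it multiplies the
classic log(N/df) by a smooth gate that goes to zero as df goes to zero.  The
names damped_idf, pivot and sharpness are made up for illustration; pivot
roughly sets where the bump sits and sharpness how abruptly the penalty kicks
in.

```python
import math


def damped_idf(doc_freq, num_docs, pivot=5.0, sharpness=2.0):
    """Classic log(N/df) with a penalty for very rare terms.

    Hypothetical sketch: `pivot` and `sharpness` are illustrative
    parameters, not taken from any Lucene API or from the original post.
    """
    if doc_freq <= 0:
        return 0.0
    classic = math.log(num_docs / doc_freq)
    # Gate is ~0 for df << pivot and ~1 for df >> pivot, so the product
    # drops toward zero for near-unique terms and follows the
    # traditional tail everywhere else.
    gate = 1.0 - math.exp(-((doc_freq / pivot) ** sharpness))
    return classic * gate


if __name__ == "__main__":
    n = 1_000_000
    for df in (1, 2, 5, 20, 1000, 100_000):
        print(df, round(damped_idf(df, n), 3))
```

With these defaults, a term seen in a single document out of a million scores
well below one seen in twenty, while anything well past the pivot gets the
ordinary log(N/df) weight.  Both parameters would have to be tuned per corpus.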

> Rationale is that - most good, discriminating terms are found in at
> least a certain percentage of your documents, but there are lots of
> mostly unique crapterms, which at some collection sizes stop being
> strictly unique and with IDF's help explode your scores.

So you've designed a heuristic that allows you to filter a certain kind of
noise.  It sounds a lot like how people tune length normalization to adapt to
their document collections.  Many tuning techniques are corpus-specific.
Whatever works, works!

Marvin Humphrey



