lucene-dev mailing list archives

From Earwin Burrfoot
Subject An IDF variation with penalty for very rare terms
Date Tue, 12 Apr 2011 21:01:09 GMT
Excuse me for going somewhat off-topic, but has anybody ever seen or used
an IDF variation like the one in the subject?
It looks something like this:
the traditional log(N/x) tail, but as the frequency x nears zero, instead of
going to +inf the curve makes a nice round bump (with controlled
height/location/sharpness) and drops down to -inf (or zero).
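
To make the shape concrete, here is a minimal sketch of one curve with these properties. It is only an assumption about what such a function could look like, not something taken from Lucene or from a paper: the name penalized_idf, the logistic-style gate, and the location/sharpness defaults are all made up for illustration. The gate is close to 1 for common terms, so the log(N/x) tail survives, and it falls towards 0 faster than the logarithm grows, so the weight drops to zero for very rare terms.

```python
import math

def penalized_idf(df, n_docs, location=5.0, sharpness=2.0):
    """Hypothetical IDF variant: log(N/df) damped for very rare terms.

    location  - document frequency at which the gate equals 0.5
    sharpness - how quickly the gate falls off below `location`
    """
    if df <= 0:
        return 0.0  # unseen term: no weight at all
    classic = math.log(n_docs / df)                     # traditional IDF tail
    gate = 1.0 / (1.0 + (location / df) ** sharpness)   # -> 0 as df -> 0, -> 1 for large df
    return classic * gate

if __name__ == "__main__":
    n = 1_000_000
    for df in (1, 2, 5, 20, 100, 10_000):
        print(df, round(math.log(n / df), 3), round(penalized_idf(df, n), 3))
```

Height is not a separate knob in this sketch; a constant multiplier on the result would be one way to add it.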

It should be cool for cosine-measure-based (or something comparable)
document comparisons over dirty data (e.g. in a "more like this"
query, to mention Lucene at least once :) ).
The rationale is that most good, discriminating terms are found in at
least a certain percentage of your documents, but there are lots of
mostly unique crap terms, which at some collection sizes stop being
strictly unique and, with IDF's help, blow up your scores.
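
A toy illustration of that rationale, under loudly stated assumptions: the document frequencies, the terms, the collection size, and the damped_idf helper (plain IDF divided by a power-law penalty) are all invented for this example and are not Lucene's scoring. Two otherwise unrelated documents share a single near-unique junk term; with binary term frequency and IDF weights, that one term dominates the cosine under classic IDF and barely matters once rare terms are damped.

```python
import math

N_DOCS = 1_000_000
# Made-up document frequencies; "xqzv" stands in for a junk term seen in 2 docs.
DOC_FREQ = {"lucene": 2000, "search": 5000, "cooking": 3000,
            "recipe": 2500, "xqzv": 2}

def classic_idf(df):
    return math.log(N_DOCS / df)

def damped_idf(df, location=5.0, sharpness=2.0):
    # Hypothetical penalty: shrinks the weight when df is far below `location`.
    return classic_idf(df) / (1.0 + (location / df) ** sharpness)

def cosine(terms_a, terms_b, idf):
    """Cosine similarity of two term sets with binary tf and IDF weights."""
    weight = {t: idf(DOC_FREQ[t]) for t in terms_a | terms_b}
    dot = sum(weight[t] ** 2 for t in terms_a & terms_b)
    norm_a = math.sqrt(sum(weight[t] ** 2 for t in terms_a))
    norm_b = math.sqrt(sum(weight[t] ** 2 for t in terms_b))
    return dot / (norm_a * norm_b)

doc_a = {"lucene", "search", "xqzv"}
doc_b = {"cooking", "recipe", "xqzv"}

classic_score = cosine(doc_a, doc_b, classic_idf)
damped_score = cosine(doc_a, doc_b, damped_idf)

if __name__ == "__main__":
    print("classic IDF cosine:", round(classic_score, 3))
    print("damped IDF cosine: ", round(damped_score, 3))
```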

Kirill Zakharenko/Кирилл Захаренко
Phone: +7 (495) 683-567-4
ICQ: 104465785
