lucene-dev mailing list archives

From Earwin Burrfoot <ear...@gmail.com>
Subject An IDF variation with penalty for very rare terms
Date Tue, 12 Apr 2011 21:01:09 GMT
Excuse me for going somewhat off-topic, but has anybody ever seen/used -subj-?
Something that looks like http://dl.dropbox.com/u/920413/IDFplusplus.png
The traditional log(N/x) tail, but as the term frequency x nears zero,
instead of going to +inf the curve does a nice round bump (with
controlled height/location/sharpness) and drops down to -inf (or zero).
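
One possible way to get a curve of that shape, as a minimal sketch only: the
message gives no formula, so the gating factor below and the names
`damped_idf`, `pivot` and `sharpness` are assumptions made for illustration.
The classic log(N/x) is multiplied by a smooth factor that is close to 1 for
frequent terms and falls to 0 as x approaches zero, so the curve drops to
zero rather than to -inf.

```python
import math

def damped_idf(doc_freq, num_docs, pivot=5.0, sharpness=2.0):
    """Hypothetical IDF variant with a penalty for very rare terms.

    pivot     -- document frequency around which the penalty kicks in (assumed)
    sharpness -- how abruptly the weight falls off below the pivot (assumed)
    """
    if doc_freq <= 0:
        return 0.0  # unseen terms get no weight instead of +inf
    classic = math.log(num_docs / doc_freq)
    # Gate tends to 0 as doc_freq -> 0 and to 1 when doc_freq >> pivot.
    gate = 1.0 / (1.0 + (pivot / doc_freq) ** sharpness)
    return classic * gate

if __name__ == "__main__":
    n = 1_000_000
    for df in (1, 2, 5, 20, 100, 10_000):
        print(df, round(math.log(n / df), 3), round(damped_idf(df, n), 3))
```

In this sketch the location of the bump follows `pivot` and its steepness
follows `sharpness`; the height is not an independent knob here and would
need an extra scale factor.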

Should be cool when doing cosine-measure (or something
comparable) based document comparisons (e.g. in a "more like this"
query, to mention Lucene at least once :) ) over dirty data.
The rationale is that most good, discriminating terms are found in at
least a certain percentage of your documents, but there are lots of
mostly unique crapterms, which at some collection sizes stop being
strictly unique and, with IDF's help, explode your scores.
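
To make the rationale concrete, here is a small sketch with made-up numbers.
The collection size, the document frequencies, the junk token "xq9z" and the
`gated_idf` penalty are all hypothetical. Two documents share nothing except
one junk term that occurs in two documents, and the cosine similarity is
computed once with classic IDF weights and once with a rare-term penalty.

```python
import math

N = 1_000_000  # hypothetical collection size

# Hypothetical document frequencies; "xq9z" stands in for a junk token.
DOC_FREQ = {"lucene": 20_000, "index": 50_000, "query": 30_000,
            "search": 40_000, "xq9z": 2}

def classic_idf(df):
    return math.log(N / df)

def gated_idf(df, pivot=5.0, sharpness=2.0):
    # Assumed penalty: factor tends to 0 as df -> 0 and to 1 for df >> pivot.
    return classic_idf(df) / (1.0 + (pivot / df) ** sharpness)

def cosine(terms_a, terms_b, idf):
    """Cosine similarity of two term sets, each term weighted by idf(df)."""
    wa = {t: idf(DOC_FREQ[t]) for t in terms_a}
    wb = {t: idf(DOC_FREQ[t]) for t in terms_b}
    dot = sum(wa[t] * wb[t] for t in wa.keys() & wb.keys())
    norm_a = math.sqrt(sum(w * w for w in wa.values()))
    norm_b = math.sqrt(sum(w * w for w in wb.values()))
    return dot / (norm_a * norm_b)

doc_a = {"lucene", "index", "xq9z"}
doc_b = {"query", "search", "xq9z"}

classic_sim = cosine(doc_a, doc_b, classic_idf)
gated_sim = cosine(doc_a, doc_b, gated_idf)

if __name__ == "__main__":
    print("classic IDF:", round(classic_sim, 3))
    print("gated IDF:  ", round(gated_sim, 3))
```

With classic IDF the single shared junk term dominates both vectors and the
two otherwise unrelated documents look very similar; with the penalty applied
the similarity is much lower.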

-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: earwin@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785
