lucene-dev mailing list archives

From eks dev <>
Subject Re: An IDF variation with penalty for very rare terms
Date Thu, 14 Apr 2011 14:03:46 GMT
indeed, frequency usage is collection- and use-case-dependent...
Not directly your case, but the idea is the same.

We used this information in a spell/typo-variations context to
boost or penalize similarity, by dividing terms into a few
frequency-based segments.

Take an example:
Maria - Very High Freq
Marina - Very High Freq
Mraia - Very Low Freq

similarity(Maria, Marina) is very high by string-distance measures,
practically the same as similarity(Maria, Mraia), but the likelihood
that Mraia is a typo is an order of magnitude higher than for a
VHF-VHF pair.

The point being, frequency hides a lot of semantics, and how you tune
it, as Marvin said, does not really matter, as long as it works.

We also never found a theory that formalizes this, but it was
logical, and it worked in practice.
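
As an illustration, the frequency-segment idea can be sketched like
this (the band thresholds, band names, and weight values here are
invented for the example, not what we actually used):

```python
# Sketch: adjust a raw string-distance similarity by the frequency
# bands of the two terms. All thresholds/weights are illustrative.

def freq_band(freq: int) -> str:
    """Bucket a term's collection frequency into coarse bands."""
    if freq >= 10_000:
        return "VHF"   # very high frequency
    if freq >= 100:
        return "HF"
    if freq >= 5:
        return "LF"
    return "VLF"       # very low frequency: likely a typo or garbage

# Multiplier applied to raw similarity for (query_band, candidate_band).
# VHF -> VLF is boosted: a rare variant of a common term is probably
# a typo of it. VHF -> VHF is penalized: both terms are well
# established, so a small edit distance more likely means a genuinely
# different word (Maria vs. Marina).
BAND_WEIGHT = {
    ("VHF", "VLF"): 1.3,
    ("VHF", "VHF"): 0.7,
}

def adjusted_similarity(raw_sim: float, query_freq: int,
                        cand_freq: int) -> float:
    bands = (freq_band(query_freq), freq_band(cand_freq))
    return raw_sim * BAND_WEIGHT.get(bands, 1.0)
```

With this, the same raw similarity of 0.9 gets boosted for a
VHF query against a VLF candidate (Maria vs. Mraia) and damped for a
VHF-VHF pair (Maria vs. Marina).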

What you said makes sense to me, especially for very big collections
(or specialized domains with a limited vocabulary...): the bigger the
collection, the higher the "garbage density" in the VLF domain (above
a certain collection size). If the "vocabulary" in your collection is
somehow limited, there is a size limit beyond which most new terms
(VLF) are "crapterms". One could try to estimate how "saturated" a
collection is...
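
For what it's worth, one way to get a bump-shaped IDF like the one
you describe (my own parametrization, nothing we actually shipped):
damp the classic log(N/df) with a sigmoid gate, so very-low-df terms
are pushed toward zero instead of exploding toward +inf.

```python
import math

# Sketch: idf(df) = log(N/df) * sigmoid((df - pivot) / sharpness).
# For large df this approaches the traditional log(N/df) tail; near
# df -> 0 the gate suppresses the score, producing a rounded bump.
# "pivot" controls where the bump sits, "sharpness" how steep the
# drop is; both are tuning knobs, not principled constants.

def penalized_idf(df: int, n_docs: int,
                  pivot: float = 5.0, sharpness: float = 1.5) -> float:
    gate = 1.0 / (1.0 + math.exp(-(df - pivot) / sharpness))
    return math.log(n_docs / df) * gate
```

With N = 1,000,000 and the defaults, a df=1 term scores far below a
df=20 term, while for df well above the pivot the curve is
indistinguishable from plain log(N/df).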


On Wed, Apr 13, 2011 at 9:36 PM, Marvin Humphrey <> wrote:
> On Wed, Apr 13, 2011 at 01:01:09AM +0400, Earwin Burrfoot wrote:
>> Excuse me for a somewhat off-topic question, but has anybody ever seen/used -subj- ?
>> Something that looks like
>> a traditional log(N/x) tail, but when nearing zero frequency, instead of
>> going to +inf it makes a nice round bump (with controlled
>> height/location/sharpness) and drops down to -inf (or zero).
> I haven't used that technique, nor can I quote academic literature blessing
> it.  Nevertheless, what you're doing makes sense to me.
>> The rationale is that most good, discriminating terms are found in at
>> least a certain percentage of your documents, but there are lots of
>> mostly-unique crapterms, which at some collection sizes stop being
>> strictly unique and, with IDF's help, explode your scores.
> So you've designed a heuristic that allows you to filter a certain kind of
> noise.  It sounds a lot like how people tune length normalization to adapt to
> their document collections.  Many tuning techniques are corpus-specific.
> Whatever works, works!
> Marvin Humphrey

