lucene-java-user mailing list archives

From Chris Hostetter
Subject Re: Classifier4J and Lucene
Date Sun, 23 Oct 2005 17:14:13 GMT

: Not sure if this makes sense...but curious if anyone has ideas, or has
: done something like this.

I have a few ideas, none of which are mutually exclusive...

1) look at the Explain output for the various queries you are generating
to help you understand why your boosts aren't having as much of an effect
as you want.

2) subclass DefaultSimilarity and override the lengthNorm method with a
new one, which *heavily* penalizes really long field values.  this method
gets the name of the field when asked to perform a calculation, so you can
apply this special behavior only to the fields that users are able to
keyword-spam, if you want.

3) subclass DefaultSimilarity and override the tf(float) method ... this
allows you to specify how much of an impact the frequency of any Term has
on the overall score.  Usually, high frequency items are given a high score
... but if you are dealing with records which are typically very small,
you may want to penalize docs with a high frequency.  at the very least,
you might want to flatten the curve.  if you really want to flatten it so
that spamming does no good at all, you can use something like this...

        // any match scores the same, no matter how often the term repeats
        public float tf(float freq) {
            if (freq > 0.0f) return 1.0f;
            else return 0.0f;
        }

...but that may be overkill, experimentation should help you find a
happy medium.
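
as a rough illustration of ideas 2 and 3, here is a minimal sketch of what
a "happy medium" might look like.  it is plain Java with no Lucene
dependency, so the class name, the "keywords" field name, and both
thresholds are made up for the example; the two methods only mirror the
shape of lengthNorm(String, int) and tf(float), and in a real index their
bodies would go into your DefaultSimilarity subclass.  it assumes the
stock behavior is 1/sqrt(numTerms) for lengthNorm and sqrt(freq) for tf.

```java
// Standalone sketch, not a Lucene class: the method bodies are meant to be
// copied into a DefaultSimilarity subclass once the numbers have been tuned.
class SpamResistantScoring {

    // Hypothetical name of a field that users can stuff with keywords.
    static final String USER_FIELD = "keywords";

    // Hypothetical cutoff: fields longer than this get the extra penalty.
    static final int LONG_FIELD_TERMS = 50;

    // Hypothetical cap: occurrences beyond this add nothing to the score.
    static final float MAX_COUNTED_FREQ = 4.0f;

    // Idea 2: penalize very long values, but only in the user-editable field.
    static float lengthNorm(String fieldName, int numTerms) {
        int n = Math.max(numTerms, 1); // avoid dividing by zero
        float base = (float) (1.0 / Math.sqrt(n)); // assumed stock curve
        if (USER_FIELD.equals(fieldName) && n > LONG_FIELD_TERMS) {
            // Extra factor shrinks from 1.0 toward 0 as the field grows,
            // so the norm stays continuous at the cutoff.
            return base * ((float) LONG_FIELD_TERMS / n);
        }
        return base;
    }

    // Idea 3: keep the usual sqrt curve for small counts, then go flat.
    static float tf(float freq) {
        if (freq <= 0.0f) return 0.0f;
        return (float) Math.sqrt(Math.min(freq, MAX_COUNTED_FREQ));
    }
}
```

with these numbers, repeating a term more than 4 times in one field buys
nothing extra, and a 200-term "keywords" value gets a quarter of the norm
that a 200-term value in any other field would.  keep in mind that length
norms are computed at index time, so a change to lengthNorm only shows up
after the documents have been re-indexed with the new Similarity.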

