lucene-java-user mailing list archives

From Jeff Rodenburg <jeff.rodenb...@gmail.com>
Subject Re: Classifier4J and Lucene
Date Sun, 23 Oct 2005 16:25:11 GMT
Sounds like you might have to consider both, if the first one doesn't solve
your issue. A company field sounds like a single-entry field, i.e. one that
can't be "spammed up" with repeated terms such as "Oracle Oracle Oracle". It
also sounds as if you're searching multiple fields, and that some fields are
more important than others.

It sounds like there are expectations about what documents rise to the top
for a given search, so I would suggest starting by getting your boost
prioritization in order by working with a "clean" or non-spammed index.
After that, bring in the spammed index and go from there. You're right, you
won't be able to boost away the spammers.
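
To illustrate why a modest boost may not be enough, here is a rough sketch,
not Lucene's actual scoring code. It assumes the classic default similarity,
which damps term frequency with a square root, and it deliberately leaves out
idf, length normalization, and query normalization. The class and method
names are made up for the example.

```java
// Simplified sketch, not Lucene's real scoring code: idf, length
// normalization, and query normalization are deliberately left out.
public class BoostVersusFrequency {

    // Assumption: term frequency is damped with a square root, as in the
    // classic default similarity.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Hypothetical per-field contribution: damped frequency times boost.
    static double fieldScore(int freq, double boost) {
        return tf(freq) * boost;
    }

    public static void main(String[] args) {
        double spammedBio = fieldScore(25, 1.0);   // "oracle" 25 times in bio
        double cleanCompany = fieldScore(1, 2.0);  // "oracle" once in company
        System.out.println("bio=" + spammedBio + " company=" + cleanCompany);
        // Boost the company field would need to tie with the spammed bio.
        System.out.println("break-even boost=" + tf(25) / tf(1));
    }
}
```

Under these assumptions the spammed bio still outscores the boosted company
field, and the gap only closes once the company boost reaches the square root
of the repeat count. Real scores will differ because length normalization
also penalizes long fields.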

I don't have much background with Classifier4J, but it seems that words
would need to be considered spam differently across different fields, if I
understand your indexing/querying structure. I like the approach of indexing
a boiled-down summary, though I'm not sure whether Classifier4J would have
you doing a lot of work to get there.
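
As a much simpler stand-in for the "boiled summary" idea (this is not
Classifier4J, and not a Bayesian filter), one could collapse repeated words
in a field before handing the text to the indexer. A minimal sketch, with a
hypothetical class and method name:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class BoiledSummary {

    // Keeps the first occurrence of each whitespace-separated word,
    // compared case-insensitively, and drops later repeats.
    static String dedupe(String text) {
        Set<String> seen = new HashSet<>();
        StringBuilder out = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty input yields a single empty token
            }
            if (seen.add(word.toLowerCase(Locale.ROOT))) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(word);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dedupe("Oracle DBA oracle Oracle tuning ORACLE"));
    }
}
```

Indexing the deduplicated text means each word counts once per field, so
repeating a keyword no longer raises its frequency. The trade-off is that
legitimate repetition is flattened too.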

Hope this helps.

-- jr


On 10/23/05, msftblows@aol.com <msftblows@aol.com> wrote:
>
> Hey-
>
> I have an indexer at my company that I wrote a while back that indexes
> database content (users and their profiles)...one of the next requirements
> of the project is to avoid 'spam' in hits. For example, if I do a search for
> oracle, and oracle is in 25 places in someone's bio field...and another
> person has it in one place in his company field, the 25 places will of
> course rank higher. Unfortunately, people who know the system know that the
> more you have certain keywords in your user profile, the higher you will be
> on the list. I was thinking I can do one of two things:
>
> 1. Work with Lucene algo to lower scores in certain fields (boost in
> others)...this would work, but the boost has such a small factor in scoring
> (or so it seems), that in some cases it won't matter. (if I boost company to
> 2.0, and bio to 1.0 in some cases with xxx hits in bio, that is still
> first in score)
>
> 2. Using Classifier4J (http://classifier4j.sourceforge.net/)...I can use
> same idea as a mail filter and use the Bayesian Classifier to train it that
> certain words would be spam...then just index the summary. Throwing this out
> there...not even sure that it will work...
>
> Not sure if this makes sense...but curious if anyone has ideas, or has
> done something like this.
>
> Regards!
> -Joe
>
>
