lucene-java-user mailing list archives

From msftbl...@aol.com
Subject Classifier4J and Lucene
Date Sun, 23 Oct 2005 14:47:09 GMT
Hey-
 
I have an indexer at my company that I wrote a while back that indexes database content (users
and their profiles). One of the next requirements of the project is to avoid 'spam' in hits. For example,
if I search for oracle, and oracle appears in 25 places in someone's bio field while another
person has it in one place in his company field, the profile with 25 occurrences will of course score higher. Unfortunately,
people who know the system know that the more often certain keywords appear in your user profile, the
higher you will be on the list. I was thinking I can do one of two things:
 
1. Work with the Lucene scoring algorithm to lower scores in certain fields (and boost others). This would
work, but the boost seems to have such a small effect on scoring that in some cases
it won't matter. (If I boost company to 2.0 and bio to 1.0, a profile with xxx hits in
bio can still come first in score.)
 
2. Use Classifier4J (http://classifier4j.sourceforge.net/). I can apply the same idea as a mail
filter and train the Bayesian classifier to treat certain words as spam, then
just index the summary. Throwing this out there...not even sure that it will work...
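
To put rough numbers on option 1, here is a small standalone sketch (plain Java, no Lucene dependency). It assumes the classic Lucene similarity, where the term-frequency factor is sqrt(freq), and it deliberately ignores idf, length normalization, and query normalization, so the values are illustrative only and are not real Lucene scores. The class name TfBoostSketch and the cappedTf variant are hypothetical and only show what counting a keyword at most once per field would do.

```java
// Sketch only: mimics the tf * boost part of classic Lucene scoring.
// idf, length normalization, and query normalization are left out on purpose.
final class TfBoostSketch {

    // Classic Lucene similarity computes tf as sqrt(freq).
    static double classicTf(int freq) {
        return Math.sqrt(freq);
    }

    // Hypothetical alternative: a term counts at most once per field,
    // so repeating a keyword no longer raises the score.
    static double cappedTf(int freq) {
        return freq > 0 ? 1.0 : 0.0;
    }

    // Simplified per-field contribution: term-frequency factor times field boost.
    static double score(double tf, double fieldBoost) {
        return tf * fieldBoost;
    }

    public static void main(String[] args) {
        // Profile A: "oracle" 25 times in bio (boost 1.0).
        // Profile B: "oracle" once in company (boost 2.0).
        System.out.println("classic bio:     " + score(classicTf(25), 1.0)); // 5.0
        System.out.println("classic company: " + score(classicTf(1), 2.0));  // 2.0
        System.out.println("capped bio:      " + score(cappedTf(25), 1.0));  // 1.0
        System.out.println("capped company:  " + score(cappedTf(1), 2.0));   // 2.0
    }
}
```

Under these simplifying assumptions, the stuffed bio field still wins with a 2.0 boost on company (5.0 against 2.0), which matches what I am seeing. With the capped variant the boost decides the order instead (1.0 against 2.0).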
 
Not sure if this makes sense, but I'm curious if anyone has ideas or has done something like
this.
 
Regards!
-Joe
