mahout-user mailing list archives

From Robin Anil <>
Subject Re: unknown test data twenty-newsgroups example
Date Fri, 01 Oct 2010 10:55:29 GMT
> Let me list what I understood. Pl confirm if I got it correct?
> Add duplicate extra lines many times in an extra file (conforming to the
> format required by the Bayes Classifier) in the format
> <class-name1><tab><word1> <word2>
> If I want to increase the weight of word1 and word2, so that text with
> those words have higher chance of getting classified as <class-name1>
No. Duplicating lines increases DF (document frequency) and therefore
decreases IDF (inverse document frequency), so the weight goes down. To
increase the weight of a word, repeat the word within the same line.
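
To make the difference concrete, here is a minimal sketch using the textbook
weight tf * log(N / df). This is an assumption for illustration only: the
exact weighting used by the Mahout Bayes classifier may differ, the tfidf
helper and the toy documents are hypothetical, and the class label and tab
are left out of each line for brevity.

```python
import math

def tfidf(docs, doc_index, word):
    """Textbook TF-IDF of `word` in docs[doc_index]: tf * log(N / df).

    Assumption: plain unsmoothed TF-IDF, not Mahout's actual formula.
    """
    tokens = [d.split() for d in docs]
    tf = tokens[doc_index].count(word)          # occurrences in this line
    df = sum(1 for t in tokens if word in t)    # lines containing the word
    return tf * math.log(len(docs) / df) if df else 0.0

# Toy corpus: one line per training document.
base = ["word1 word2 alpha", "alpha beta", "beta gamma", "gamma delta"]

# Option A: duplicate the whole line three more times (raises DF).
duplicated = base + ["word1 word2 alpha"] * 3

# Option B: repeat the word inside the same line (raises TF, DF unchanged).
repeated = ["word1 word1 word1 word2 alpha"] + base[1:]

w_base = tfidf(base, 0, "word1")
w_dup = tfidf(duplicated, 0, "word1")
w_rep = tfidf(repeated, 0, "word1")

print(f"baseline:        {w_base:.3f}")
print(f"duplicated line: {w_dup:.3f}")
print(f"repeated word:   {w_rep:.3f}")
```

Under this formula the duplicated-line weight comes out below the baseline,
while the repeated-word weight comes out above it.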

