mahout-user mailing list archives

From Bhaskar Ghosh <bjgin...@yahoo.co.in>
Subject Re: unknown test data twenty-newsgroups example
Date Fri, 01 Oct 2010 19:56:55 GMT
Thanks Ted, Robin, and Neil. My doubts are cleared now, and I will try the
approach.
 Regards
Bhaskar Ghosh
Hyderabad, India

http://www.google.com/profiles/bjgindia

"Ignorance is Bliss... Knowledge never brings Peace!!!"




________________________________
From: Ted Dunning <ted.dunning@gmail.com>
To: user@mahout.apache.org
Cc: Bhaskar Ghosh <bjgindia@yahoo.co.in>; neil.ghosh@gmail.com
Sent: Sat, 2 October, 2010 12:11:53 AM
Subject: Re: unknown test data twenty-newsgroups example


Yes.  Instance = training example.

Your method of duplicating lines is just what Robin meant.


On Fri, Oct 1, 2010 at 3:55 AM, Robin Anil <robin.anil@gmail.com> wrote:

>> Let me list what I understood. Please confirm whether I have it right.
>>
>> Add duplicate lines many times to an extra file (conforming to the
>> format required by the Bayes Classifier), in the format
>> <class-name1><tab><word1> <word2>
>> if I want to increase the weight of word1 and word2, so that text with
>> those words has a higher chance of being classified as <class-name1>.
>
> No. Duplicating lines increases DF (document frequency) and therefore
> decreases IDF (inverse document frequency), so the weight goes down. To
> increase the weight of a word, repeat the word in the same line.
>
> Regards
> Robin
>
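
Robin's point about DF and IDF can be checked with a small sketch. It assumes a plain tf * log(N / df) weighting, which is a textbook simplification and not necessarily the exact formula the Mahout Bayes trainer uses. The function name tfidf and the toy documents are made up for illustration only.

```python
import math
from collections import Counter


def tfidf(docs, doc_index, word):
    """Textbook tf * log(N / df) weight of `word` in docs[doc_index].

    Assumption: this is a simplified weighting for illustration,
    not Mahout's actual implementation.
    """
    tokens = [d.split() for d in docs]
    tf = Counter(tokens[doc_index])[word]          # occurrences in this line
    df = sum(1 for t in tokens if word in t)       # lines containing the word
    return tf * math.log(len(docs) / df)


# Hypothetical toy corpus: one training line per list entry.
base = ["word1 word2", "other text", "more text", "yet more"]

# Option A: duplicate the whole line (raises df for word1).
dup = base + ["word1 word2"]

# Option B: repeat the word inside the same line (raises tf for word1).
rep = ["word1 word1 word2"] + base[1:]

w_base = tfidf(base, 0, "word1")  # 1 * log(4 / 1)
w_dup = tfidf(dup, 0, "word1")    # 1 * log(5 / 2)
w_rep = tfidf(rep, 0, "word1")    # 2 * log(4 / 1)

print(w_base, w_dup, w_rep)
```

Under this simplified weighting, duplicating the line lowers the per-line weight of word1, while repeating the word within the line raises it.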


