mahout-user mailing list archives

From Loic Descotte <loic.desco...@kelkoo.com>
Subject SGD vs Naive Bayes for classification
Date Fri, 09 Sep 2011 15:41:50 GMT
Hello,

This is my first mail to the Mahout mailing list :)

I'm working on a classification problem and I'm trying to find out which
algorithm would better suit my needs.
I've read that SGD is better than Naive Bayes for small to medium data
sets. Does that refer to the size of the training data, to the size of
the data to classify, or to both?
Also, does "better" mean faster, or does it also give more accurate
results than Naive Bayes on data sets of this size?

My goal is to make predictions on thousands of text entries, but with
as little training data as possible (categories may change often, so I
will not always have hundreds of training entries for each category).

Another question: in all the examples I've found, Naive Bayes is used to
analyze sets containing a lot of keywords and to classify them into the
right category (e.g. the Wikipedia examples:
https://www.ibm.com/developerworks/java/library/j-mahout/#N10412 ).

The SGD examples are a little different: instead of working on word
sequences, they use several predictor variables, and each predictor has
only one value for each entry.

E.g. (in Mahout in Action):

  $MAHOUT_HOME/bin/mahout trainlogistic --input donut.csv \
    --output ./model \
    --target color --categories 2 \
    --predictors x y --types numeric \
    --features 20 --passes 100 --rate 50

In this example, the x and y predictors each have only one value per
entry.

My need is closer to the Naive Bayes Wikipedia examples: I want to
analyze a text and automatically find its category. So I have only
one predictor variable (the words of the text), and this predictor
variable is multivalued (several words).

Is it possible to use the SGD algorithm (maybe better for me because I
have small data sets) with text-only entries (like blog posts)?
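
To make the question concrete, here is a minimal sketch of the idea in
plain Python. It is not Mahout's API: every name in it (hash_features,
train_sgd, predict, NUM_FEATURES) and the toy training data are made up
for illustration. It assumes that the words of a text can be hashed into
a fixed-size vector of counts, so that a multivalued text predictor
becomes an ordinary numeric input for a logistic regression trained by
SGD.

```python
import math
import zlib

NUM_FEATURES = 1024  # hypothetical size of the hashed feature vector


def hash_features(text, num_features=NUM_FEATURES):
    """Map a text to a sparse {index: count} vector (hashing trick)."""
    vec = {}
    for word in text.lower().split():
        # crc32 is used because it is stable across runs,
        # unlike Python's built-in hash() for strings.
        idx = zlib.crc32(word.encode("utf-8")) % num_features
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


def predict(weights, vec):
    """Probability of class 1 under a logistic model (no bias term)."""
    z = sum(weights[i] * v for i, v in vec.items())
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(examples, passes=50, rate=0.5, num_features=NUM_FEATURES):
    """Binary logistic regression, updated one example at a time."""
    weights = [0.0] * num_features
    for _ in range(passes):
        for text, label in examples:
            vec = hash_features(text, num_features)
            error = label - predict(weights, vec)
            # Gradient step on the log-likelihood for this example only.
            for i, v in vec.items():
                weights[i] += rate * error * v
    return weights


# Made-up toy data: 1 = "sports", 0 = "cooking".
TRAIN = [
    ("the team won the football match", 1),
    ("a great goal in the final game", 1),
    ("bake the cake with butter and sugar", 0),
    ("chop the onions and fry the garlic", 0),
]

weights = train_sgd(TRAIN)
```

With this sketch, predict(weights, hash_features("some new text"))
returns the estimated probability that the text belongs to class 1.
Handling more than two categories would need one such model per
category or a multinomial variant, which is not shown here.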

Thanks a lot for your time. Tell me if my explanations are not clear
enough :)

Loic
