mahout-user mailing list archives

From "Zehao Jin" <zehao...@gmail.com>
Subject Re: A Mahout Naive Bayes classifier problem
Date Fri, 11 May 2012 01:28:10 GMT
Thanks for your help, Robin Anil, Lance Norskog and Nimesh Parikh. I've successfully completed
the Chinese text classification. I think the problem was the total number of terms: it was
too large. It could also be my fault, because I forgot to remove some punctuation. Thanks
again.
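
A minimal sketch of the cleanup described above, assuming the training file uses the one-document-per-line format from the original question (label, tab, space-separated words). The function names are hypothetical and not part of Mahout, and the punctuation test relies on Unicode general categories, which is only one possible definition of punctuation.

```python
import unicodedata


def is_punctuation(token):
    """Return True if every character of the token is Unicode punctuation (category P*)."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def clean_line(line):
    """Split one 'label<TAB>word1 word2 ...' line and drop punctuation-only tokens."""
    label, _, text = line.rstrip("\n").partition("\t")
    tokens = [t for t in text.split() if not is_punctuation(t)]
    return label, tokens


def vocabulary_size(lines):
    """Count the distinct terms left after punctuation removal."""
    vocab = set()
    for line in lines:
        _, tokens = clean_line(line)
        vocab.update(tokens)
    return len(vocab)
```

Running `vocabulary_size` on the training file before submitting the job gives a rough idea of how many terms the trainer will have to handle.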


Zehao Jin, SCUT, China.

From: Nimesh Parikh
Date: 2012-05-08 22:11
To: user
Subject: Re: A Mahout Naive Bayes classifier problem
Well, you could try changing the "UTF-8" parameter to something
else.

Thanks,
Nimesh
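
Before changing any charset setting, it may help to confirm whether the input really is valid UTF-8. The sketch below is an assumption about how one could check that; it is not a Mahout tool, and the function name is hypothetical. It takes the lines as raw bytes and reports the first line that does not decode.

```python
def first_invalid_utf8_line(raw_lines):
    """Return the 1-based number of the first line that is not valid UTF-8, or None."""
    for number, raw in enumerate(raw_lines, 1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            return number
    return None
```

For a file on local disk, pass a binary file object, for example `first_invalid_utf8_line(open(path, "rb"))`, where `path` is the training file.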

On Sat, May 5, 2012 at 5:53 AM, Lance Norskog <goksron@gmail.com> wrote:

> Yes, it could be the charset problem. Also, it could be the total
> number of terms you supply.
>
> Which analyzer do you use? Is it the Lucene "CJKAnalyzer"? This
> creates bigrams of all successive characters, and so the number of unique
> terms explodes. This will cause the Hadoop job to explode. The "Smart
> Chinese Analyzer" uses a trained model to split text into 1-, 2- and
> 3-character words. The "Standard Analyzer" will split all CJK text into
> single-character terms. Given that this is a Bayesian model, the Bayesian
> assumption would be that single terms are good enough. I would go with
> the StandardAnalyzer.
>
> (I learned all of this just now in my day job in the Lucene business.)
>
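
To illustrate why overlapping bigrams enlarge the vocabulary, here is a small sketch in plain Python. It does not call Lucene and only imitates the two tokenization styles (single characters versus overlapping character pairs). The function names and sample strings are made up for illustration.

```python
def unigrams(text):
    """One term per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


def bigrams(text):
    """One term per pair of adjacent characters (overlapping)."""
    chars = unigrams(text)
    return [a + b for a, b in zip(chars, chars[1:])]


def distinct_terms(docs, tokenizer):
    """Size of the vocabulary a tokenizer produces over a list of documents."""
    vocab = set()
    for doc in docs:
        vocab.update(tokenizer(doc))
    return len(vocab)
```

With unigrams the vocabulary is bounded by the number of distinct characters in the corpus, while with bigrams it can grow toward the square of that number.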
> On Fri, May 4, 2012 at 6:32 AM, Robin Anil <robin.anil@gmail.com> wrote:
> > Can you provide the console output when you run train or test?
> > On May 4, 2012 8:09 AM, "Zehao Jin" <zehaojin@gmail.com> wrote:
> >
> >> Dear all,
> >> I'm a Mahout beginner, and I need to use the Mahout Naive Bayes
> >> classifier for text classification. To get started, I followed the
> >> Twenty Newsgroups example:
> >> 1. Start the Hadoop cluster.
> >> 2. Run the 20 newsgroups example by executing the script
> >> $ ./examples/bin/build-20news-bayes.sh and choose the Naive Bayes method.
> >> 3. Finally, I got the same confusion matrix as the one you posted here:
> >> https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups
> >> But I have to classify Chinese texts, and I had no clue how, so I read
> >> the shell script examples/bin/build-20news-bayes.sh to learn how this
> >> example is processed. Then I followed the same steps as the script:
> >> 1. Prepare the training data.
> >> The script uses
> >> org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups to format the
> >> e-mail texts into one document per line: the label, then the words.
> >> Chinese is different from English: the words are not separated by
> >> spaces, and different combinations of characters have different
> >> meanings. So I used a Chinese text analyzer to split the words and
> >> matched the format. Each line looks like this:
> >> Label+'\t'+word1 word2 ....+'\n';
> >> The example's analyzer output:
> >> And the Chinese analyzer output:
> >>
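
The line format described in step 1 can be sketched as follows. This is an illustrative helper, not the code used in the thread: the function names are hypothetical, the tokens are assumed to come from whatever Chinese analyzer is in use, and writing the file as UTF-8 is an assumption.

```python
import io


def format_document(label, tokens):
    """Build one training line: label, a tab, space-separated tokens, a newline."""
    if "\t" in label or "\n" in label:
        raise ValueError("label must not contain tabs or newlines")
    # Remove whitespace inside tokens so each document stays on a single line.
    clean = [t for t in ("".join(tok.split()) for tok in tokens) if t]
    return label + "\t" + " ".join(clean) + "\n"


def write_corpus(path, documents):
    """Write (label, tokens) pairs to a UTF-8 text file, one document per line."""
    with io.open(path, "w", encoding="utf-8", newline="\n") as out:
        for label, tokens in documents:
            out.write(format_document(label, tokens))
```

For example, `format_document("体育", ["今天", "比赛"])` yields the label, a tab, the two words separated by a space, and a trailing newline.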
> >> 2. Put the formatted training data and test data into HDFS. (My Hadoop
> >> platform has 1 namenode and 4 datanodes on Fedora 14.) The example has 20
> >> categories, and my corpus has 10 categories:
> >> The example:              My categories:
> >>
> >> 3. Train and test the classifier on Hadoop.
> >> The example does it like this:
> >>
> >> ./bin/mahout trainclassifier -i /20news-bydate/bayes-train-input -o /20news-bydate/bayes-model -type bayes -ng 1 -source hdfs
> >>
> >> ./bin/mahout testclassifier -m /20news-bydate/bayes-model -d /20news-bydate/bayes-test-input -type bayes -ng 1 -source hdfs -method mapreduce
> >> My commands follow the example exactly; the only difference is the
> >> directory.
> >>
> >> Strangely, I cannot get the same result as the example. I ran the
> >> program several times, but the MapReduce job always fails:
> >> Task xxx failed to report status for 600 seconds. Killing.
> >>
> >> What I want to ask is: are the Mahout trainclassifier
> >> (./bin/mahout trainclassifier xxx) and testclassifier
> >> (./bin/mahout testclassifier xxx) commands suitable for my program? Or
> >> can they only be used with the 20 newsgroups example? If they cannot be
> >> used, it will be really hard for me to implement the Naive Bayes
> >> algorithm myself... Or is it a charset problem? Many problems are
> >> caused by that. Can you give me some support? I have been scratching
> >> my head for a few days. Thank you very much!!!
> >> ------------------------------
> >>  Zehao Jin, SCUT, China.
> >>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>