Date: Sun, 26 Sep 2010 04:28:10 -0400 (EDT)
From: confluence@apache.org
To: commits@mahout.apache.org
Reply-To: dev@mahout.apache.org
Message-ID: <22587929.19468.1285489690205.JavaMail.confluence@thor>
Subject: [CONF] Apache Mahout > Twenty Newsgroups

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Twenty Newsgroups (https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups)

Change Comment:
---------------------------------------------------------------------
-source is a mandatory parameter while running trainer on hadoop

Edited by
Neil Ghosh:
---------------------------------------------------------------------
h2. Twenty Newsgroups Classification Example

h2. Prerequisites

* Mahout has been downloaded ([instructions here|http://cwiki.apache.org/confluence/display/MAHOUT/index#index-Installation%2FSetup])
* Maven is available
* Your environment has the following variables:
| {{HADOOP_HOME}} | Environment variable that points to where Hadoop lives |
| {{MAHOUT_HOME}} | Environment variable that points to where Mahout lives |

h2. Setting up Mahout

After downloading the distribution, unzip/untar it into the directory of your choice, then:
# In the trunk directory, compile everything and create the Hadoop job:
{noformat}mvn install{noformat}

h3. For Mahout 0.1:

{HTMLcomment}
# If you've run this before, you may want to rm -rf the work and temp directories
{HTMLcomment}

In {{$MAHOUT_HOME/examples}}:
{noformat}
$ cd examples
$ ant -f build-deprecated.xml get-files
$ mkdir lib
$ mvn dependency:copy-dependencies -DoutputDirectory=lib
$ ant -f build-deprecated.xml extract-20news-18828 -Ddest=target
$ mv 20news-18828-collapse 20news-input
{noformat}

NOTE: After completing these steps, skip ahead to the Hadoop section to run the 20newsgroups example with the Mahout 0.1 release.

h3. For Mahout 0.2+:

# Change into the {{$MAHOUT_HOME/examples}} directory:
{noformat}$ cd examples{noformat}
# Download {{20news-18828.tar.gz}} from the [20newsgroups dataset|http://people.csail.mit.edu/jrennie/20Newsgroups/]
# Extract the dataset:
{noformat}$ tar zxf 20news-18828.tar.gz{noformat}
# Generate the input dataset:
{noformat}
$ mkdir 20news-input
$ mvn -e exec:java \
  -Dexec.mainClass=org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -Dexec.args="-p 20news-18828 \
    -o 20news-input \
    -a org.apache.lucene.analysis.standard.StandardAnalyzer \
    -c UTF-8"
{noformat}

h2. Running the example without Hadoop

This assumes you're running Mahout 0.2+:
# Train the classifier:
{noformat}
$ mvn -e exec:java \
  -Dexec.mainClass=org.apache.mahout.classifier.bayes.TrainClassifier \
  -Dexec.args="-i 20news-input \
    -o 20news-model \
    -type cbayes \
    -ng 1 \
    -source hdfs"
{noformat}
# Test over the input:
{noformat}
$ mvn -e exec:java \
  -Dexec.mainClass=org.apache.mahout.classifier.bayes.TestClassifier \
  -Dexec.args="-m 20news-model \
    -d 20news-input \
    -type cbayes \
    -ng 1 \
    -source hdfs \
    -method sequential"
{noformat}

h2. Running the 20newsgroups example over a Hadoop cluster

h3. Set Up Hadoop Cluster

# Edit {{hadoop-site.xml}}; add in local settings per [Hadoop quickstart|http://hadoop.apache.org/core/docs/current/quickstart.html]
{noformat}
emacs $HADOOP_HOME/conf/hadoop-site.xml
{noformat}
# Format the HDFS
{noformat}
$ $HADOOP_HOME/bin/hadoop namenode -format
{noformat}
# Start Hadoop
{noformat}
$ $HADOOP_HOME/bin/start-all.sh
{noformat}
# Copy the extracted text to HDFS
{noformat}
$ $HADOOP_HOME/bin/hadoop dfs -put $MAHOUT_HOME/examples/20news-input 20news-input
{noformat}

h3. Train Bayes Classifier Using Tri-grams

The following command runs 4 MapReduce jobs on Hadoop to train the classifier; expect it to take a while on a single-node machine.
{noformat}
$HADOOP_HOME/bin/hadoop \
  jar \
  $MAHOUT_HOME/examples/target/mahout-examples-0.1.job \
  org.apache.mahout.classifier.bayes.TrainClassifier \
  -i 20news-input \
  -o newsmodel \
  -ng 3 \
  -type bayes -source hdfs
{noformat}

You can monitor the status of these jobs by opening a web browser on your Job Tracker node: [http://localhost:50030/jobtracker.jsp]

Test the classifier over the input folder:
{noformat}
$HADOOP_HOME/bin/hadoop \
  jar \
  $MAHOUT_HOME/examples/target/mahout-examples-0.1.job \
  org.apache.mahout.classifier.bayes.TestClassifier \
  -p newsmodel \
  -t work/20news-input \
  -ng 3 \
  -type bayes
{noformat}

Output might look like the following (label, percent correct, correct/total):
{noformat}
08/11/07 16:52:39 INFO bayes.TestClassifier: Done loading model: # labels: 20
08/11/07 16:52:39 INFO bayes.TestClassifier: Done generating Model
08/11/07 16:52:57 INFO bayes.TestClassifier: alt.atheism  96.9962453066333  775/799.0
08/11/07 16:53:15 INFO bayes.TestClassifier: comp.graphics  99.28057553956835  966/973.0
08/11/07 16:53:45 INFO bayes.TestClassifier: comp.os.ms-windows.misc  96.95431472081218  955/985.0
08/11/07 16:53:59 INFO bayes.TestClassifier: comp.sys.ibm.pc.hardware  99.59266802443992  978/982.0
08/11/07 16:54:10 INFO bayes.TestClassifier: comp.sys.mac.hardware  99.47970863683663  956/961.0
08/11/07 16:54:28 INFO bayes.TestClassifier: comp.windows.x  99.59183673469387  976/980.0
08/11/07 16:54:38 INFO bayes.TestClassifier: misc.forsale  98.45679012345678  957/972.0
08/11/07 16:54:50 INFO bayes.TestClassifier: rec.autos  99.4949494949495  985/990.0
08/11/07 16:55:04 INFO bayes.TestClassifier: rec.motorcycles  100.0  994/994.0
08/11/07 16:55:16 INFO bayes.TestClassifier: rec.sport.baseball  99.89939637826961  993/994.0
08/11/07 16:55:36 INFO bayes.TestClassifier: rec.sport.hockey  99.89989989989989  998/999.0
08/11/07 16:55:54 INFO bayes.TestClassifier: sci.crypt  99.39455095862765  985/991.0
08/11/07 16:56:05 INFO bayes.TestClassifier: sci.electronics  98.98063200815494  971/981.0
08/11/07 16:56:27 INFO bayes.TestClassifier: sci.med  99.79797979797979  988/990.0
08/11/07 16:56:44 INFO bayes.TestClassifier: sci.space  99.3920972644377  981/987.0
08/11/07 16:57:06 INFO bayes.TestClassifier: soc.religion.christian  99.49849548645938  992/997.0
08/11/07 16:57:24 INFO bayes.TestClassifier: talk.politics.guns  99.45054945054945  905/910.0
08/11/07 16:57:51 INFO bayes.TestClassifier: talk.politics.mideast  98.82978723404256  929/940.0
08/11/07 16:58:13 INFO bayes.TestClassifier: talk.politics.misc  89.93548387096774  697/775.0
08/11/07 16:58:25 INFO bayes.TestClassifier: talk.religion.misc  61.78343949044586  388/628.0
08/11/07 16:58:25 INFO bayes.TestClassifier: =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :      18369    97.5621%
Incorrectly Classified Instances        :        459     2.4379%
Total Classified Instances              :      18828

=======================================================
Confusion Matrix
-------------------------------------------------------
   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t  <--Classified as
 994   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  |  994  a = rec.motorcycles
   0 976   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   2   1  |  980  b = comp.windows.x
   7   0 929   1   0   0   0   0   0   0   0   0   1   0   2   0   0   0   0   0  |  940  c = talk.politics.mideast
   0   0   0 905   0   0   1   0   0   0   0   0   0   0   0   0   3   0   1   0  |  910  d = talk.politics.guns
   4   1   4  27 388   1   0   1   0   5   1   1   2   2 149   7   2  33   0   0  |  628  e = talk.religion.misc
   3   0   0   0   0 985   0   1   0   0   0   0   0   1   0   0   0   0   0   0  |  990  f = rec.autos
   0   0   0   0   0   0 993   1   0   0   0   0   0   0   0   0   0   0   0   0  |  994  g = rec.sport.baseball
   0   0   0   0   0   0   1 998   0   0   0   0   0   0   0   0   0   0   0   0  |  999  h = rec.sport.hockey
   0   0   0   0   0   0   0   0 956   0   2   0   0   0   0   0   0   0   2   1  |  961  i = comp.sys.mac.hardware
   0   0   0   0   0   0   0   0   0 981   0   0   5   0   0   1   0   0   0   0  |  987  j = sci.space
   0   0   0   0   0   0   0   0   0   0 978   0   1   0   0   0   0   0   2   1  |  982  k = comp.sys.ibm.pc.hardware
   1   0   3  36   0   1   2   1   0   5   0 697   4   0   3   3  19   0   0   0  |  775  l = talk.politics.misc
   0   2   0   0   0   0   0   0   0   0   2   0 966   0   0   0   0   0   2   1  |  973  m = comp.graphics
   1   0   0   0   0   0   0   0   0   0   6   0   0 971   0   0   0   0   3   0  |  981  n = sci.electronics
   1   0   0   0   0   0   0   0   1   0   0   0   0   0 992   1   0   1   0   1  |  997  o = soc.religion.christian
   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0 988   0   0   0   1  |  990  p = sci.med
   0   0   0   2   0   0   0   0   0   0   0   0   2   1   0   0 985   0   1   0  |  991  q = sci.crypt
   0   0   0   1   1   0   0   0   0   1   0   0   1   0  19   0   1 775   0   0  |  799  r = alt.atheism
   1   0   0   0   0   3   1   2   0   0   3   0   0   5   0   0   0   0 957   0  |  972  s = misc.forsale
   0   0   0   8   0   0   0   0   0   0   6   0   6   0   0   0   0   0  10 955  |  985  t = comp.os.ms-windows.misc
{noformat}

h2. Complementary Naive Bayes

To train a CBayes classifier using bi-grams:
{noformat}
$HADOOP_HOME/bin/hadoop \
  jar \
  $MAHOUT_HOME/examples/target/mahout-examples-0.1.job \
  org.apache.mahout.classifier.bayes.TrainClassifier \
  -i 20news-input \
  -o 20news-model \
  -ng 2 \
  -type cbayes \
  -source
{noformat}

To test a CBayes classifier using bi-grams:
{noformat}
$HADOOP_HOME/bin/hadoop \
  jar \
  $MAHOUT_HOME/examples/target/mahout-examples-0.1.job \
  org.apache.mahout.classifier.bayes.TestClassifier \
  -p 20news-model \
  -t work/20news-input \
  -ng 2 \
  -type cbayes \
  -source \
  -method
{noformat}
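
As a quick cross-check of the sample tri-gram output above, the percentages that {{TestClassifier}} reports appear to be plain correct/total ratios. The sketch below recomputes two of them with awk. The {{accuracy}} function is a hypothetical helper for illustration only, not part of Mahout or Hadoop, and the counts are copied by hand from the sample output.

```shell
#!/bin/sh
# Hypothetical helper (not part of Mahout): print correct/total as a
# percentage with four decimal places, matching the Summary section's format.
accuracy() {
  awk -v correct="$1" -v total="$2" \
    'BEGIN { printf "%.4f%%\n", 100 * correct / total }'
}

# Counts copied from the sample output above.
accuracy 18369 18828   # overall: correctly classified / total classified
accuracy 388 628       # talk.religion.misc: diagonal entry / row total
```

The first call reproduces the overall figure from the Summary section, and the second reproduces the weakest class, talk.religion.misc, where 149 of the 628 documents were assigned to soc.religion.christian.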