From Varnit Khanna
Subject Issues with running Mahout LDA over the Reuters data set (Mahout in Action)
Date Wed, 09 Nov 2011 17:48:38 GMT
I am trying to run Mahout LDA over the Reuters data set as described
in Mahout in Action; however, I always get only one topic returned.
I am running Mahout 0.5, and here are my steps:

$ mvn -e -q exec:java \
    -Dexec.args="reuters/ reuters-extracted/"

Next, I had to copy the output directory (reuters-extracted) into HDFS,
a step that wasn't mentioned in the book.

$ hadoop dfs -put reuters-extracted/* reuters/

$ bin/mahout seqdirectory -c UTF-8 -i reuters/ -o reuters-seqfiles
$ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow
$ bin/mahout lda -i reuters-vectors/tf-vectors -o reuters-lda-sparse \
    -k 10 -v 70000 -x 20

$ bin/mahout org.apache.mahout.clustering.lda.LDAPrintTopics \
    -i reuters-lda-sparse/state-20/ -d reuters-vectors/dictionary.file-* \
    -dt sequencefile -w 5
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop
No HADOOP_CONF_DIR set, using /usr/lib/hadoop/src/conf
11/11/09 17:43:55 WARN driver.MahoutDriver: No
org.apache.mahout.clustering.lda.LDAPrintTopics.props found on
classpath, will use command-line arguments only
Topic 0
pct [p(pct|topic_0) = 0.04985000259283585
from [p(from|topic_0) = 0.04332905057607894
said [p(said|topic_0) = 0.03736886059106963
1986 [p(1986|topic_0) = 0.015418741367019371
dlrs [p(dlrs|topic_0) = 0.014674464223644563
11/11/09 17:44:01 INFO driver.MahoutDriver: Program took 6337 ms
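
To rule out a misreading of the console, the "Topic N" header lines in the
printed output can be counted. This is only a minimal sketch: it assumes the
LDAPrintTopics output above was redirected to a file, and both the file name
lda-topics.txt and the helper name count_topics are hypothetical.

```shell
# Count the "Topic N" header lines in saved LDAPrintTopics console output.
# $1 is the path to the saved output (a hypothetical file, not a Mahout artifact).
count_topics() {
  grep -c '^Topic [0-9][0-9]*$' "$1"
}

# Usage, assuming the output was saved beforehand:
# count_topics lda-topics.txt
```

For the output shown above this reports a single header, even though the lda
step was run with -k 10.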

Any suggestions?


