mahout-user mailing list archives

From DAN HELM <>
Subject Problem running new LDA algorithm (cvb) against the Reuters data
Date Sat, 05 May 2012 03:54:22 GMT
I am attempting to run the new LDA algorithm, cvb (Mahout version 0.6), against the Reuters data.
I added another entry to the example script as follows:
elif [ "x$clustertype" == "xcvb" ]; then
  $MAHOUT seq2sparse \
    -i ${WORK_DIR}/reuters-out-seqdir/ \
    -o ${WORK_DIR}/reuters-out-seqdir-sparse-cvb \
    -wt tf -seq -nr 3 --namedVector \
  && \
  $MAHOUT cvb \
    -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
    -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
    -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
    -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb \
  && \
  $MAHOUT ldatopics \
    -i ${WORK_DIR}/reuters-cvb/state-2 \
    -d ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
    -dt sequencefile
I successfully ran the previous LDA algorithm against Reuters, but I am most interested in
this new implementation of LDA because I want its new feature that generates document-to-cluster
mappings (the -dt parameter).
When I run the above code in Hadoop pseudo-distributed mode, and also on a small cluster,
I receive the same error from the "mahout cvb" command. All the pre-clustering steps, including
sequence file and sparse vector generation, work fine, but when the cvb clustering is attempted,
the mappers fail with the following error in the Hadoop map task log:
java.lang.ClassCastException: cannot be cast to
 at org.apache.hadoop.mapred.MapTask.runNewMapper(
 at org.apache.hadoop.mapred.Child.main(
Any help with resolving the problem would be appreciated.