mahout-user mailing list archives

From DAN HELM
Subject rowid conversion step to prepare input vectors for cvb clustering
Date Thu, 31 May 2012 23:18:32 GMT
I have a question about using rowid to convert sparse vectors (generated via seq2sparse) to
the form needed for cvb clustering (i.e., to change the Text key to an Integer).  Prior to
running this step I had 3 “part” files in my tf-vectors folder.  Running rowid on the
tf-vectors folder generates one “Matrix” file and one “docIndex” file.  As a result, when I
run cvb clustering on the folder containing “Matrix”, only a single mapper runs, on one
node.  For a large collection this takes an excessive amount of time.

I assume cvb should be able to run in a distributed fashion on multiple nodes using many
mappers/tasktrackers?  If so, is it incorrect to run rowid on the entire tf-vectors folder
rather than separately on each “part” file in tf-vectors?  That said, the output is named
“Matrix”, which suggests rowid is meant to generate a single file.

Any advice on running cvb using multiple mappers would be appreciated.  The following are
some pertinent lines from my test shell script to process Reuters data:
  $MAHOUT2 seq2sparse \
    -i ${WORK_DIR}/reuters-out-seqdir/ \
    -o ${WORK_DIR}/reuters-out-seqdir-sparse-cvb \
    -wt tf -seq -nr 3 --namedVector \
  && \
  $MAHOUT rowid \
    -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
    -o ${WORK_DIR}/sparse-vectors-cvb \
  && \
  $HADOOP fs -mv ${WORK_DIR}/sparse-vectors-cvb/docIndex ${WORK_DIR}/sparse-vectors-index-cvb \
  && \
  $MAHOUT cvb \
    -i ${WORK_DIR}/sparse-vectors-cvb \
    -o ${WORK_DIR}/reuters-cvb -k 150 -ow -x 10 \
    -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
    -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb
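
For context on the single-mapper behavior described above: Hadoop's file-based input formats generally launch about one map task per input split, so a single “Matrix” file smaller than the split size would be expected to yield one mapper. The sketch below only illustrates that arithmetic. The helper name estimate_map_tasks and the example sizes are hypothetical, and the real split computation has extra details that this ignores.

```shell
# Rough estimate of map tasks for one input file: ceil(file_bytes / split_bytes).
# estimate_map_tasks is a hypothetical helper, not part of Hadoop or Mahout.
estimate_map_tasks() {
  file_bytes=$1
  split_bytes=$2
  # An empty file contributes no splits in this simplified model.
  if [ "$file_bytes" -le 0 ]; then
    echo 0
    return
  fi
  # Integer ceiling division.
  echo $(( (file_bytes + split_bytes - 1) / split_bytes ))
}

# Assumed example sizes: a 20 MB matrix file with a 64 MB split size,
# then the same file with a 4 MB split size.
estimate_map_tasks $((20 * 1024 * 1024)) $((64 * 1024 * 1024))
estimate_map_tasks $((20 * 1024 * 1024)) $((4 * 1024 * 1024))
```

Under this model, getting more mappers from a single file means making the splits smaller than the file. In Hadoop releases of this era the mapred.max.split.size property is the usual knob for that; whether it takes effect for cvb depends on the input format the job uses, which is not confirmed here.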