mahout-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Problem with CVB
Date Mon, 27 Jan 2014 18:43:33 GMT
I am forwarding this to the list for Peyman.


-----------------------------------------------------------------

I am trying to run CVB (Mahout 0.8) on a directory of plain text files,
following the procedure outlined below. However, I cannot get usable output
from vectordump (step 6). Without the "-c csv" flag, the generated file is
empty. With "-c csv", the generated file starts with a series of numbers
followed by an alphabetically ordered series of unigrams (see below):


#1,10,1163,12,121,13,14,141,1462,15,16,17,185,1901,197,2,201,2227,23,283,298,3,331,35,4,402,4351,445,5,57,58,6,68,7,9,987,a.m,ab,abc,abercrombie,abercrombies,ability
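
One way to characterise that line is to check whether its comma-separated tokens are already in plain byte order. The sketch below does only that; the variable names are illustrative and not part of the original procedure. If the check passes, the line reads like a single sorted term list (digits sort before letters in ASCII), which would be consistent with a listing of dictionary terms rather than vector values. That interpretation is an assumption and is not confirmed anywhere in this message.

```shell
# Line as it appears in the csv dump above.
header='#1,10,1163,12,121,13,14,141,1462,15,16,17,185,1901,197,2,201,2227,23,283,298,3,331,35,4,402,4351,445,5,57,58,6,68,7,9,987,a.m,ab,abc,abercrombie,abercrombies,ability'

# Drop the leading '#' and put one token per line.
tokens=$(printf '%s\n' "${header#\#}" | tr ',' '\n')

# sort -c exits 0 only if its input is already sorted;
# LC_ALL=C forces byte-wise comparison.
if printf '%s\n' "$tokens" | LC_ALL=C sort -c 2>/dev/null; then
    result="sorted"
else
    result="not sorted"
fi
echo "tokens are $result in byte order"
```

The same check can be pointed at the first line of the real dump file by reading it with `head -n 1` instead of using the pasted string.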

Can someone point out what I am doing wrong?

Thank you.



0: Set Paths

    > export HDFS_PATH=/path/to/hdfs/
    > export LOCAL_PATH=/path/to/localfs


1: Put docs in HDFS using hadoop fs -put <localsrc> ... <dst>

    > hadoop fs -put $LOCAL_PATH/test $HDFS_PATH/rawdata

2: Generate sequence files (of Text) from a directory

    > mahout seqdirectory \
    -i $HDFS_PATH/rawdata \
    -o $HDFS_PATH/sequenced \
    -c UTF-8 -chunk 5

3: Generate sparse vectors from Text sequence files

    > mahout seq2sparse \
    -i $HDFS_PATH/sequenced \
    -o $HDFS_PATH/sparseVectors \
    -ow --maxDFPercent 85 --namedVector --weight tf


4: rowid: Map SequenceFile<Text,VectorWritable> to
{SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}

    > mahout rowid \
    -i $HDFS_PATH/sparseVectors/tfidf-vectors \
    -o $HDFS_PATH/matrix

5: Run cvb

    > mahout cvb \
    -i $HDFS_PATH/matrix/matrix \
    -o $HDFS_PATH/test-lda \
    -k 100 -ow -x 40 \
    -dict $HDFS_PATH/sparseVectors/dictionary.file-0 \
    -dt $HDFS_PATH/test-lda-topics \
    -mt $HDFS_PATH/test-lda-model

6: Dump vectors from a sequence file to text

    > mahout vectordump \
    -i $HDFS_PATH/test-lda-topics/part-m-00000 \
    -o $LOCAL_PATH/vectordump \
    -vs 10 -p true \
    -d $HDFS_PATH/sparseVectors/dictionary.file-0 \
    -dt sequencefile \
    -sort $HDFS_PATH/test-lda-topics/part-m-00000 \
    -c csv
    ;  cat $LOCAL_PATH/vectordump
