mahout-user mailing list archives

From Benjamin Busjaeger <busjae...@googlemail.com>
Subject clusterdump lucene document ID
Date Fri, 11 May 2012 07:30:36 GMT
I am trying to cluster documents stored in a Lucene index using the 
Mahout command-line tools. How can I obtain the original document IDs 
from the clustering output?


Here is the sequence of commands I am using:

./mahout lucene.vector --dir $index_path --output /tmp/mahout/vector \
  --field content --dictOut /tmp/mahout/dict --idField _uid -md 2 -w TFIDF \
  -x 70

./mahout canopy -i /tmp/mahout/vector -o /tmp/mahout_canopy \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure --t1 10 --t2 5

./mahout kmeans -i /tmp/mahout/vector \
  -c /tmp/mahout_canopy/clusters-0-final/part-r-00000 -o /tmp/mahout_kmeans \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure -k 20 -x 20 \
  -cd 0.1

./mahout clusterdump -dt text -d /tmp/mahout/dict \
  -s /tmp/mahout_kmeans/clusters-1-final/ -b 20 -n 20


A similar question was asked in an earlier thread [1], but I did not see 
a resolution there. Thanks in advance for your help!

- Ben


[1] 
http://mail-archives.apache.org/mod_mbox/mahout-user/201204.mbox/%3CCA+y9ocWgS2se7dOqQrsE3p+QE5GVXCt8XUTucFdZvGkJkPOaew@mail.gmail.com%3E


