mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: kmeans vectors
Date Wed, 29 Sep 2010 19:29:46 GMT
  Hi Matt,

From your command arguments, it looks like you are running 0.3. Because
Mahout is changing so quickly, we recommend that you check out trunk and
use that instead. With a little tweaking (I added --charset ASCII to
seqdirectory) I was able to get as far as you did on trunk, but
seq2sparse is not what you want to use.

The utilities you are using are intended for text preprocessing: they
word-count documents into term-vector sequence files, then run TF and/or
TF-IDF processing on the results to produce VectorWritable sequence
files suitable for clustering. Your data is already numeric, so I
suggest you instead look at the Synthetic Control clustering examples,
starting with Canopy. These use an InputDriver to process text files
containing space-delimited numbers, like your data.dat file, and produce
the VectorWritable sequence files directly.
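
To illustrate the idea only, here is a minimal Python sketch of turning
space-delimited lines into numeric vectors. It is not Mahout's
InputDriver code; the function name parse_vectors is hypothetical, and
the sketch stops at plain lists rather than writing VectorWritable
sequence files.

```python
def parse_vectors(text):
    """Parse whitespace-delimited numeric lines into lists of floats.

    Illustration only: a hypothetical helper, not part of Mahout.
    """
    vectors = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        vectors.append([float(field) for field in fields])
    return vectors


# Same contents as the data.dat file quoted below.
sample = "22 21\n19 20\n18 22\n1 3\n3 2\n"
vectors = parse_vectors(sample)
for vector in vectors:
    print(vector)
```

Each input line becomes one two-dimensional point, which is the shape
of data the clustering jobs expect, as opposed to the single text
document that seq2sparse sees.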

I was able to run this on your data using trunk and it produced 3 
clusters. You should be able to run the other synthetic control jobs on 
it too:

CommandLine:
./bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job \
-i data \
-o output \
-t1 3 \
-t2 2 \
-ow \
-dm org.apache.mahout.common.distance.EuclideanDistanceMeasure

Clusters output:
C-0{n=1 c=[22.000, 21.000] r=[0.000, 0.000]}
     Weight:  Point:
     1.0: [22.000, 21.000]
C-1{n=2 c=[18.250, 21.500] r=[0.250, 0.500]}
     Weight:  Point:
     1.0: [19.000, 20.000]
     1.0: [18.000, 22.000]
C-2{n=2 c=[2.500, 2.250] r=[0.500, 0.250]}
     Weight:  Point:
     1.0: [1.000, 3.000]
     1.0: [3.000, 2.000]
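
To get a feel for how -t1 3 and -t2 2 relate to this data, the
following Python sketch computes the pairwise Euclidean distances and
reports where each falls relative to the two thresholds (T1 being the
looser one, T2 the tighter one). It is plain Python for illustration,
with hypothetical helper names; it does not reproduce Mahout's canopy
processing or its center calculation, so it makes no attempt to
regenerate the clusters above.

```python
import math
from itertools import combinations

# The five points from data.dat and the thresholds from the command line.
points = [(22, 21), (19, 20), (18, 22), (1, 3), (3, 2)]
T1, T2 = 3.0, 2.0


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def band(distance):
    """Label a distance relative to the T1 (loose) and T2 (tight) thresholds."""
    if distance < T2:
        return "within T2"
    if distance < T1:
        return "within T1"
    return "beyond T1"


pairs = {(a, b): euclidean(a, b) for a, b in combinations(points, 2)}
for (a, b), distance in pairs.items():
    print(f"{a} - {b}: {distance:.3f} ({band(distance)})")
```

For example, the distance between (19, 20) and (18, 22) is sqrt(5),
about 2.236, which lies between T2 and T1, while (22, 21) is more than
T1 away from every other point.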


Good hunting,
Jeff

On 9/29/10 2:26 PM, Matt Tanquary wrote:
> I was able to run the tutorials, etc. Now I would like to generate my
> own small test.
>
> I have created a data.dat file and put these contents:
> 22 21
> 19 20
> 18 22
> 1 3
> 3 2
>
> Then I ran: mahout seqdirectory -i ~/data/kmeans/data.dat -o kmeans/seqdir
>
> This created kmeans/seqdir/chunk-o in my dfs with the following content:
> ¼/%
>          /data.dat22 21
> 19 20
> 18 22
> 1 3
> 3 2
>
> Next I ran:  mahout seq2sparse -i kmeans/seqdir -o kmeans/input
>
> This generated several things in kmeans/input including the
> 'tfidf/vectors' folder. Inside the vectors folder I get: part-00000
> which contains:
> øÏân
>          /data.dat7org.apache.mahout.math.RandomAccessSparseVectorWritable
>       /data.dat@@
>
> It does not seem to have the numeric data at this point.
>
> I am hoping someone can shed some light on how I can get my datapoint
> file into the proper vector format for running mahout kmeans.
>
> Just fyi, when I run kmeans against that file (mahout kmeans -i
> kmeans/input/tfidf/vectors -c kmeans/clusters -o kmeans/output -k 2
> -w) I get:
>
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index:
> 1, Size: 1
>          at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>
> which tells me it was unable to find even 1 vector in the given input folder.
>
> Thanks for any comments you provide.
> -M@

