mahout-user mailing list archives

From Derek O'Callaghan <derek.ocallag...@ucd.ie>
Subject Re: Using SVD with Canopy/KMeans
Date Fri, 03 Sep 2010 09:20:12 GMT
Hi Jeff, Jake, Grant,

Thanks for the replies and for the code. It was the matrix multiplication step that I wasn't seeing, as I wasn't really sure what output was being produced by the SVD solver. The code is a big help; it's clear to me now.

Thanks again,

Derek

On 03/09/10 06:37, Jeff Eastman wrote:
> Here's a new test method for TestClusterDumper that I think does what Jake describes below. I have another one that uses DistributedRowMatrix, but I'm still debugging it. I can commit this, or both, if folks find it useful:
>
>   public void testKmeansSVD() throws Exception {
>     DistanceMeasure measure = new EuclideanDistanceMeasure();
>     Path output = getTestTempDirPath("output");
>     Path tmp = getTestTempDirPath("tmp");
>     Path eigenvectors = new Path(output, "eigenvectors");
>     int desiredRank = 15;
>     DistributedLanczosSolver solver = new DistributedLanczosSolver();
>     Configuration conf = new Configuration();
>     solver.setConf(conf);
>     Path testData = getTestTempDirPath("testdata");
>     int sampleDimension = sampleData.get(0).get().size();
>     solver.run(testData, tmp, eigenvectors, sampleData.size(), sampleDimension, false, desiredRank);
>     // build in-memory data matrix A
>     Matrix a = new DenseMatrix(sampleData.size(), sampleDimension);
>     int i = 0;
>     for (VectorWritable vw : sampleData) {
>       a.assignRow(i++, vw.get());
>     }
>     // extract the eigenvectors into P; P needs as many rows as A has
>     // columns (39 == sampleDimension for this test data)
>     Matrix p = new DenseMatrix(39, desiredRank - 1);
>     FileSystem fs = FileSystem.get(eigenvectors.toUri(), conf);
>     SequenceFile.Reader reader = new SequenceFile.Reader(fs, eigenvectors, conf);
>     try {
>       Writable key = (Writable) reader.getKeyClass().newInstance();
>       Writable value = (Writable) reader.getValueClass().newInstance();
>       i = 0;
>       while (reader.next(key, value)) {
>         VectorWritable vw = (VectorWritable) value;
>         NamedVector v = (NamedVector) vw.get();
>         p.assignColumn(i, v);
>         System.out.println("k=" + key.toString() + " V=" + AbstractCluster.formatVector(v, termDictionary));
>         value = (Writable) reader.getValueClass().newInstance();
>         i++;
>       }
>     } finally {
>       reader.close();
>     }
>     // sData = A P
>     Matrix sData = a.times(p);
>
>     // now write sData back to the file system so clustering can run against it
>     Path svdData = new Path(output, "svddata");
>     SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, svdData, IntWritable.class, VectorWritable.class);
>     try {
>       IntWritable key = new IntWritable();
>       VectorWritable value = new VectorWritable();
>       for (int row = 0; row < sData.numRows(); row++) {
>         key.set(row);
>         value.set(sData.getRow(row));
>         writer.append(key, value);
>       }
>     } finally {
>       writer.close();
>     }
>     // now run the Canopy job to prime the kMeans canopies
>     CanopyDriver.runJob(svdData, output, measure, 8, 4, false, false);
>     // now run the KMeans job
>     KMeansDriver.runJob(svdData, new Path(output, "clusters-0"), output, measure, 0.001, 10, 1, true, false);
>     // run the ClusterDumper
>     ClusterDumper clusterDumper = new ClusterDumper(new Path(output, "clusters-2"), new Path(output, "clusteredPoints"));
>     clusterDumper.printClusters(termDictionary);
>   }
>
>
>
> On 9/2/10 10:50 AM, Jake Mannix wrote:
>> Derek,
>>
>>    The step Jeff's referring to is that the SVD output is a set of vectors
>> in the "column space" of your original set of rows (your input matrix). If
>> you want to cluster your original data, projected onto this new SVD basis,
>> you need to matrix-multiply your original data by your SVD matrix.
>> Depending on how big your data is (number of rows and columns, and rank of
>> the reduction), you can do this in either one or two map-reduce passes.
>>
>>    If you need more detail, I can spell that out a little more directly. It
>> should actually be not just explained in words but coded into the examples,
>> now that I think of it... need. more. hours. in. day....
>>
>>    -jake
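
The projection Jake describes (and which Jeff's test computes as `sData = a.times(p)`) is just an ordinary matrix multiply: each row of the original n x d data matrix A is multiplied by the d x k matrix P of SVD basis vectors, yielding a k-dimensional reduced row. A minimal, self-contained sketch with plain arrays and hypothetical toy values (not the Mahout API):

```java
public class SvdProjection {

  // Multiply A (n x d) by P (d x k) to get the projected data sData (n x k).
  static double[][] project(double[][] a, double[][] p) {
    int n = a.length, d = p.length, k = p[0].length;
    double[][] out = new double[n][k];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < k; j++) {
        for (int m = 0; m < d; m++) {
          out[i][j] += a[i][m] * p[m][j];   // dot product of row i of A with column j of P
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Toy data: 2 documents in a 3-term space, projected onto a 2-column basis.
    double[][] a = {{1, 0, 2},
                    {0, 3, 1}};
    double[][] p = {{1, 0},
                    {0, 1},
                    {1, 1}};
    double[][] s = project(a, p);
    System.out.println(s[0][0] + " " + s[0][1]); // row 0 in the reduced basis
    System.out.println(s[1][0] + " " + s[1][1]); // row 1 in the reduced basis
  }
}
```

Each output row can then be clustered in place of the original high-dimensional row, which is why the test writes sData back out before running Canopy and KMeans.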
