mahout-dev mailing list archives

From: Grant Ingersoll <gsing...@apache.org>
Subject: Re: Using SVD with Canopy/KMeans
Date: Sat, 11 Sep 2010 21:43:50 GMT
To put this in bin/mahout speak, it would look something like the following, munging some names and taking liberties with the actual arguments to be passed in:

bin/mahout svd (original -> svdOut)
bin/mahout cleansvd ...
bin/mahout transpose svdOut -> svdT
bin/mahout transpose original -> originalT
bin/mahout matrixmult originalT svdT -> newMatrix
bin/mahout kmeans newMatrix
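
In API terms, a rough sketch of the same pipeline (calls and arguments are lifted from Jeff's test below; package names are from memory, the cleansvd step would presumably be EigenVerificationJob and is only stubbed out as a comment, and the paths/sizes are placeholders):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.mahout.clustering.canopy.CanopyDriver;
  import org.apache.mahout.clustering.kmeans.KMeansDriver;
  import org.apache.mahout.common.distance.DistanceMeasure;
  import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
  import org.apache.mahout.math.hadoop.DistributedRowMatrix;
  import org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver;

  public final class SvdKMeansPipeline {
    // original, svdOut, tmp, output, numRows, numCols, desiredRank are all placeholders
    public static void run(Path original, Path svdOut, Path tmp, Path output,
                           int numRows, int numCols, int desiredRank) throws Exception {
      Configuration config = new Configuration();
      JobConf conf = new JobConf(config);
      DistanceMeasure measure = new EuclideanDistanceMeasure();

      // bin/mahout svd (original -> svdOut)
      DistributedLanczosSolver solver = new DistributedLanczosSolver();
      solver.setConf(config);
      solver.run(original, tmp, svdOut, numRows, numCols, false, desiredRank);

      // bin/mahout cleansvd ... (EigenVerificationJob would go here; omitted)

      // bin/mahout transpose svdOut -> svdT
      DistributedRowMatrix svd = new DistributedRowMatrix(svdOut, tmp, desiredRank - 1, numCols);
      svd.configure(conf);
      DistributedRowMatrix svdT = svd.transpose();

      // bin/mahout transpose original -> originalT
      DistributedRowMatrix a = new DistributedRowMatrix(original, tmp, numRows, numCols);
      a.configure(conf);
      DistributedRowMatrix originalT = a.transpose();

      // bin/mahout matrixmult originalT svdT -> newMatrix
      // (note: per Jeff below, times() as it stands computes thisTranspose * other)
      DistributedRowMatrix newMatrix = originalT.times(svdT);
      newMatrix.configure(conf);

      // bin/mahout kmeans newMatrix (Canopy primes the initial clusters, as in the test)
      CanopyDriver.runJob(newMatrix.getRowPath(), output, measure, 8, 4, false, false);
      KMeansDriver.runJob(newMatrix.getRowPath(), new Path(output, "clusters-0"), output,
          measure, 0.001, 10, 1, true, false);
    }
  }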

Is that about right?


On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:

> Ok, the transposed computation seems to work and the cast exception was caused by my
> unit test writing LongWritable keys to the testdata file. The following test produces
> a comparable answer to the non-distributed case. I still want to rename the method to
> transposeTimes for clarity. And better, implement timesTranspose to make this particular
> computation more efficient:
> 
>  public void testKmeansDSVD() throws Exception {
>    DistanceMeasure measure = new EuclideanDistanceMeasure();
>    Path output = getTestTempDirPath("output");
>    Path tmp = getTestTempDirPath("tmp");
>    Path eigenvectors = new Path(output, "eigenvectors");
>    int desiredRank = 13;
>    DistributedLanczosSolver solver = new DistributedLanczosSolver();
>    Configuration config = new Configuration();
>    solver.setConf(config);
>    Path testData = getTestTempDirPath("testdata");
>    int sampleDimension = sampleData.get(0).get().size();
>    solver.run(testData, tmp, eigenvectors, sampleData.size(), sampleDimension, false, desiredRank);
> 
>    // now multiply the testdata matrix and the eigenvector matrix
>    DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors, tmp, desiredRank - 1, sampleDimension);
>    JobConf conf = new JobConf(config);
>    svdT.configure(conf);
>    DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp, sampleData.size(), sampleDimension);
>    a.configure(conf);
>    DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
>    sData.configure(conf);
> 
>    // now run the Canopy job to prime kMeans canopies
>    CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false, false);
>    // now run the KMeans job
>    KMeansDriver.runJob(sData.getRowPath(), new Path(output, "clusters-0"), output, measure, 0.001, 10, 1, true, false);
>    // run ClusterDumper
>    ClusterDumper clusterDumper = new ClusterDumper(new Path(output, "clusters-2"), new Path(output, "clusteredPoints"));
>    clusterDumper.printClusters(termDictionary);
>  }
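
To restate the interesting line in plain matrix terms: with X the n x d input rows and V the (desiredRank - 1) x d eigenvector rows, sData is the projection X * V^T, one reduced-dimension row per input row. Since times() as it stands computes thisTranspose * other (per your notes below), the call has to be spelled a.transpose().times(svdT.transpose()); a timesTranspose() would express X * V^T directly. A throwaway in-memory sketch of the shapes, illustrative only and not the distributed path:

  import org.apache.mahout.math.DenseMatrix;
  import org.apache.mahout.math.Matrix;

  Matrix x = new DenseMatrix(new double[][] {{1, 2, 3}, {4, 5, 6}});  // 2 x 3 input rows
  Matrix v = new DenseMatrix(new double[][] {{1, 0, 0}, {0, 1, 0}});  // 2 x 3 eigenvector rows
  Matrix projected = x.times(v.transpose());                          // 2 x 2: each input row projected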
> 
> On 9/3/10 7:54 AM, Jeff Eastman wrote:
>> Looking at the single unit test of DRM.times() it seems to be implementing Matrix
>> expected = inputA.transpose().times(inputB), and not inputA.times(inputB.transpose()),
>> so the bounds checking is correct as implemented. But the method still has the wrong
>> name and AFAICT is not useful for performing this particular computation. Should I use
>> this instead?
>> 
>> DistributedRowMatrix sData = a.transpose().t[ransposeT]imes(svdT.transpose())
>> 
>> ugh! And it still fails with:
>> 
>> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
>>    at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
>>    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
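
For completeness, since the fix is already noted above: the cast comes from TransposeJob's mapper expecting IntWritable row ids, so the testdata sequence file needs to be keyed by row index rather than LongWritable keys. A minimal sketch of writing such a file (paths and vector values are placeholders):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.mahout.math.DenseVector;
  import org.apache.mahout.math.VectorWritable;

  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  Path testData = new Path("testdata/part-00000");   // placeholder path

  SequenceFile.Writer writer =
      SequenceFile.createWriter(fs, conf, testData, IntWritable.class, VectorWritable.class);
  try {
    double[][] rows = {{1, 2, 3}, {4, 5, 6}};        // placeholder sample rows
    for (int i = 0; i < rows.length; i++) {
      writer.append(new IntWritable(i), new VectorWritable(new DenseVector(rows[i])));
    }
  } finally {
    writer.close();
  }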

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8

