Hi Jeff, Jake, Grant,
Thanks for the replies and for the code. It was the matrix
multiplication step that I wasn't seeing, as I wasn't really sure what
output was being produced by the SVD solver. The code is a big help;
it's all clear to me now.
Thanks again,
Derek
On 03/09/10 06:37, Jeff Eastman wrote:
> Here's a new test method for TestClusterDumper that I think does what
> Jake describes below. I have another one that uses
> DistributedRowMatrix but I'm still debugging it. I can commit this or
> both if folks find it useful:
>
> public void testKmeansSVD() throws Exception {
>   DistanceMeasure measure = new EuclideanDistanceMeasure();
>   Path output = getTestTempDirPath("output");
>   Path tmp = getTestTempDirPath("tmp");
>   Path eigenvectors = new Path(output, "eigenvectors");
>   int desiredRank = 15;
>   DistributedLanczosSolver solver = new DistributedLanczosSolver();
>   Configuration conf = new Configuration();
>   solver.setConf(conf);
>   Path testData = getTestTempDirPath("testdata");
>   int sampleDimension = sampleData.get(0).get().size();
>   solver.run(testData, tmp, eigenvectors, sampleData.size(),
>       sampleDimension, false, desiredRank);
>   // build in-memory data matrix A
>   Matrix a = new DenseMatrix(sampleData.size(), sampleDimension);
>   int i = 0;
>   for (VectorWritable vw : sampleData) {
>     a.assignRow(i++, vw.get());
>   }
>   // extract the eigenvectors into P; P must have sampleDimension rows
>   // so that A (n x sampleDimension) times P is conformable
>   Matrix p = new DenseMatrix(sampleDimension, desiredRank - 1);
>   FileSystem fs = FileSystem.get(eigenvectors.toUri(), conf);
>   SequenceFile.Reader reader = new SequenceFile.Reader(fs,
>       eigenvectors, conf);
>   try {
>     Writable key = (Writable) reader.getKeyClass().newInstance();
>     Writable value = (Writable) reader.getValueClass().newInstance();
>     i = 0;
>     while (reader.next(key, value)) {
>       VectorWritable vw = (VectorWritable) value;
>       NamedVector v = (NamedVector) vw.get();
>       p.assignColumn(i, v);
>       System.out.println("k=" + key.toString() + " V="
>           + AbstractCluster.formatVector(v, termDictionary));
>       value = (Writable) reader.getValueClass().newInstance();
>       i++;
>     }
>   } finally {
>     reader.close();
>   }
>   // sData = A P
>   Matrix sData = a.times(p);
>
>   // now write sData back to the file system so clustering can run
>   // against it
>   Path svdData = new Path(output, "svddata");
>   SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
>       svdData, IntWritable.class, VectorWritable.class);
>   try {
>     IntWritable key = new IntWritable();
>     VectorWritable value = new VectorWritable();
>     for (int row = 0; row < sData.numRows(); row++) {
>       key.set(row);
>       value.set(sData.getRow(row));
>       writer.append(key, value);
>     }
>   } finally {
>     writer.close();
>   }
>   // now run the Canopy job to prime the kMeans canopies
>   CanopyDriver.runJob(svdData, output, measure, 8, 4, false, false);
>   // now run the KMeans job
>   KMeansDriver.runJob(svdData, new Path(output, "clusters0"), output,
>       measure, 0.001, 10, 1, true, false);
>   // run the ClusterDumper
>   ClusterDumper clusterDumper = new ClusterDumper(
>       new Path(output, "clusters2"), new Path(output, "clusteredPoints"));
>   clusterDumper.printClusters(termDictionary);
> }
>
> On 9/2/10 10:50 AM, Jake Mannix wrote:
>> Derek,
>>
>> The step Jeff's referring to is that the SVD output is a set of
>> vectors in
>> the "column space" of your original set of rows (your input matrix).
>> If you
>> want to cluster your original data, projected onto this new SVD
>> basis, you
>> need to multiply your original data matrix by your SVD basis matrix.
>> Depending on
>> how big your data is (number of rows and columns and rank of the
>> reduction),
>> you can do this in either one or two mapreduce passes.
>>
>> If you need more detail, I can spell that out a little more
>> directly. It
>> should actually be not just explained in words, but coded into the
>> examples,
>> now that I think of it... need. more. hours. in. day....
>>
>> jake
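
The projection step Jake describes (and that Jeff's test performs with
`Matrix sData = a.times(p)`) can be sketched with plain arrays. This is
only an illustrative stand-in, not Mahout code: `A` (n x m) holds one data
point per row, the columns of `P` (m x k) are the SVD basis vectors, and
`A * P` gives each point's k-dimensional reduced representation. The class
and method names here are made up for the example.

```java
// Minimal sketch of projecting data rows onto an SVD basis via A * P.
// Plain double[][] arrays stand in for Mahout's DenseMatrix.
public class SvdProjectionSketch {

  // naive matrix product: (n x m) times (m x k) -> (n x k)
  public static double[][] times(double[][] a, double[][] p) {
    int n = a.length, m = p.length, k = p[0].length;
    double[][] out = new double[n][k];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < k; j++) {
        for (int x = 0; x < m; x++) {
          out[i][j] += a[i][x] * p[x][j];
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // two data points in the original 3-dimensional space
    double[][] a = {{1, 0, 2}, {0, 3, 1}};
    // a rank-2 basis: each column of P is one basis vector
    double[][] p = {{1, 0}, {0, 1}, {1, 1}};
    // sData rows are the points expressed in the reduced 2-d basis
    double[][] sData = times(a, p);
    System.out.println(sData[0][0] + " " + sData[0][1]); // 3.0 2.0
    System.out.println(sData[1][0] + " " + sData[1][1]); // 1.0 4.0
  }
}
```

Each row of `sData` is then what gets written back out and clustered, in
place of the original high-dimensional row.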
