mahout-dev mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Using SVD with Canopy/KMeans
Date Mon, 20 Sep 2010 14:22:19 GMT
Hi Derek,

I think this happens because the SVD output seems to emit only
desiredRank-1 eigenvectors into the rawEigenvectors directory. When
that is transposed, it would yield a p matrix with zero entries in the
last column, which is what you have observed. The code doing this is in
DistributedLanczosSolver.serializeOutput(), and the line responsible is:

     for (int i = 0; i < eigenVectors.numRows() - 1; i++) {

I thought that was curious, but I don't understand Lanczos well enough yet
to be too critical. Perhaps you could try removing the -1 and see if it
improves your results.
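
To illustrate the suspicion above, here is a minimal sketch in plain Java. It does not use Mahout's matrix classes or reproduce serializeOutput(); the class RankLoopSketch and the helpers emitRows() and transpose() are hypothetical names made up for this illustration. It only shows how a loop bounded by numRows() - 1 would leave the last eigenvector unwritten, and how that unwritten row would become an all-zero last column after transposition, assuming the consumer allocates space for all desiredRank vectors.

```java
import java.util.Arrays;

// Hypothetical illustration only; not Mahout code.
final class RankLoopSketch {

    // Copies the first rowsToEmit rows into a matrix sized for ALL rows.
    // Rows that are never written keep Java's default value of 0.0.
    static double[][] emitRows(double[][] eigenVectors, int rowsToEmit) {
        int cols = eigenVectors[0].length;
        double[][] out = new double[eigenVectors.length][cols];
        for (int i = 0; i < rowsToEmit; i++) {
            out[i] = eigenVectors[i].clone();
        }
        return out;
    }

    // Plain transpose: rows (eigenvectors) become columns of the result.
    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int r = 0; r < m.length; r++) {
            for (int c = 0; c < m[0].length; c++) {
                t[c][r] = m[r][c];
            }
        }
        return t;
    }

    public static void main(String[] args) {
        // Three made-up "eigenvectors" of dimension two.
        double[][] eigenVectors = { {0.1, 0.2}, {0.3, 0.4}, {0.5, 0.6} };

        // Loop bound mirrors "i < numRows() - 1": the last vector is skipped.
        double[][] pShort = transpose(emitRows(eigenVectors, eigenVectors.length - 1));
        // Loop bound without the -1: every vector is emitted.
        double[][] pFull = transpose(emitRows(eigenVectors, eigenVectors.length));

        System.out.println("with -1:    " + Arrays.deepToString(pShort));
        System.out.println("without -1: " + Arrays.deepToString(pFull));
    }
}
```

With the shortened loop, the last column of the transposed matrix holds only zeros; with the full loop it holds the values of the last vector. Whether this is really what happens inside the solver and the clean step still needs to be confirmed against the actual code.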


On 9/18/10 9:58 AM, Derek O'Callaghan wrote:
> Hi Jeff,
>
> I've been trying out the latest version of the svd code in TestClusterDumper this week
> (actually I'm using my modified version of it as I mentioned in my original post at the start
> of the thread, with your latest changes). I suspect there's a problem with the EigenVerificationJob
> called from the svd solver. Looking at TestClusterDumper.testKmeansSVD(), using:
>
> solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false, desiredRank,
>     0.5, 0.0, true);
>
> The generated 'p' matrix (read from the clean eigenvectors file) will always have the
> value 0 for the (desiredRank - 1) column in each row. E.g., here's the first row:
>
> [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932, 0.0018666209551644673,
> 0.4313115409222268, 7.672659010256923E-4, -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5,
> -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4, -0.0025483366872868546,
> 0.0]
>
> This then means that the sData matrix will have 0s in this column following multiplication.
> However, when I change testKmeansSVD() to run the solver without the clean step, and load
> the raw eigenvectors into 'p' i.e.
>
> solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false, desiredRank);
>
> 'p' now has values other than 0 in the last column, e.g. here's the first row:
>
> [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932, 0.0018666209551644673,
> 0.4313115409222268, 7.672659010256923E-4, -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5,
> -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4, -0.0025483366872868546,
> -0.04870849090364153]
>
> I'm guessing there's a problem with the clean step here, or is this normal behaviour?
>
> FYI I noticed the problem when running the solver + clean on my own data, and then running
> the Dirichlet clusterer on the reduced data. I found that after a couple of iterations, things
> started to go wrong with Dirichlet as the following code in UncommonDistribution.rMultinom()
> was being called:
>
>      // can't happen except for round-off error so we don't care what we return here
>      return 0;
>
> I suspect this might be associated with the fact that the last column in my reduced data
> matrix is 0, although I haven't confirmed it yet.
>
> Thanks,
>
> Derek

