From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: SSVD for dimensional reduction + Kmeans
Date Fri, 10 Aug 2012 18:07:28 GMT
It happens because of internal constraints stemming from blocking. It
happens when a split of A (the input) has fewer than (k+p) rows, at which
point the blocks are too small (or rather, too short) to successfully
perform a QR on.

This also means, among other things, k+p cannot be more than your
total number of rows in the input.
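
For a tiny test set, that bound can be written down as a small check. This is
only a sketch, assuming the whole input lands in a single split so that the
only constraint is k+p <= m; the class and method names (SsvdRankCheck, fits,
maxRank) are made up for illustration and are not part of Mahout.

```java
public class SsvdRankCheck {

    // Hypothetical helper: true when k + p fits within the m rows of the input.
    static boolean fits(int m, int k, int p) {
        return k >= 1 && p >= 0 && k + p <= m;
    }

    // Hypothetical helper: largest k that still satisfies k + p <= m (0 if none).
    static int maxRank(int m, int p) {
        return Math.max(0, m - p);
    }

    public static void main(String[] args) {
        int m = 20; // number of rows (docs) in the test input; made-up value
        int p = 15; // oversampling parameter
        System.out.println("largest k for m=" + m + ", p=" + p + ": " + maxRank(m, p));
        System.out.println("k=6 fits: " + fits(m, 6, p));
    }
}
```

Note that desiredRank = sampleData.size() - p - 1 gives k+p = m-1, which stays
within this bound.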

It is also possible that input A is way too wide, or k+p is way too big,
so that an arbitrary split does not fetch at least k+p rows of A, but
I haven't seen such cases in practice yet. If that
happens, there's an option to increase minSplitSize (which would
undermine MR mapper efficiency somewhat). But I am pretty sure it is
not your case.
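
A rough way to reason about the split condition is to estimate how many rows
of A end up in one split. The sketch below assumes you know a split size in
bytes and an average serialized row size; both numbers and the helper names
are assumptions for illustration only, and the actual split boundaries depend
on the input format and the Hadoop configuration.

```java
public class SplitRowEstimate {

    // Hypothetical estimate: number of rows of A that land in one split.
    static long rowsPerSplit(long splitBytes, long avgRowBytes) {
        if (avgRowBytes <= 0) {
            throw new IllegalArgumentException("avgRowBytes must be positive");
        }
        return splitBytes / avgRowBytes;
    }

    // True when the estimated rows per split reach at least k + p.
    static boolean splitTallEnough(long splitBytes, long avgRowBytes, int k, int p) {
        return rowsPerSplit(splitBytes, avgRowBytes) >= (long) k + p;
    }

    public static void main(String[] args) {
        long splitBytes = 64L * 1024 * 1024; // made-up 64 MB split
        long avgRowBytes = 1024L * 1024;     // made-up 1 MB per (very wide) row
        int k = 100;
        int p = 15;
        System.out.println("rows per split: " + rowsPerSplit(splitBytes, avgRowBytes));
        System.out.println("tall enough: " + splitTallEnough(splitBytes, avgRowBytes, k, p));
    }
}
```

If the estimate comes out below k+p, that is the case where a larger minimum
split size would be needed.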

But if your input is shorter than k+p, then it is a case too small for
SSVD. In fact, it probably means you can solve the test directly in memory
with any solver. You can still use SSVD with k=m and p=0 (I think) in
this case and get the equivalent of an exact (non-reduced rank) decomposition
with no stochastic effects, but that is not really what it is for.

Assuming your input is m x n, can you tell me please what your m, n, k
and p are?

thanks.
-D

On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
> There seems to be some internal constraint on k and/or p, which is making a test difficult.
> The test has a very small input doc set and choosing the wrong k it is very easy to get a
> failure with this message:
>
>         java.lang.IllegalArgumentException: new m can't be less than n
>                 at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>
> I have a working test but I had to add some docs to the test data and have tried to reverse
> engineer the value for k (desiredRank). I came up with the following but I think it is only
> an accident that it works.
>
>         int p = 15; // default value for CLI
>         int desiredRank = sampleData.size() - p - 1; // number of docs - p - 1, ?????? not sure why this works
>
> This seems likely to be an issue only because of the very small data set and the relationship
> of rows to columns to p to k. But for the purposes of creating a test if someone (Dmitriy?)
> could tell me how to calculate a reasonable p and k from the dimensions of the tiny data set
> it would help.
>
> This test is derived from a non-active SVD test but I'd be up for cleaning it up and
> including it as an example in the working but non-active tests. I also fixed a couple trivial
> bugs in the non-active Lanczos tests for what it's worth.
>
>
> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
> Reading "overview and usage" doc linked on that page
> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
> should help to clarify outputs and usage.
>
>
> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>>> Quoth Grant Ingersoll:
>>>> To put this in bin/mahout speak, this would look like, munging some names
>>>> and taking liberties with the actual argument to be passed in:
>>>>
>>>> bin/mahout svd (original -> svdOut)
>>>> bin/mahout cleansvd ...
>>>> bin/mahout transpose svdOut -> svdT
>>>> bin/mahout transpose original -> originalT
>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>> bin/mahout kmeans newMatrix
>>>
>>> I'm trying to create a test case from testKmeansDSVD2 to use SSVDSolver. Does
>>> SSVD require the EigenVerificationJob to clean the eigen vectors?
>>
>> No
>>
>>> if so where does SSVD put the equivalent of DistributedLanczosSolver.RAW_EIGENVECTORS?
>>> Seems like they should be in V* but SSVD creates V so should I transpose V* then run it through
>>> the EigenVerificationJob?
>> no
>>
>> SSVD is SVD, meaning it produces U and V with no further need to clean that
>>
>>> I get errors when I do so trying to figure out if I'm on the wrong track.
>
