mahout-user mailing list archives

From Pat Ferrel <>
Subject Re: SSVD for dimensional reduction + Kmeans
Date Fri, 10 Aug 2012 18:57:45 GMT
Got it. Well, on to some real and much larger data sets then…

On Aug 10, 2012, at 11:53 AM, Dmitriy Lyubimov <> wrote:

I think Mahout's Lanczos actually requires external knowledge of the
input size too, in part for similar reasons. SSVD doesn't, because it
has no "other" reason to know the input size. The fundamental
assumption rank(input) >= rank(thin SVD) still stands for the input,
but the method does not try to verify it explicitly (which would be
kind of hard); instead it either produces 0 eigenvectors or runs into
block deficiency.

It is, however, hard to tell whether a block deficiency stems from
input size deficiency or from split size deficiency, and neither
situation is typical for real-life SSVD applications, hence the
somewhat vague error message.
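
A minimal sketch of the two failure cases described above, in plain Java and not taken from Mahout. It assumes the total row count m and the smallest number of rows in any split are known from outside the job; the class, enum, and method names are hypothetical.

```java
/** Hypothetical helper, not part of Mahout: classifies why SSVD blocking could fail. */
public final class BlockDeficiencyCheck {

    public enum Diagnosis { OK, INPUT_TOO_SHORT, SPLIT_TOO_SHORT }

    /**
     * @param m               total number of rows in the input (known externally)
     * @param minRowsPerSplit smallest number of rows in any input split (known externally)
     * @param k               requested decomposition rank
     * @param p               oversampling parameter
     */
    public static Diagnosis diagnose(long m, long minRowsPerSplit, int k, int p) {
        long blockHeight = (long) k + p;
        if (m < blockHeight) {
            // The mathematical constraint (k+p) <= m is violated.
            return Diagnosis.INPUT_TOO_SHORT;
        }
        if (minRowsPerSplit < blockHeight) {
            // The input is tall enough, but some split holds fewer than k+p rows.
            return Diagnosis.SPLIT_TOO_SHORT;
        }
        return Diagnosis.OK;
    }

    public static void main(String[] args) {
        System.out.println(diagnose(20, 20, 10, 15));    // INPUT_TOO_SHORT
        System.out.println(diagnose(1000, 12, 10, 15));  // SPLIT_TOO_SHORT
        System.out.println(diagnose(1000, 500, 10, 15)); // OK
    }
}
```

In the second case the thread's suggested remedy is to increase minSplitSize; in the first case the only remedy is a smaller k+p.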

On Fri, Aug 10, 2012 at 11:39 AM, Dmitriy Lyubimov <> wrote:
> The easy answer is to ensure (k+p) <= m. It is a mathematical
> constraint, not a peculiarity of the method.
> The only reason the solution doesn't warn you explicitly is that the
> DistributedRowMatrix format, which is just a sequence file of rows,
> gives us no easy way to verify what m actually is before the job
> iterates over the input and runs into block size deficiency. So if
> you know m as external knowledge, it is easy to avoid being trapped
> by block height deficiency.
> On Fri, Aug 10, 2012 at 11:32 AM, Pat Ferrel <> wrote:
>> This is only a test with some trivially simple data. I doubt there are
>> any splits and yes it could easily be done in memory but that is not the
>> purpose. It is based on testKmeansDSVD2, which is in
>> mahout/integration/src/test/java/org/apache/mahout/clustering/
>> I've attached the modified and running version with testKmeansDSSVD
>> As I said I don't think this is a real world test. It tests that the code
>> runs, and it does. Getting the best results is not part of the scope. I
>> just thought if there was an easy answer I could clean up the parameters
>> for SSVDSolver.
>> Since it is working I don't know that it's worth the effort unless people
>> are likely to run into this with larger data sets.
>> Thanks anyway.
>> On Aug 10, 2012, at 11:07 AM, Dmitriy Lyubimov <> wrote:
>> It happens because of internal constraints stemming from blocking. It
>> happens when a split of A (the input) has fewer than (k+p) rows, at
>> which point the blocks are too small (or rather, too short) to
>> successfully perform a QR on.
>> This also means, among other things, that k+p cannot be more than the
>> total number of rows in your input.
>> It is also possible that input A is way too wide or k+p is way too big,
>> so that an arbitrary split does not fetch at least k+p rows of A, but
>> I haven't seen such cases in practice yet. If that happens, there's an
>> option to increase minSplitSize (which would undermine MR mapper
>> efficiency somewhat). But I am pretty sure that is not your case.
>> But if your input is shorter than k+p, then the case is too small for
>> SSVD. In fact, it probably means you can solve the test directly in
>> memory with any solver. You can still use SSVD with k=m and p=0 (I
>> think) in this case and get an exact (non-reduced rank) decomposition
>> equivalent with no stochastic effects, but that is not really what it
>> is for.
>> Assuming your input is m x n, can you tell me please what your m, n, k
>> and p are?
>> thanks.
>> -D
>> On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <> wrote:
>>> There seems to be some internal constraint on k and/or p, which is making
>>> a test difficult. The test has a very small input doc set and choosing the
>>> wrong k it is very easy to get a failure with this message:
>>> java.lang.IllegalArgumentException: new m can't be less than n
>>>        at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(
>>> I have a working test but I had to add some docs to the test data and have
>>> tried to reverse engineer the value for k (desiredRank). I came up with the
>>> following but I think it is only an accident that it works.
>>> int p = 15; // default value for CLI
>>> int desiredRank = sampleData.size() - p - 1; // number of docs - p - 1, ?????? not sure why this works
>>> This seems likely to be an issue only because of the very small data set and
>>> the relationship of rows to columns to p to k. But for the purposes of
>>> creating a test if someone (Dmitriy?) could tell me how to calculate a
>>> reasonable p and k from the dimensions of the tiny data set it would help.
>>> This test is derived from a non-active SVD test but I'd be up for cleaning
>>> it up and including it as an example in the working but non-active tests. I
>>> also fixed a couple trivial bugs in the non-active Lanczos tests for what
>>> it's worth.
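
For a tiny test set like this one, here is a sketch of one way to pick k. It assumes the only constraint in play is (k+p) <= m, as stated elsewhere in this thread, and that the whole input lands in a single split. `SsvdTestParams` and `clampRank` are hypothetical names, not Mahout API, and the row count of 20 is an assumed example value.

```java
/** Hypothetical test helper, not part of Mahout. */
public final class SsvdTestParams {

    /**
     * Returns the largest rank k <= wantedRank such that k + p <= m.
     *
     * @param wantedRank rank the caller would like (must be >= 1)
     * @param m          number of rows (documents) in the input
     * @param p          oversampling parameter (must be >= 0)
     */
    public static int clampRank(int wantedRank, int m, int p) {
        if (wantedRank < 1 || p < 0) {
            throw new IllegalArgumentException("wantedRank must be >= 1 and p >= 0");
        }
        if (m - p < 1) {
            // No positive k can satisfy k + p <= m; p itself has to shrink.
            throw new IllegalArgumentException("input has too few rows for p=" + p);
        }
        return Math.min(wantedRank, m - p);
    }

    public static void main(String[] args) {
        int m = 20; // assumed number of docs in a tiny test set
        int p = 15; // CLI default quoted in the thread
        System.out.println(clampRank(10, m, p)); // prints 5
        System.out.println(clampRank(3, m, p));  // prints 3
    }
}
```

With the formula in the test above, k + p = m - 1, which satisfies (k+p) <= m; that would explain why it happens to work.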
>>> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <> wrote:
>>> Reading "overview and usage" doc linked on that page
>>> should help to clarify outputs and usage.
>>> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <> wrote:
>>>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <> wrote:
>>>>> Quoth Grant Ingersoll:
>>>>>> To put this in bin/mahout speak, this would look like, munging some
>>>>>> names and taking liberties with the actual argument to be passed in:
>>>>>> bin/mahout svd (original -> svdOut)
>>>>>> bin/mahout cleansvd ...
>>>>>> bin/mahout transpose svdOut -> svdT
>>>>>> bin/mahout transpose original -> originalT
>>>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>>>> bin/mahout kmeans newMatrix
>>>>> I'm trying to create a test case from testKmeansDSVD2 to use SSVDSolver.
>>>>> Does SSVD require the EigenVerificationJob to clean the eigen vectors?
>>>> No
>>>>> If so, where does SSVD put the equivalent of
>>>>> DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they should be in V*
>>>>> but SSVD creates V so should I transpose V* then run it through the
>>>>> EigenVerificationJob?
>>>> no
>>>> SSVD is SVD, meaning it produces U and V with no further need to clean that
>>>>> I get errors when I do so trying to figure out if I'm on the wrong track.
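
The bin/mahout sequence quoted above appears to end by multiplying the original matrix by the matrix of right singular vectors, so that k-means runs on rows of reduced width. Below is a minimal sketch of that projection on small dense arrays, in plain Java and not Mahout code. Treating the product as A times V is an assumption here, and `ProjectRows.project` is a hypothetical name.

```java
import java.util.Arrays;

/** Generic illustration of a rank-k projection; not Mahout code. */
public final class ProjectRows {

    /** Returns a * v, where a is m x n and v is n x k. */
    public static double[][] project(double[][] a, double[][] v) {
        int n = v.length;
        int k = v[0].length;
        double[][] out = new double[a.length][k];
        for (int i = 0; i < a.length; i++) {
            if (a[i].length != n) {
                throw new IllegalArgumentException(
                    "row " + i + " has " + a[i].length + " columns, expected " + n);
            }
            for (int j = 0; j < k; j++) {
                double sum = 0.0;
                for (int l = 0; l < n; l++) {
                    sum += a[i][l] * v[l][j];
                }
                out[i][j] = sum;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2, 3}, {4, 5, 6}};   // 2 docs x 3 terms
        double[][] v = {{1, 0}, {0, 1}, {0, 0}}; // 3 terms x 2 retained dimensions
        // prints [[1.0, 2.0], [4.0, 5.0]]
        System.out.println(Arrays.deepToString(project(a, v)));
    }
}
```

Each output row is the corresponding document expressed in k dimensions, which is the form a clustering step such as k-means would consume.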
