mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: SSVD for dimensional reduction + Kmeans
Date Fri, 10 Aug 2012 19:53:16 GMT
BTW if you really are trying to reduce dimensionality, you may want to
consider --pca option with SSVD, that [i think] will provide with much
better preserved data variance then just clean SVD (i.e. essentially
run a PCA space transformation on your data rather than just SVD)

-d

On Fri, Aug 10, 2012 at 11:57 AM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
> Got it. Well on to some real and much larger data sets then…
>
> On Aug 10, 2012, at 11:53 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
> i think actually Mahout's Lanczos requires external knowledge of input
> size too, in part for similar reasons. SSVD doesn't because it doesn't
> have "other" reasons to know input size but fundamental assumption
> rank(input)>=rank(thin SVD) still stands about the input but the
> method doesn't have a goal of verifying it explicitly (which would be
> kind of hard), and instead either produces 0 eigenvectors or runs into
> block deficiency.
>
> It is however hard to assert whether block deficiency stemmed from
> input size deficiency vs. split size deficiency, and neither of
> situations is typical for a real-life SSVD applications, hence error
> message is somewhat vague.
>
> On Fri, Aug 10, 2012 at 11:39 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> The easy answer is to ensure (k+p)<= m. It is mathematical constraint,
>> not a method pecularity.
>>
>> The only reason the solution doesn't warn you explicitly is because
>> DistributedRowMatrix format, which is just a sequence file of rows,
>> would not provide us with an easy way to verify what m actually is
>> before it actually iterates over it and runs into block size
>> deficiency. So if you now m as an external knowledge, it is easy to
>> avoid being trapped by block height defiicency.
>>
>>
>> On Fri, Aug 10, 2012 at 11:32 AM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>>> This is only a test with some trivially simple data. I doubt there are any splits
and yes it could easily be done in memory but that is not the purpose. It is based on testKmeansDSVD2,
which is in
>>> mahout/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
>>> I've attached the modified and running version with testKmeansDSSVD
>>>
>>> As I said I don't think this is a real world test. It tests that the code runs,
and it does. Getting the best results is not part of the scope. I just thought if there was
an easy answer I could clean up the parameters for SSVDSolver.
>>>
>>> Since it is working I don't know that it's worth the effort unless people are
likely to run into this with larger data sets.
>>>
>>> Thanks anyway.
>>>
>>>
>>>
>>>
>>> On Aug 10, 2012, at 11:07 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>
>>> It happens because of internal constraints stemming from blocking. it
>>> happens when a split of A (input) has less than (k+p) rows at which
>>> point blocks are too small (or rather, to short) to successfully
>>> perform a QR on .
>>>
>>> This also means, among other things, k+p cannot be more than your
>>> total number of rows in the input.
>>>
>>> It is also possible that input A is way too wide or k+p is way too big
>>> so that an arbitrary split does not fetch at least k+p rows of A, but
>>> in practice i haven't seen such cases in practice yet. If that
>>> happens, there's an option to increase minSplitSize (which would
>>> undermine MR mappers efficiency  somewhat). But i am pretty sure it is
>>> not your case.
>>>
>>> But if your input is shorter than k+p, then it is a case too small for
>>> SSVD. in fact, it probably means you can solve test directly in memory
>>> with any solver. You can still use SSVD with k=m and p=0 (I think) in
>>> this case and get exact (non-reduced rank) decomposition equivalent
>>> with no stochastic effects, but that is not what it is for really.
>>>
>>> Assuming your input is m x n, can you tell me please what your m, n, k
>>> and p are?
>>>
>>> thanks.
>>> -D
>>>
>>> On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>>>> There seems to be some internal constraint on k and/or p, which is making
a test difficult. The test has a very small input doc set and choosing the wrong k it is very
easy to get a failure with this message:
>>>>
>>>> java.lang.IllegalArgumentException: new m can't be less than n
>>>>        at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>>>>
>>>> I have a working test but I had to add some docs to the test data and have
tried to reverse engineer the value for k (desiredRank). I came up with the following but
I think it is only an accident that it works.
>>>>
>>>> int p = 15; //default value for CLI
>>>> int desiredRank = sampleData.size() - p - 1;//number of docs - p - 1, ??????
not sure why this works
>>>>
>>>> This seems likely to be an issue only because of the very small data set
and the relationship of rows to columns to p to k. But for the purposes of creating a test
if someone (Dmitriy?) could tell me how to calculate a reasonable p and k from the dimensions
of the tiny data set it would help.
>>>>
>>>> This test is derived from a non-active SVD test but I'd be up for cleaning
it up and including it as an example in the working but non-active tests. I also fixed a couple
trivial bugs in the non-active Lanczos tests for what it's worth.
>>>>
>>>>
>>>> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>>
>>>> Reading "overview and usage" doc linked on that page
>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>>>> should help to clarify outputs and usage.
>>>>
>>>>
>>>> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
wrote:
>>>>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <pat.ferrel@gmail.com>
wrote:
>>>>>> Quoth Grant Ingersoll:
>>>>>>> To put this in bin/mahout speak, this would look like, munging
some names and taking liberties with the actual argument to be passed in:
>>>>>>>
>>>>>>> bin/mahout svd (original -> svdOut)
>>>>>>> bin/mahout cleansvd ...
>>>>>>> bin/mahout transpose svdOut -> svdT
>>>>>>> bin/mahout transpose original -> originalT
>>>>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>>>>> bin/mahout kmeans newMatrix
>>>>>>
>>>>>> I'm trying to create a test case from testKmeansDSVD2 to use SSVDSolver.
Does SSVD require the EigenVerificationJob to clean the eigen vectors?
>>>>>
>>>>> No
>>>>>
>>>>>> if so where does SSVD put the equivalent of DistributedLanczosSolver.RAW_EIGENVECTORS?
Seems like they should be in V* but SSVD creates V so should I transpose V* then run it through
the EigenVerificationJob?
>>>>> no
>>>>>
>>>>> SSVD is SVD, meaning it produces U and V with no further need to clean
that
>>>>>
>>>>>> I get errors when I do so trying to figure out if I'm on the wrong
track.
>>>>
>>>
>>>
>

Mime
View raw message