mahout-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: SSVD compute U * Sigma
Date Fri, 07 Sep 2012 21:27:30 GMT
On Fri, Sep 7, 2012 at 2:12 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>
> Or clustering cannot map vectors to clusters since the KEYs are lost and only the IDs
> are kept, same for rowsimilarity. The clusteredPoints/part file contains a mapping of cluster
> ID to vector ID, neither of which is a KEY afaik.

I guess I still have a problem understanding your definition of ID. Do
you mean what is returned by NamedVector#getName()?

>
> However it looks like the jobs are preserving the VectorWritable so I'm not sure why
> the IDs are not there. Let me look deeper.

It is just because NamedVector is not supported. There was no use
case for that before yours. Sequence file keys, on the other hand, are
what seq2sparse populates in its output, so they are useful for mapping
results to original documents.

>
> On Sep 7, 2012, at 1:48 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
> just checked again. yes U inherits keys from sequence file of A. (are
> we talking about keys of the sequence files or names of a NamedVector?
> NamedVector is not supported, keys of sequence files are.)
>
> On Fri, Sep 7, 2012 at 1:34 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> More specifically, the way it works, Q matrix inherits keys of A rows
>> (BtJob line 137), and U inherits keys of Q (UJob line 128).
>>
>> On Fri, Sep 7, 2012 at 1:19 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>> On Fri, Sep 7, 2012 at 1:11 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>>>> OK, U * Sigma seems to be working in the patch of SSVDSolver.
>>>>
>>>> However I still have no doc ids in U. Has anyone seen a case where they are
>>>> preserved?
>>>
>>> That should not be the case. Ids in rows of U are inherited from rows
>>> of A. (should be at least).
>>>
>>>>
>>>> For
>>>>   BtJob.run(conf,
>>>>               inputPath,
>>>>               qPath,
>>>>               pcaMeanPath,
>>>>               btPath,
>>>>               minSplitSize,
>>>>               k,
>>>>               p,
>>>>               outerBlockHeight,
>>>>               q <= 0 ? Math.min(1000, reduceTasks) : reduceTasks,
>>>>               broadcast,
>>>>               labelType,
>>>>               q <= 0);
>>>>
>>>> inputPath here contains a distributedRowMatrix with text doc ids.
>>>>
>>>> Bt-job/part-r-00000 has no ids after the BtJob. Not sure where else to look
>>>> for them and BtJob is the only place the input matrix is used, the rest are intermediates
>>>> afaict and anyway don't have ids either.
>>>>
>>>> Is something in BtJob stripping them? It looks like ids are ignored in the
>>>> MR code but maybe it's hidden…
>>>>
>>>> Are the Keys of U guaranteed to be the same as A? If so I could construct
>>>> an index for A and use it on U but it would be nice to get them out of the solver.
>>>
>>> Yes, that's the idea.
>>>
>>> B^t matrix will not have the ids, not sure why you are looking
>>> there. You need the U matrix, which is computed by another job.
>>>
>>>>
>>>> On Sep 7, 2012, at 9:18 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>>
>>>> Yes, you got it, that's what I was proposing before. A very easy patch.
>>>> On Sep 7, 2012 9:11 AM, "Pat Ferrel" <pat.ferrel@gmail.com> wrote:
>>>>
>>>>> U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".
>>>>>
>>>>> WRT using U * Sigma vs. U * Sigma^(1/2): I do want to retain distance
>>>>> proportions for doing clustering and similarity (though not sure if this is
>>>>> strictly required with cosine distance), so I probably want to use U * Sigma
>>>>> instead of sqrt Sigma.
>>>>>
>>>>> Since I have no other reason to load U row by row I could write another
>>>>> transform and keep it out of the mahout core but doing this in a patch
>>>>> seems trivial. Just create a new flag, something like --uSigma (the CLI
>>>>> option looks like the hardest part actually). For the API there needs to be
>>>>> a new setter something like SSVDSolver#setComputeUSigma(true) then do an
>>>>> extra flag check in the setup for the UJob, something like the following
>>>>>
>>>>>    if (context.getConfiguration().get(PROP_U_SIGMA) != null) {
>>>>>      // set from --uSigma option or SSVDSolver#setComputeUSigma(true)
>>>>>      sValues = SSVDHelper.loadVector(sigmaPath, context.getConfiguration());
>>>>>      // sValues.assign(Functions.SQRT);  // no need to take the sqrt for Sigma weighting
>>>>>    }
>>>>>
>>>>> sValues is already applied to U in the map, which would remain unchanged
>>>>> since the sigma weighted (instead of sqrt sigma) values will already be in
>>>>> sValues.
>>>>>
>>>>>    if (sValues != null) {
>>>>>      for (int i = 0; i < k; i++) {
>>>>>        uRow.setQuick(i,
>>>>>                      qRow.dot(uHat.viewColumn(i)) * sValues.getQuick(i));
>>>>>      }
>>>>>    } else {
>>>>>      …
>>>>>
>>>>> I'll give this a try and if it seems reasonable submit a patch.
>>>>>
>>>>> On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>>>>
>>>>>> When using PCA it's also preferable to use --uHalfSigma to create U with
>>>>>> the SSVD solver. One difficulty is that to perform the multiplication you
>>>>>> have to turn the singular values vector (diagonal values) into a
>>>>>> distributed row matrix or write your own multiply function, correct?
>>>>>
>>>>> You could do that, but why? Sigma is a diagonal matrix (which is
>>>>> additionally encoded as a very short vector of singular values of
>>>>> length k, say we denote it as 'sv'). Given that, there's absolutely 0
>>>>> reason to encode it as a distributed row matrix.
>>>>>
>>>>> Multiplication can be done on the fly as you load U, row by row:
>>>>> U*Sigma[i,j]=U[i,j]*sv[j]
>>>>>
>>>>> One inconvenience with that approach is that it assumes you can freely
>>>>> hack the code that loads U matrix for further processing.
>>>>>
>>>>> It is much easier to have SSVD output U*Sigma directly using the
>>>>> same logic as above (requires a patch) or just have it output
>>>>> U*Sigma^0.5 (does not require a patch).
>>>>>
>>>>> You could even use U in some cases directly, but part of the problem
>>>>> is that data variances will be normalized in all directions compared
>>>>> to PCA space, which will affect actual distances between data points.
>>>>> If you want to retain proportions of the directional variances as in
>>>>> your original input, you need to use principal components with scaling
>>>>> applied, i.e. U*Sigma.
>>>>>
>>>>>
>>>>>
>>>>
>
>
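
The row-by-row scaling described in the quoted messages, U*Sigma[i,j]=U[i,j]*sv[j], can be sketched in plain Java as follows. This is only an illustration under stated assumptions: it uses in-memory maps and arrays rather than Mahout's sequence files and Vector classes, and the names USigmaSketch and scaleBySigma are hypothetical, not part of SSVDSolver or UJob. Each output row is stored under the same key as its input row, mirroring the point that the keys of U are inherited from the rows of A.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not Mahout code.
final class USigmaSketch {

    // Scales every row of U by the singular values: out[i][j] = u[i][j] * sv[j].
    // Row keys are carried over unchanged, so results can still be mapped
    // back to the original documents.
    static Map<String, double[]> scaleBySigma(Map<String, double[]> uRows, double[] sv) {
        Map<String, double[]> scaled = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> entry : uRows.entrySet()) {
            double[] row = entry.getValue();
            if (row.length != sv.length) {
                throw new IllegalArgumentException(
                    "row " + entry.getKey() + " has " + row.length
                        + " columns, expected " + sv.length);
            }
            double[] out = new double[row.length];
            for (int j = 0; j < row.length; j++) {
                out[j] = row[j] * sv[j];
            }
            scaled.put(entry.getKey(), out);
        }
        return scaled;
    }

    public static void main(String[] args) {
        // Made-up keys and values, for illustration only.
        Map<String, double[]> u = new LinkedHashMap<>();
        u.put("doc-a", new double[] {1.0, 2.0});
        u.put("doc-b", new double[] {-1.0, 4.0});
        double[] sv = {3.0, 0.5};

        for (Map.Entry<String, double[]> e : scaleBySigma(u, sv).entrySet()) {
            System.out.println(e.getKey() + " -> " + Arrays.toString(e.getValue()));
        }
    }
}
```

Because Sigma is diagonal, the whole product costs one multiplication per entry of U, and sv only needs to hold k values in memory. To get U*Sigma^0.5 instead, take the square root of each entry of sv before calling the method.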
