Thanks, this is very helpful. Your explanation of how SSVD works and the dense matrices it
produces may explain why calculating U * Sigma * V^t is not practical. I take it you are saying
that it is not practical, not that the logic/math is wrong? But anyway, here is my reasoning,
and maybe there is another way to skin the cat…
First, I don't require LSA/PCA; it just seems like a useful technique given the application
below. I do require the benefits of DR, on the assumption that it can be used to reduce the
effect of input noise. My assumption is that PCA will add the benefit of amplifying the important
factors in the input, though this must be verified in use.
Goals:
• Cluster and calculate vector similarity on input data taken from web pages. This works
now, but noise in the term space is often pretty high even after applying term-scrubbing techniques
like stemming, TF-IDF, use of the smart parser Boilerpipe, etc.
• Since the data is somewhat noisy, project the A vectors into LSA/PCA space, performing
DR at the same time. Then run the projected vectors through clustering and vector similarity.
This can be done by running the analysis on U * Sigma (if I can figure out how to get clustering
to recognize the input format of U * Sigma; this still blocks me).
• Project the vectors back into the original term space. Here the assumption is that the
back-projected vectors will be "cleaner" in some sense.
• Drop the back-projected vectors into an existing term-based UI, and enable other term-based
analysis like vector-based Solr queries.
I can give several examples of how the term-space representation of vectors is used. Suffice
it to say they all work fine with a non-LSA analysis pipeline. Basically, I'm now trying to
use some DR and/or LSA/PCA techniques to improve results, but I still need to look at results
in term space. If this can't be done, maybe I have to rely on projected vectors for some analysis
and use the original vectors for other analysis or the UI.
If my logic is basically sound, then the question remains: how do we get vectors through DR
and/or LSA/PCA and back into term space?
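For concreteness, here is a minimal numpy sketch of the round trip I have in mind. The sizes, the dense matrix A, and numpy's svd are all hypothetical stand-ins for the real sparse corpus and for SSVD:

```python
import numpy as np

# Hypothetical toy corpus: rows are documents, columns are terms.
# Dense numpy stands in for the real sparse term matrix and for SSVD.
rng = np.random.default_rng(0)
A = rng.random((6, 5))   # 6 docs x 5 terms
k = 2                    # reduced rank

# Reduced-rank SVD: A ~= U_k * diag(s_k) * Vt_k
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Project into the reduced LSA space (U*Sigma, equivalent to A*V_k);
# clustering and vector similarity would run on these rows.
proj = U_k * s_k

# Back-project into term space: AHat = U*Sigma*V^t, same shape as A
# but rank k -- hopefully "cleaner" for the term-based UI.
A_hat = proj @ Vt_k
```

The point being that the back projection is just a multiply by V^t, if the pieces can be gotten out of the job.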
On Sep 10, 2012, at 9:12 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
It is in the doc, but in short, the gist of the pca options is as follows:
Normally, PCA conversion is a projection onto an (offset) orthogonal basis
subspace which retains most of the variance in the original data. It so
happens that such a basis corresponds to the right singular vectors of (A - M),
hence the projection is formed by (A - M)*V, which is approximated by
U*Sigma of a reduced-rank svd of (A - M). (Assuming row-vector data
points.) This is the scope of our PCA task here.
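A toy dense sketch of that equivalence, with hypothetical data and numpy's svd standing in for SSVD of the centered matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((8, 4))   # row-vector data points

# M: every row is the column mean of A, so (A - M) is the centered data.
M = A.mean(axis=0)
C = A - M

U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 2
# PCA projection: (A - M)*V_k, which equals U_k*Sigma_k of svd(A - M).
proj_via_V = C @ Vt[:k].T
proj_via_U = U[:, :k] * s[:k]
```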
Now, consider a typical document corpus of, say, 1 million documents.
Typically (depending on stemming) the total dictionary size would be
anywhere from 10^5 to 10^6, but let's assume 10^5 for our purposes.
However, an individual document will perhaps have 100 terms on average.
Thus the PCA input for such a corpus will be a 1M x 100k sparse matrix,
but the number of nonzero elements in it will be just 1M x 100 ==
100M. This is a fairly small problem for SSVD to solve.
However, we want to run SVD on (A - M), where each row of M is the column
mean of A. The column mean is a dense vector, hence M is a dense matrix,
and so is (A - M).
Then in our hypothetical case we observe that (A - M) takes 1M x 100k
nonzero elements, which means it takes 1000 times more space than A,
and its ssvd will take, asymptotically speaking, (1000)^(3/2) more
flops than for A, that is, almost 5 orders of magnitude more svd
computational work.
However, the SSVD algorithm may be amended so that it never has to
compute svd(A - M) directly. I will not go into detail, but in essence the
bottom line is that it reduces the computational expense to the equivalent
of svd(A), i.e. it saves almost 5 orders of magnitude in our hypothetical
corpus PCA case.
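The arithmetic, spelled out with the same hypothetical corpus figures as above:

```python
# Hypothetical corpus figures from the example above.
docs, dict_size, terms_per_doc = 10**6, 10**5, 100

nnz_A = docs * terms_per_doc   # ~100M nonzeros in the sparse A
nnz_C = docs * dict_size       # (A - M) is dense: ~100B entries

space_blowup = nnz_C / nnz_A       # 1000x more space
flop_blowup = space_blowup ** 1.5  # ~3.2e4, i.e. ~4.5 orders of magnitude
```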
Hope it makes things clearer.
On Mon, Sep 10, 2012 at 7:25 AM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
> Another issue with the SSVD job options is that if you want to use SSVD but keep the
> input in the original dimension space (term space in many cases), you would do the following:
> * create the input matrix A based on the input dimensions (terms)
> * calculate the full transform, which retains the output in term space: AHat = U*Sigma*V^t
> * at the end, AHat should be in term space but transformed by DR and Sigma-weighted for
> PCA, right?
> * then AHat can be substituted for A wherever analysis or examination needs the original
> dimension definitions (terms).
>
> The problem with the options is that when you set uHalfSigma OR vHalfSigma, it sets
> the sVectors to sqrt(Sigma), and that will be applied to both U and V, since UJob and VJob
> only check whether the sVectors exist and then both apply them. In other words, setting
> either uHalfSigma OR vHalfSigma applies sqrt(Sigma) to BOTH U and V.
>
> To do U*Sigma*V^t, the SSVD code would have to be changed, or U*Sigma would have
> to be calculated outside SSVD (an ugly alternative).
>
> But please correct me where I'm wrong.
>
>
> On Sep 8, 2012, at 10:52 AM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>
> I appreciate it and believe that this will help others too. I also agree that we should
think this one through to see if it is the correct approach.
>
> I need to figure out why the row ids/keys of the input matrix are not getting through
> clustering. Keys/row ids are getting through rowsimilarity when applied to U*S, so why not
> clustering? In other words, with rowsimilarity I can map the results back to the original
> input rows (documents in my case).
>
> As to the U*S option, I agree. I modified the code to take a new uSigma option, but it
> is mutually exclusive with uHalfSigma, which indicates that neither option should be a boolean.
> They both also imply calculation of U. I assume this is what you meant below.
>
> Another thing to consider, and here my ignorance shows through… The U*S (equivalent
> to A*V) transform into V-space must be reversible so that humans can see results in terms
> of the original term space. Weights based on the new basis are not really human-understandable.
> But setting me straight here may be another conversation.
>
> On Sep 7, 2012, at 4:52 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
> I can do a patch to propagate the names of named vectors from A to U too,
> if that's a requirement for what you do. But we need to make sure it
> solves your problem. I am still not sure what the IDs are in your
> definition, and what is required for kmeans.
>
> Thinking of that, it's probably a worthy patch anyway. I'll write
> something up along with API changes for A*Sigma outputs. I think since
> there are so many output options, they should be redesigned not to be
> mutually exclusive.
>
> On Fri, Sep 7, 2012 at 4:37 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>> Yes, I would love to use named vectors. But no matter, doing a key-to-row lookup
>> is easy enough.
>>
>> I'm not getting any id at all in the cluster data, not even a key for a row.
>>
>> I'm beginning to think this is a clustering problem, since rowsimilarity at least
>> gives me row keys to identify the objects associated with a given object.
>>
>> On Sep 7, 2012, at 2:59 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>
>> yeah, seq2sparse seems to have an nv option now to churn out named
>> vectors too. It doesn't seem to be listed in the MIA book though.
>>
>> On Fri, Sep 7, 2012 at 2:55 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>> On Fri, Sep 7, 2012 at 2:27 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>> Sequence file keys, on the other hand, are what's populated by the
>>>> seq2sparse output, so they are useful for mapping
>>>> results to the original documents.
>>>
>>> Although honestly I am not so sure about seq2sparse anymore. It has
>>> been a while since I last looked at it.
>>
>
>
