mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Need a little help with SVD / Dimensional Reduction
Date Tue, 07 Jun 2011 20:14:45 GMT
The tradoff is of course the stochastic noise (lanczos would be
significantly more precise, esp. on random data (i.e. without clear
trends), but then why would you try to figure trends on a random
data?), but you get to manage precision by increasing p parameter at
expense of more memory and flops.

(the reuters example i ran was done on k+p=200 singular values of
which LSI probably should use no more than k=50).

On Tue, Jun 7, 2011 at 1:08 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> You can check what i summed up in javadoc for SSVDSolver to get some idea.
>
> but on top what i have already mentioned, the main difference is that
> it presumably doesn't require as much resources (reuters dataset runs
> thru with default -Xmx200m per task setting just fine) and theoretical
> flops are quite less than that of Lanczos which probably means it
> should rip thru the same volume faster. (my practical run showed that
> it takes about the same time as seq2sparse that produced it's input,
> and that was before sparse vector bug fix which caused significant
> slow down on sparse data).
>
> I don't think you should 'modify' your progam, especially if you need
> to show practical result within a week. but you might help me by
> creating a version of your program that takes it and we can see
> together what it takes to get it working for you.
>
> -d
>
> On Tue, Jun 7, 2011 at 1:01 PM, Stefan Wienert <stefan@wienert.cc> wrote:
>> Before I rewrite my program, is there any advantage over the lanczos svd?
>>
>> 2011/6/7 Dmitriy Lyubimov <dlieu.7@gmail.com>:
>>> I am saying i did not test it with 0.20.2
>>>
>>> Yes it is integrated in 0.5 release but there might be problems with
>>> hadoop 0.20.2
>>>
>>> On Tue, Jun 7, 2011 at 12:55 PM, Stefan Wienert <stefan@wienert.cc> wrote:
>>>> Hmm... looks nice...
>>>>
>>>> So there is a Lanczos implementation of SVD and the stochastic version
>>>> of SVD. Both produce the doc-concept vectors that I need.
>>>> I acctualy get my TFIDF Vectors directly from a lucene index (and have
>>>> to do some magic to get IntWritable, VectorWritable).
>>>>
>>>> Still, I do not exactly understand what you trying to say. Your SSVD
>>>> does not run with mahout but on CDH (what is this btw?)? Or is it
>>>> available for mahout? So what has to be modified to run it with
>>>> mahout?
>>>>
>>>> Thanks
>>>> Stefan
>>>>
>>>>
>>>>
>>>> 2011/6/7 Dmitriy Lyubimov <dlieu.7@gmail.com>:
>>>>> I also do LSI/LSA and wrote a stochastic svd that is capable to
>>>>> produce exact U, V and Sigma ,
>>>>> or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.
>>>>>
>>>>> On top of that, it is adjusted to be LSI-friendly if your keys are not
>>>>> necessarily DoubleWritable (you can still keep your original document
>>>>> paths or whatever they are identified with), that info is migrated as
>>>>> keys of the output of U matrix. So you can run it directly on the
>>>>> output of the Mahout's seq2sparse (I actually tested that on reuters
>>>>> dataset example from Mahout In Action).
>>>>>
>>>>> The computation has a stochastic noise to it, but LSI problems are
>>>>> never exact problems anyway and their result are subject to corpus
>>>>> selection.
>>>>>
>>>>> If you are interested to help me to work kinks out in Mahout's
>>>>> version, I'd be grateful since i don't run the method on 0.20.2 but on
>>>>> CDH in a customized way. But be warned that it may require a number of
>>>>> patches before it works for you.
>>>>>
>>>>> Here is (a little bit too wordy) command line manual for Mahout 0.5.
>>>>> http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html
>>>>>
>>>>> Thanks.
>>>>>
>>>>> -D
>>>>>
>>>>>
>>>>> On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <stefan@wienert.cc>
wrote:
>>>>>> Yeha :) That's what I assumed. So this problem is solved. Thanks!
>>>>>>
>>>>>> And now your questions:
>>>>>>
>>>>>> I want to do LSA/LSI with Mahout. Therefore I take the TDIDF
>>>>>> Document-Term-Matrix and reduce it using Lanczos. So I get the
>>>>>> Document-Concept-Vectors. These Vectors will be compared with cosine
>>>>>> similarity to find similar documents. I can also cluster these
>>>>>> documents with k-means...
>>>>>>
>>>>>> If you have any suggestions, feel free to tell me. I'm also interested
>>>>>> in other document-similarity-techniques.
>>>>>>
>>>>>> Cheers,
>>>>>> Stefan
>>>>>>
>>>>>> 2011/6/6 Danny Bickson <danny.bickson@gmail.com>:
>>>>>>> If I understand your question correctly, you need to simply transpose
M
>>>>>>> before you start the run and that way
>>>>>>> you will get the other singular vectors.
>>>>>>>
>>>>>>> May I ask what is the problem you are working on and why do you
need the
>>>>>>> singular vectors?
>>>>>>> Can you consider using another matrix decomposition technique
for example
>>>>>>> alternating least squares
>>>>>>> which gives you two lower rank matrices which simulates the large
decomposed
>>>>>>> matrix?
>>>>>>>
>>>>>>> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <stefan@wienert.cc>
wrote:
>>>>>>>
>>>>>>>> Hi Danny!
>>>>>>>>
>>>>>>>> I understand that for M*M' (and for M'*M) the left and right
>>>>>>>> eigenvectors are identical. But that is not exactly what
I want. The
>>>>>>>> lanczos solver from mahout gives me the eigenvectors of M*M',
which
>>>>>>>> are the left singular vectors of M. But I need the right
singular
>>>>>>>> vectors of M (and not M*M'). How do I get them?
>>>>>>>>
>>>>>>>> Sorry, my matrix math is not as good as it should be, but
I hope you
>>>>>>>> can help me!
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>> 2011/6/6 Danny Bickson <danny.bickson@gmail.com>:
>>>>>>>> > Hi Stefan!
>>>>>>>> > For a positive semidefinite matrix, the lest and right
eigenvectors are
>>>>>>>> > identical.
>>>>>>>> > See SVD wikipeida text: When *M* is also positive
>>>>>>>> > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,
>>>>>>>> > the decomposition *M* = *U**D**U* * is also a singular
value
>>>>>>>> decomposition.
>>>>>>>> > So you don't need to be worried about the other singular
vectors.
>>>>>>>> >
>>>>>>>> > Hope this helps!
>>>>>>>> >
>>>>>>>> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <stefan@wienert.cc>
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> >> Hi.
>>>>>>>> >>
>>>>>>>> >> Thanks for the help.
>>>>>>>> >>
>>>>>>>> >> The important points from wikipedia are:
>>>>>>>> >> - The left singular vectors of M are eigenvectors
of M*M' .
>>>>>>>> >> - The right singular vectors of M are eigenvectors
of M'*M.
>>>>>>>> >>
>>>>>>>> >> as you describe, the mahout lanczos solver calculate
A=M'*M (I think
>>>>>>>> >> it does A=M*M', but it is not a problem). Therefore
it does already
>>>>>>>> >> calculate the right (or left) singular vector of
M.
>>>>>>>> >>
>>>>>>>> >> But my question is, how can I get the other singular
vector? I can
>>>>>>>> >> transpose M, but then I have to calculated two SVDs,
one for the right
>>>>>>>> >> and one for the left singular value... I think there
is a better way
>>>>>>>> >> :)
>>>>>>>> >>
>>>>>>>> >> Hope you can help me with this...
>>>>>>>> >> Thanks
>>>>>>>> >> Stefan
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> 2011/6/6 Danny Bickson <danny.bickson@gmail.com>:
>>>>>>>> >> > Hi
>>>>>>>> >> > Mahout SVD implementation computes the Lanzcos
iteration:
>>>>>>>> >> > http://en.wikipedia.org/wiki/Lanczos_algorithm
>>>>>>>> >> > Denote the non-square input matrix as M. First
a symmetric matrix A is
>>>>>>>> >> > computed by A=M'*M
>>>>>>>> >> > Then an approximating tridiagonal matrix T
and a vector matrix V are
>>>>>>>> >> > computed such that A =~ V*T*V'
>>>>>>>> >> > (this process is done in a distributed way).
>>>>>>>> >> >
>>>>>>>> >> > Next the matrix T is next decomposed into eigenvectors
and
>>>>>>>> eignevalues.
>>>>>>>> >> > Which is the returned result. (This process
>>>>>>>> >> > is serial).
>>>>>>>> >> >
>>>>>>>> >> > The third step makes the returned eigenvectors
orthogonal to each
>>>>>>>> other
>>>>>>>> >> > (which is optional IMHO).
>>>>>>>> >> >
>>>>>>>> >> > The heart of the code is found at:
>>>>>>>> >> >
>>>>>>>> >>
>>>>>>>> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
>>>>>>>> >> > At least that is where it was in version 0.4
I am not sure if there
>>>>>>>> are
>>>>>>>> >> > changes in version 0.5
>>>>>>>> >> >
>>>>>>>> >> > Anyway, Mahout does not compute directly SVD.
If you are interested in
>>>>>>>> >> > learning more about the relation to SVD
>>>>>>>> >> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,
>>>>>>>> >> > subsection: relation to eigenvalue decomposition.
>>>>>>>> >> >
>>>>>>>> >> > Hope this helps,
>>>>>>>> >> >
>>>>>>>> >> > Danny Bickson
>>>>>>>> >> >
>>>>>>>> >> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert
<stefan@wienert.cc>
>>>>>>>> >> wrote:
>>>>>>>> >> >
>>>>>>>> >> >> After reading this thread:
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >>
>>>>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E
>>>>>>>> >> >>
>>>>>>>> >> >> Wiki-SVD: M = U S V* (* = transposed)
>>>>>>>> >> >>
>>>>>>>> >> >> The output of Mahout-SVD is (U S) right?
>>>>>>>> >> >>
>>>>>>>> >> >> So... How do I get V from (U S)  and M?
>>>>>>>> >> >>
>>>>>>>> >> >> Is V = M (U S)* (because this is, what
the calculation in the example
>>>>>>>> >> is)?
>>>>>>>> >> >>
>>>>>>>> >> >> Thanks
>>>>>>>> >> >> Stefan
>>>>>>>> >> >>
>>>>>>>> >> >> 2011/6/6 Stefan Wienert <stefan@wienert.cc>:
>>>>>>>> >> >> >
>>>>>>>> >>
>>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>>>>>> >> >> >
>>>>>>>> >> >> > What is done:
>>>>>>>> >> >> >
>>>>>>>> >> >> > Input:
>>>>>>>> >> >> > tf-idf-matrix (docs x terms) 6076937
x 20444
>>>>>>>> >> >> >
>>>>>>>> >> >> > "SVD" of tf-idf-matrix (rank 100)
produces the eigenvector (and
>>>>>>>> >> >> > eigenvalues) of tf-idf-matrix, called:
>>>>>>>> >> >> > svd (concepts x terms) 87 x 20444
>>>>>>>> >> >> >
>>>>>>>> >> >> > transpose tf-idf-matrix:
>>>>>>>> >> >> > tf-idf-matrix-transpose (terms x docs)
20444 x 6076937
>>>>>>>> >> >> >
>>>>>>>> >> >> > transpose svd:
>>>>>>>> >> >> > svd-transpose (terms x concepts) 20444
x 87
>>>>>>>> >> >> >
>>>>>>>> >> >> > matrix multiply:
>>>>>>>> >> >> > tf-idf-matrix-transpose x svd-transpose
= result
>>>>>>>> >> >> > (terms x docs) x (terms x concepts)
= (docs x concepts)
>>>>>>>> >> >> >
>>>>>>>> >> >> > so... I do understand, that the "svd"
here is not SVD from
>>>>>>>> wikipedia.
>>>>>>>> >> >> > It only does the Lanczos algorithm
and some magic which produces
>>>>>>>> the
>>>>>>>> >> >> >> Instead either the left or right
(but usually the right)
>>>>>>>> eigenvectors
>>>>>>>> >> >> premultiplied by the diagonal or the square
root of the
>>>>>>>> >> >> >> diagonal element.
>>>>>>>> >> >> > from
>>>>>>>> >> >>
>>>>>>>> >>
>>>>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
>>>>>>>> >> >> >
>>>>>>>> >> >> > so my question: what is the output
of the SVD in mahout. And what
>>>>>>>> do I
>>>>>>>> >> >> > have to calculate to get the "right
singular value" from svd?
>>>>>>>> >> >> >
>>>>>>>> >> >> > Thanks,
>>>>>>>> >> >> > Stefan
>>>>>>>> >> >> >
>>>>>>>> >> >> > 2011/6/6 Stefan Wienert <stefan@wienert.cc>:
>>>>>>>> >> >> >>
>>>>>>>> >> >>
>>>>>>>> >>
>>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> the last step is the matrix multiplication:
>>>>>>>> >> >> >>  --arg --numRowsA --arg 20444
\
>>>>>>>> >> >> >>  --arg --numColsA --arg 6076937
\
>>>>>>>> >> >> >>  --arg --numRowsB --arg 20444
\
>>>>>>>> >> >> >>  --arg --numColsB --arg 87 \
>>>>>>>> >> >> >> so the result is a 6,076,937 x
87 matrix
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> the input has 6,076,937 (each
with 20,444 terms). so the result of
>>>>>>>> >> >> >> matrix multiplication has to be
the right singular value regarding
>>>>>>>> to
>>>>>>>> >> >> >> the dimensions.
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> so the result is the "concept-document
vector matrix" (as I think,
>>>>>>>> >> >> >> these is also called "document
vectors" ?)
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> 2011/6/6 Ted Dunning <ted.dunning@gmail.com>:
>>>>>>>> >> >> >>> Yes.  These are term vectors,
not document vectors.
>>>>>>>> >> >> >>>
>>>>>>>> >> >> >>> There is an additional step
that can be run to produce document
>>>>>>>> >> >> vectors.
>>>>>>>> >> >> >>>
>>>>>>>> >> >> >>> On Sun, Jun 5, 2011 at 1:16
PM, Stefan Wienert <stefan@wienert.cc
>>>>>>>> >
>>>>>>>> >> >> wrote:
>>>>>>>> >> >> >>>
>>>>>>>> >> >> >>>> compared to SVD, is the
result is the "right singular value"?
>>>>>>>> >> >> >>>>
>>>>>>>> >> >> >>>
>>>>>>>> >> >> >>
>>>>>>>> >> >> >>
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> --
>>>>>>>> >> >> >> Stefan Wienert
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> http://www.wienert.cc
>>>>>>>> >> >> >> stefan@wienert.cc
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> Telefon: +495251-2026838
>>>>>>>> >> >> >> Mobil: +49176-40170270
>>>>>>>> >> >> >>
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> > --
>>>>>>>> >> >> > Stefan Wienert
>>>>>>>> >> >> >
>>>>>>>> >> >> > http://www.wienert.cc
>>>>>>>> >> >> > stefan@wienert.cc
>>>>>>>> >> >> >
>>>>>>>> >> >> > Telefon: +495251-2026838
>>>>>>>> >> >> > Mobil: +49176-40170270
>>>>>>>> >> >> >
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >> --
>>>>>>>> >> >> Stefan Wienert
>>>>>>>> >> >>
>>>>>>>> >> >> http://www.wienert.cc
>>>>>>>> >> >> stefan@wienert.cc
>>>>>>>> >> >>
>>>>>>>> >> >> Telefon: +495251-2026838
>>>>>>>> >> >> Mobil: +49176-40170270
>>>>>>>> >> >>
>>>>>>>> >> >
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> --
>>>>>>>> >> Stefan Wienert
>>>>>>>> >>
>>>>>>>> >> http://www.wienert.cc
>>>>>>>> >> stefan@wienert.cc
>>>>>>>> >>
>>>>>>>> >> Telefon: +495251-2026838
>>>>>>>> >> Mobil: +49176-40170270
>>>>>>>> >>
>>>>>>>> >
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Stefan Wienert
>>>>>>>>
>>>>>>>> http://www.wienert.cc
>>>>>>>> stefan@wienert.cc
>>>>>>>>
>>>>>>>> Telefon: +495251-2026838
>>>>>>>> Mobil: +49176-40170270
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Stefan Wienert
>>>>>>
>>>>>> http://www.wienert.cc
>>>>>> stefan@wienert.cc
>>>>>>
>>>>>> Telefon: +495251-2026838
>>>>>> Mobil: +49176-40170270
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Stefan Wienert
>>>>
>>>> http://www.wienert.cc
>>>> stefan@wienert.cc
>>>>
>>>> Telefon: +495251-2026838
>>>> Mobil: +49176-40170270
>>>>
>>>
>>
>>
>>
>> --
>> Stefan Wienert
>>
>> http://www.wienert.cc
>> stefan@wienert.cc
>>
>> Telefon: +495251-2026838
>> Mobil: +49176-40170270
>>
>

Mime
View raw message