mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Why is Lanczos deprecated?
Date Fri, 02 Aug 2013 22:21:19 GMT
On Fri, Aug 2, 2013 at 3:08 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

>
>
>
> On Fri, Aug 2, 2013 at 2:52 PM, Fernando Fernández <
> fernando.fernandez.gonzalez@gmail.com> wrote:
>
>> I don't agree with k>10 being unlikely meaningful. I've used SVD in text
>> mining problems where k~150 yielded best results (not only a good choice
>> based on plotting eigenvalues and seeing elbow in decay was near 150 but
>> checking results with different k's and seeing around 150 made much more
>> sense). Currently I'm working in a recommender system and already have
>> Lanczos running with k~50 producing best results, again, based on visual
>> exploration of eigenvalues and exploring results one by one and seeing
>> they
>> were more meaningful. Current tests with SSVD are based on the latter and
>> when I say I'm not getting good results I mean Lanczos is working properly
>> on the same problem (I've explored eigenvalues up to 150 and have a good
>> decay) and SSVD is not, but as I said, this might be caused by some bug in
>> the input process, seems to strange to me that results are so different so
>>
>
> Depends on how you define "so". But again, in that respect all i can point
> to is to the accuracy study by N. Halko, out of published work.
>
I guess i can save you digging thru Mahout wiki, here is the reference
http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf.
Specifically, look at eigen values chart comparison at page  179. This is
run on Mahout's Lanczos and SSVD neck-to-neck. The order of accuracy for
first 40 values is claimed as "Order of accuracy is q = 3; q = 2; q = 1,
lanczos, q = 0." (see source for details of accuracy assessment).

One thing i did not understand there is why Lanczos showed such
uncharacteristic values fall-off for values between 40 and 60. I have
always assumed -q=1 was showing something much closer to reality after
first 40 values as well.


>
>> I'll get back to this discussions when I figure it out :) . If you are
>> curious about the numbers: 1MM rows by 150k columns for text mining case
>> and 18 MM rows by 80k columns for recommender.
>>
>> About p and q, I have been playing around with movielens 100k dataset and
>> found q>0 actually worsens results in terms of precision (nothing severe
>> though, but it happens) and its better to increase p a little in that
>> particular case, so my guess is it depends a lot on the dataset though I
>> don't know how.
>>
>
> This again sounds very strange.  The algorithm is non-deterministic, which
> means errors you get in one run, will be different from errors in another
> run, but honesly, you would be the first to report that power iterations
> worsen expectation of an error. All theoretical work and practical
> estimates did not confirm that observation; in fact, quite a bit to the
> contrary.
>
>
>>
>> 2013/8/2 Dmitriy Lyubimov <dlieu.7@gmail.com>
>>
>> > the only time you would not get good results is if spectrum does not
>> have a
>> > good decay. Which is equivalent to mostly same variance in most of
>> original
>> > basis directions. This problem is similar to problem that arises with
>> PCA
>> > when you try to do dimensionality reduction with retaining certain
>> %-tage
>> > of variance. in case of flat spectrum decay, you'd need much bigger k to
>> > retain same amount of variance in dimensionally reduced projection. In
>> that
>> > sense SSVD solution for a given k is as good as PCA gets for the same k.
>> > Also, i believe (but not 100% sure) "problems too small" exhibit higher
>> > errors due to the law of large numbers.
>> >
>> >
>> > On Fri, Aug 2, 2013 at 10:41 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> > wrote:
>> >
>> > > if you use k > 40 you are already beating Lanczos for larger datasets.
>> > > k>10 is unlikely meaninful. p need not be more than 15% of k (default
>> is
>> > > 15). use q=1, q>1 does not yield tangible improvements in real world.
>> > >  Again, see Nathan Halko's dissertation on accuracy comparison.
>> > >
>> > >
>> > >
>> > > On Fri, Aug 2, 2013 at 4:17 AM, Fernando Fernández <
>> > > fernando.fernandez.gonzalez@gmail.com> wrote:
>> > >
>> > >> Keeping Lanczos would be nice, Like I said, it's currently being
>> used in
>> > >> some projects with good results and I think it's easier to tune so
it
>> > >> would
>> > >> be my first choice for future developments. I still need to further
>> test
>> > >> SSVD, specially because in the current example I'm working it yields
>> > very
>> > >> different results from Lanczos. We are investigating if it can be due
>> > to a
>> > >> bug when loading the data, though dimensions of the ouptut seem ok,
>> or
>> > if
>> > >> it's a question of increasing p or q parameters. If it's a question
>> of
>> > >> increasing p and q I think running times would make SSVD not viable.
>> I
>> > >> hope
>> > >> to be able to provide some comparison figures in terms of precision
>> and
>> > >> running time in a month or so.
>> > >>
>> > >> I hope that other users reads this and say wether they are using
>> > Lanczos.
>> > >>
>> > >> Best,
>> > >> Fernando.
>> > >>
>> > >> 2013/8/2 Sebastian Schelter <ssc@apache.org>
>> > >>
>> > >> > I would also be fine with keeping if there is demand. I just
>> proposed
>> > to
>> > >> > deprecate it and nobody voted against that at that point in time.
>> > >> >
>> > >> > --sebastian
>> > >> >
>> > >> >
>> > >> > On 02.08.2013 03:12, Dmitriy Lyubimov wrote:
>> > >> > > There's a part of Nathan Halko's dissertation referenced
on
>> > algorithm
>> > >> > page
>> > >> > > running comparison.  In particular, he was not able to compute
>> more
>> > >> than
>> > >> > 40
>> > >> > > eigenvectors with Lanczos on wikipedia dataset. You may refer
to
>> > that
>> > >> > > study.
>> > >> > >
>> > >> > > On the accuracy part, it was not observed that it was a problem,
>> > >> assuming
>> > >> > > high level of random noise is not the case, at least not
in
>> LSA-like
>> > >> > > application used there.
>> > >> > >
>> > >> > > That said, i am all for diversity of tools, I would actually
be
>> +0
>> > on
>> > >> > > deprecating Lanczos, it is not like we are lacking support
for
>> it.
>> > >> SSVD
>> > >> > > could use improvements too.
>> > >> > >
>> > >> > >
>> > >> > > On Thu, Aug 1, 2013 at 3:15 AM, Fernando Fernández <
>> > >> > > fernando.fernandez.gonzalez@gmail.com> wrote:
>> > >> > >
>> > >> > >> Hi everyone,
>> > >> > >>
>> > >> > >> Sorry if I duplicate the question but I've been looking
for an
>> > answer
>> > >> > and I
>> > >> > >> haven't found an explanation other than it's not being
used
>> > (together
>> > >> > with
>> > >> > >> some other algorithms). If it's been discussed in depth
before
>> > maybe
>> > >> you
>> > >> > >> can point me to some link with the discussion.
>> > >> > >>
>> > >> > >> I have successfully used Lanczos in several projects
and it's
>> been
>> > a
>> > >> > >> surprise to me finding that the main reason (according
to what
>> I've
>> > >> read
>> > >> > >> that might not be the full story) is that it's not being
used.
>> At
>> > the
>> > >> > >> begining I supposed it was because SSVD is supposed to
be much
>> > faster
>> > >> > with
>> > >> > >> similar results, but after making some tests I have found
that
>> > >> running
>> > >> > >> times are similar or even worse than lanczos for some
>> > configurations
>> > >> (I
>> > >> > >> have tried several combinations of parameters, given
child
>> > processes
>> > >> > enough
>> > >> > >> memory, etc. and had no success in running SSVD at least
in 3/4
>> of
>> > >> time
>> > >> > >> Lanczos runs, thouh they might be some combinations of
>> parameters I
>> > >> have
>> > >> > >> still not tried). It seems to be quite tricky to find
a good
>> > >> > combination of
>> > >> > >> parameters for SSVD and I have seen also a precision
loss in
>> some
>> > >> > examples
>> > >> > >> that makes me not confident in migrating Lanczos to SSVD
from
>> now
>> > on
>> > >> > (How
>> > >> > >> far can I trust results from a combination of parameters
that
>> runs
>> > in
>> > >> > >> significant less time, or at least a good time?).
>> > >> > >>
>> > >> > >> Can someone convince me that SSVD is actually a better
option
>> than
>> > >> > Lanczos?
>> > >> > >> (I'm totally willing to be convinced... :) )
>> > >> > >>
>> > >> > >> Thank you very much in advance.
>> > >> > >>
>> > >> > >> Fernando.
>> > >> > >>
>> > >> > >
>> > >> >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message