mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: is it possible to compute the SVD for a large scale matrix
Date Wed, 06 Apr 2011 19:01:30 GMT
On Wed, Apr 6, 2011 at 11:26 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> so, assuming 500 oversampled svalues is equivalent to perhaps 300
> 'good' values.... depending on decay... so 300 singular values would
> require 300 passes over the whole input? or only a sub-part of it?
> Given it takes about 20 s just to set up a MR run and 10 sec to
> confirm its completion, that's just what... about 100-150 minutes
> just in initialization time?
>

In general, yes, DistributedLanczosSolver is dominated by startup
costs for nearly all data sets I've used.
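As a back-of-envelope check of the numbers in the thread (the ~20 s setup and ~10 s confirmation per job are Dmitriy's figures; treating 300 singular values as 300 MR jobs is an assumption about the pass count):

```python
# Rough estimate of MapReduce startup overhead for a k-pass Lanczos run,
# using the per-job figures quoted in the thread (assumed, not measured).
def startup_overhead_minutes(k_passes, setup_s=20, confirm_s=10):
    """Total job setup/confirmation time, in minutes, for k_passes MR jobs."""
    return k_passes * (setup_s + confirm_s) / 60.0

# 300 'good' singular values -> roughly 300 passes over the input.
print(startup_overhead_minutes(300))  # -> 150.0 minutes of pure overhead
```

which lands right in the 100-150 minute range Dmitriy estimated, before any actual computation or I/O.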


> Also, the size of the problem must also affect sorting i/o time
> (unless all jobs are map-only, but i don't think they can be). That's
>

And they're not map-only; there is a shuffle on every pass, but the
combiners are pretty well utilized, so the shuffle is pretty small.


> kind of at least proportional to the size of the input. so I guess
> problem size does matter, not just the # of available slots for the
> mappers.
>
>
> On Wed, Apr 6, 2011 at 11:16 AM, Jake Mannix <jake.mannix@gmail.com> wrote:
> > Hmmm... that's a really tiny data set.  Lanczos-based SVD, for k singular
> > values, requires k passes over the data, and each row which has d non-zero
> > entries will do d^2 computations in each pass.  So if there are n rows in
> > the data set, it's k*n*d^2 if all rows are the same size.
> > I guess "how long" depends on how big the cluster is!
> >
> >> On Wed, Apr 6, 2011 at 11:14 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> >>
> >> Jake, since we are on the topic, what might the running time of
> >> Lanczos be on a ~1G sequence file input?
> >>
> >> On Wed, Apr 6, 2011 at 11:11 AM, Jake Mannix <jake.mannix@gmail.com>
> >> wrote:
> >> >
> >> >
> >> > On Thu, Mar 24, 2011 at 11:03 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> >> >>
> >> >> you can certainly try to write it out into a DRM (distributed row
> >> >> matrix) and run stochastic SVD on  hadoop (off the trunk now). see
> >> >> MAHOUT-593. This is suitable if you have a good decay of singular
> >> values (but if you don't, it probably just means you have so much noise
> >> that it masks the problem you are trying to solve in your data).
> >> >
> >> > You don't need to run it as stochastic, either.  The regular
> >> > LanczosSolver
> >> > will work on this data, if it lives as a DRM.
> >> >
> >> >   -jake
> >
> >
>
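The cost model quoted in the thread (k passes, with ~d^2 work per row of d non-zeros) can be sketched as a quick estimator; the example inputs below are illustrative placeholders, not measurements from any actual cluster:

```python
# Rough operation count for Lanczos SVD per the k*n*d^2 model in the thread:
# k singular values -> k passes over the data; each row with d non-zero
# entries does ~d^2 computations per pass.
def lanczos_ops(k, n_rows, d_nonzeros):
    """Total operations, assuming all n_rows rows have d_nonzeros entries."""
    return k * n_rows * d_nonzeros ** 2

# Hypothetical example: k=300, one million rows, ~100 non-zeros per row.
print(lanczos_ops(300, 1_000_000, 100))  # -> 3000000000000 (3e12 operations)
```

Together with the startup-overhead estimate earlier in the thread, this is why both the number of slots and the problem size matter: the per-pass compute and shuffle scale with the input, while the fixed per-job cost scales only with k.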
