On Wed, Apr 6, 2011 at 11:26 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> so, assuming 500 oversampled svalues is equivalent to perhaps 300
> 'good' values.... depending on decay... so 300 singular values would
> require 300 passes over the whole input? or only subpart of it?
> Given it takes about 20 s just to set up an MR run and 10 s to
> confirm its completion, that's just what... about 100-150 minutes
> just in initialization time?
>
In general, yes, DistributedLanczosSolver is dominated by startup
costs for nearly all data sets I've used.
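
As a back-of-the-envelope check of the figures quoted above, here is a
minimal Python sketch. It assumes one MR job per pass and a fixed 20 s
setup plus 10 s completion check per job; the function and parameter
names are made up for illustration and are not part of Mahout.

```python
def startup_overhead_minutes(passes, setup_s=20.0, confirm_s=10.0):
    """Fixed per-job overhead summed over all passes, in minutes.

    Assumes one MapReduce job per pass (an assumption for this sketch),
    with the same setup and completion-check cost for every job.
    """
    return passes * (setup_s + confirm_s) / 60.0


# 300 passes at 30 s of fixed overhead each
print(startup_overhead_minutes(300))  # 150.0
```

This counts only the fixed overhead; the time spent actually reading the
input and computing comes on top of it.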
> Also, the size of the problem must also affect sorting i/o time
> (unless all jobs are map-only, but I don't think they can be). That's
>
And they're not map-only: there is a shuffle on every pass, but the
combiners are pretty well utilized, so the shuffle is pretty small.
> kind of at least proportional to the size of the input. so I guess
> problem size does matter, not just the # of available slots for the
> mappers.
>
>
> On Wed, Apr 6, 2011 at 11:16 AM, Jake Mannix <jake.mannix@gmail.com>
> wrote:
> > Hmmm... that's a really tiny data set. Lanczos-based SVD, for k singular
> > values, requires k passes over the data, and each row which has d nonzero
> > entries will do d^2 computations in each pass. So if there are n rows in
> > the data set, it's k*n*d^2 if all rows are the same size.
> > I guess "how long" depends on how big the cluster is!
> >
> > On Wed, Apr 6, 2011 at 11:14 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > wrote:
> >>
> >> Jake, since we are on the topic, what might the running time of
> >> Lanczos be on ~1G worth of sequence file input?
> >>
> >> On Wed, Apr 6, 2011 at 11:11 AM, Jake Mannix <jake.mannix@gmail.com>
> >> wrote:
> >> >
> >> >
> >> > On Thu, Mar 24, 2011 at 11:03 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >> > wrote:
> >> >>
> >> >> you can certainly try to write it out into a DRM (distributed row
> >> >> matrix) and run stochastic SVD on hadoop (off the trunk now). see
> >> >> MAHOUT-593. This is suitable if you have a good decay of singular
> >> >> values (but if you don't it probably just means you have so much
> >> >> noise that it masks the problem you are trying to solve in your data).
> >> >
> >> > You don't need to run it as stochastic, either. The regular
> >> > LanczosSolver will work on this data, if it lives as a DRM.
> >> >
> >> > jake
> >
> >
>
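
For reference, the k*n*d^2 estimate quoted above can be written as a
short Python sketch. It is only the rough operation count from the
thread, not a runtime prediction; the function names are hypothetical,
and the second function generalizes the estimate to rows of different
sizes by summing d^2 per row, which is an assumption on my part.

```python
def lanczos_ops_uniform(k, n, d):
    """Rough operation count for k passes over n rows,
    each with d nonzero entries: k * n * d^2."""
    return k * n * d ** 2


def lanczos_ops(k, nonzeros_per_row):
    """Same estimate when rows differ in size: k * sum(d_i^2).
    Reduces to k * n * d^2 when every row has d nonzeros."""
    return k * sum(d * d for d in nonzeros_per_row)


# Example: 300 singular values, 1,000,000 rows, 10 nonzeros per row
print(lanczos_ops_uniform(300, 1_000_000, 10))  # 30000000000
```

Since the work per pass grows with d^2, a few very dense rows can
dominate the total even when most rows are sparse.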
