# mahout-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Soliciting SSVD documentation review
Date Wed, 30 Nov 2011 18:49:31 GMT
Thank you, Nathan.

On Wed, Nov 30, 2011 at 9:49 AM, Nathan Halko <nathan@spotinfluence.com> wrote:

> Yes I will time the phases.  My largest dataset is only a couple of gigs
> currently; I ran into the 5G limit on Amazon S3 and need to find a
> workaround.  I figured that might be large enough to see scaling using the
> small instances, but maybe not.  I will work on these issues and see what
> happens, thanks for your help Dmitriy.
>
> Nathan
>
> On Tue, Nov 29, 2011 at 3:24 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> wrote:
>
> > ok thanks. I will file an issue for default p.
> >
> > also i updated the docs re: --reduceTasks.
> >
> > it would be nice if you could log time for map and reduce phases for
> > all tasks (it is reported in MR web ui at namenode:50030 by default)
> > in each case if you think there's a performance issue. It would at
> > least allow us to narrow any problem to a particular part of computation.
> > My datasets are too small ~10G, and i run them for a rather small k,
> > at that size i don't see any visible irregularities.
> >
> > Thanks.
> > -Dmitriy
> >
> > On Tue, Nov 29, 2011 at 2:12 PM, Nathan Halko <nathan@spotinfluence.com>
> > wrote:
> > > Thanks for the heads up with numReduceTasks.  I haven't changed the
> > > parameters yet much from the default so this is probably my problem.
> > >
> > > By slave I mean machine, I'm running an m1.small as master and either
> > > m1.small's or m1.large's as slaves (datanode, tasktracker, child).
> > >
> > > p depends mostly on the decay of singular values rather than the rank k.
> > > In fact (in the analysis at least) it is completely independent of k.
> > > The quantity of interest is sig_k/sig_(k+p) (signal to noise ratio); this
> > > should be large.  Ideally we would set p as a function of this parameter,
> > > which is dependent on the matrix (and unknown until we have already solved
> > > the problem  :-) ).  I suggest 25 since, for example, tf-idf matrices have
> > > a low sig/noise ratio.  You could probably use less in some cases; if you
> > > need p to be larger you probably need a power iteration, so it seems to be
> > > a good default point.  Also the parameter is not an initial point of
> > > optimization, so to err on the larger side is fine.  After all, Lanczos
> > > method suggests that only 1/3 of singular triplets are accurate, which
> > > corresponds to p=2k, which is very large.  Basically, the result is
> > > insensitive to the exact value of p so long as it is large 'enough'.
> > >
> > > On Tue, Nov 29, 2011 at 12:32 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > > wrote:
> > >
> > >> PPS also make sure you specify numReduceTasks. Default is I believe 1
> > >> which will not scale at multiplication steps at all.
> > >>
> > >> On Tue, Nov 29, 2011 at 10:15 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > >> wrote:
> > >> > PS actually i think it should scale horizontally a little better than
> > >> > vertically but that's just a guess.
> > >> >
> > >> > On Tue, Nov 29, 2011 at 10:10 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > >> > wrote:
> > >> >> On Tue, Nov 29, 2011 at 9:56 AM, Nathan Halko <nathan@spotinfluence.com>
> > >> >> wrote:
> > >> >>>
> > >> >>> The docs look great Dmitriy.  Has anyone considered giving oversampling
> > >> >>> ssvd over lanczos which is promising.  Trying to scale out horizontally
> > >> >>> but not seeing any difference between using one slave or many slaves.
> > >> >>> Any ideas? (I won't go into detail about the setup here but if this
> > >> >>> sounds familiar I'd like to talk more).
> > >> >>
> > >> >> What do you mean by a slave? a mapper? a machine?
> > >> >>
> > >> >> whether you increase input horizontally or vertically, you should see
> > >> >> more mappers. If your cluster has enough capacity to schedule all
> > >> >> mappers right away, i believe you will get almost the same time (i.e.
> > >> >> almost linear scaling) for most of the jobs.
> > >> >>
> > >> >>> The basic problem with lanczos in the distributed
> > >> >>> environment seems to be that a matrix-vector multiply is not enough
> > >> >>> work to offset any setup costs, also there is not a distributed
> > >> >>> orthogonalization with lanczos and I'm getting OOM's making it
> > >> >>> difficult to scale.  I would still like to contribute what results I
> > >> >>> have found but I'm short on time so nothing besides work directly
> > >> >>> related to the completion of my thesis will happen until that is done.
> > >> >>>
> > >> >>
> > >> >>> On Fri, Nov 25, 2011 at 5:37 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > >> >>> wrote:
> > >> >>>
> > >> >>> > I attached the latex source as well (lyx, actually). I would've used
> > >> >>> > Wiki if it supported mathjax. So anyone can modify the usage if need
> > >> >>> > be. (Anyone who has lyx anyway).
> > >> >>> >
> > >> >>> > Dev docs were attached to several jira issues (and i
> > >> >>> > entries), if you want more recent copies of them moved over
> > >> >>> > to wiki, i'd be happy to. Mainly, so far there are 2 working notes,
> > >> >>> > one for original method, and another for power iterations, attached
> > >> >>> > to corresponding jiras.
> > >> >>> >
> > >> >>> >
> > >> >>> > On Fri, Nov 25, 2011 at 4:26 PM, Grant Ingersoll <gsingers@apache.org>
> > >> >>> > wrote:
> > >> >>> > > I hooked it into the Algorithms page.
> > >> >>> > >
> > >> >>> > > How do you intend to keep the PDF up to date?  I like the focus
> > >> >>> > > more on the user, but it would also be good to have some dev docs.
> > >> >>> > >
> > >> >>> > > Also, with both Lanczos and this it would be good if we could hook
> > >> >>> > > them into some real examples.
> > >> >>> > >
> > >> >>> > > On Nov 25, 2011, at 5:42 PM, Dmitriy Lyubimov wrote:
> > >> >>> > >
> > >> >>> > >> Hi,
> > >> >>> > >>
> > >> >>> > >> I put a usage and overview doc for SSVD onto wiki. I'd appreciate
> > >> >>> > >> if somebody else could look thru it, to scan for completeness and
> > >> >>> > >> suggestions.
> > >> >>> > >>
> > >> >>> > >> I tried to approach it as a user-facing documentation, i.e. I
> > >> >>> > >> tried to avoid discussing any implementation specifics.
> > >> >>> > >>
> > >> >>> > >> I had several users and Nathan Halko trying it out and actually
> > >> >>> > >> favorably commenting on its scalability vs. Lanczos but i don't
> > >> >>> > >> know first hand of any production use (even our own use is fairly
> > >> >>> > >> limited (in terms of input volume we ever processed) and actually
> > >> >>> > >> somewhat diverged from this Mahout implementation). Perhaps putting
> > >> >>> > >> it more in front of users will help to receive more feedback.
> > >> >>> > >>
> > >> >>> > >> Thanks.
> > >> >>> > >> -Dmitriy
> > >> >>> > >
> > >> >>> > > --------------------------------------------
> > >> >>> > > Grant Ingersoll
> > >> >>> > > http://www.lucidimagination.com
> > >> >>> > >
> > >> >>> > >
> > >> >>> > >
> > >> >>> > >
> > >> >>> >
> > >>
> >
>
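
Nathan's point in the thread above, that p depends on how fast the singular values decay and not on the rank k, can be illustrated with a small sketch. Everything here is hypothetical and not part of Mahout: the helper names `snr_ratio` and `smallest_oversampling`, the two made-up geometric spectra, and the target ratio of 10 are assumptions chosen only to show the idea. In practice the spectrum is unknown until the problem is solved, which is why the thread settles on a generous fixed default.

```python
def snr_ratio(sigmas, k, p):
    """Ratio of the k-th to the (k+p)-th singular value (1-based indices).

    `sigmas` is assumed to be sorted in descending order.
    """
    return sigmas[k - 1] / sigmas[k + p - 1]


def smallest_oversampling(sigmas, k, min_ratio):
    """Smallest p >= 1 whose ratio reaches min_ratio, or None if none does."""
    for p in range(1, len(sigmas) - k + 1):
        # A zero tail value means the ratio is unbounded, so this p suffices.
        if sigmas[k + p - 1] == 0 or snr_ratio(sigmas, k, p) >= min_ratio:
            return p
    return None


# Two hypothetical spectra of 200 singular values each (illustrative only).
fast_decay = [0.5 ** i for i in range(200)]  # each value is half the previous
slow_decay = [0.9 ** i for i in range(200)]  # each value is 90% of the previous

for k in (10, 50):
    print(k,
          smallest_oversampling(fast_decay, k, 10.0),
          smallest_oversampling(slow_decay, k, 10.0))
```

For these geometric spectra the required p is the same for k=10 and k=50 (4 for the fast decay, 22 for the slow one), so only the decay rate matters. A slowly decaying spectrum needs a much larger p to reach the same ratio, which is consistent with the suggestion to add a power iteration when p would otherwise have to grow.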

