From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Soliciting SSVD documentation review
Date Tue, 29 Nov 2011 22:24:58 GMT
ok thanks. I will file an issue for default p.

also i updated the docs re: --reduceTasks.

it would be nice if you could log time for map and reduce phases for
all tasks (it is reported in MR web ui at namenode:50030 by default)
in each case if you think there's a performance issue. It would at
least allow us to narrow any problem down to a particular part of the
computation. My datasets are too small, ~10G, and i run them for a
rather small k; at that size i don't see any visible irregularities.

Thanks.
-Dmitriy

On Tue, Nov 29, 2011 at 2:12 PM, Nathan Halko <nathan@spotinfluence.com> wrote:
> Thanks for the heads up with numReduceTasks.  I haven't changed the
> parameters much from the defaults yet, so this is probably my problem.
>
> By slave I mean machine, I'm running an m1.small as master and either
> m1.small's or m1.large's as slaves (datanode, tasktracker, child).
>
> p depends mostly on the decay of singular values rather than the rank k.
> In fact (in the analysis at least) it is completely independent of k.  The
> quantity of interest is sig_k/sig_(k+p) (signal to noise ratio); this should
> be large.  Ideally we would set p as a function of this parameter, which is
> dependent on the matrix (and unknown until we have already solved the
> problem  :-) ).  I suggest 25 since, for example, tf-idf matrices have a low
> sig/noise ratio.  You could probably use less in some cases; if you need p
> to be larger you probably need a power iteration, so it seems to be a good
> default point.  Also the parameter is not an initial point of optimization,
> so to err on the larger side is fine.  After all, Lanczos method suggests
> that only 1/3 of singular triplets are accurate, which corresponds to p=2k,
> which is very large.  Basically, the result is insensitive to the exact
> value of p so long as it is large 'enough'.
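
A minimal sketch of the ratio described above, assuming NumPy and a
synthetic spectrum. The function name signal_to_noise and the power-law
decay sigma_i = 1/sqrt(i) are illustrative assumptions, not anything
taken from Mahout or from a real tf-idf matrix. It only shows how
sig_k/sig_(k+p) grows with p when the singular values decay slowly.

```python
import numpy as np


def signal_to_noise(singular_values, k, p):
    """Return sigma_k / sigma_(k+p), using 1-based indices as in the text."""
    # Sort in descending order so that index 0 holds the largest value.
    s = np.sort(np.asarray(singular_values, dtype=float))[::-1]
    if k < 1 or p < 0 or k + p > len(s):
        raise ValueError("need k >= 1, p >= 0 and k + p <= number of values")
    return s[k - 1] / s[k + p - 1]


# Hypothetical slowly decaying spectrum (assumption): sigma_i = 1 / sqrt(i).
sigma = 1.0 / np.sqrt(np.arange(1, 501))

k = 50
for p in (0, 5, 25, 100):
    # With slow decay the ratio grows only gradually as p increases.
    print(p, signal_to_noise(sigma, k, p))
```

For a real matrix the spectrum is unknown in advance, which is the point
made above: the ratio can only be checked after the fact, so a generous
default for p is the practical choice.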
>
> On Tue, Nov 29, 2011 at 12:32 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
>> PPS also make sure you specify numReduceTasks. Default is, I believe, 1,
>> which will not scale at multiplication steps at all.
>>
>> On Tue, Nov 29, 2011 at 10:15 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> wrote:
>> > PS actually i think it should scale horizontally a little better than
>> > vertically but that's just a guess.
>> >
>> > On Tue, Nov 29, 2011 at 10:10 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> >> On Tue, Nov 29, 2011 at 9:56 AM, Nathan Halko <nathan@spotinfluence.com> wrote:
>> >>>
>> >>> The docs look great Dmitriy.  Has anyone considered giving oversampling
>> >>> ssvd over lanczos which is promising.  Trying to scale out horizontally but
>> >>> not seeing any difference between using one slave or many slaves.  Any
>> >>> ideas? (I won't go into detail about the setup here but if sounds familiar
>> >>> I'd like to talk more).
>> >>
>> >> What do you mean by a slave? a mapper? a machine?
>> >>
>> >> whether you increase input horizontally or vertically, you should see
>> >> more mappers. If your cluster has enough capacity to schedule all
>> >> mappers right away, i believe you will get almost the same time (i.e.
>> >> almost linear scaling) for most of the jobs.
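
A rough sketch of the scheduling argument above, under simplifying
assumptions: every map task takes the same time, and the cluster runs
tasks in "waves" limited by the number of free map slots. The function
name and the numbers are hypothetical; reduce and shuffle time are
ignored.

```python
import math


def estimated_map_phase_time(num_mappers, map_slots, seconds_per_task):
    """Estimate map phase time as (number of waves) * (time per task)."""
    if num_mappers < 0 or map_slots < 1:
        raise ValueError("need num_mappers >= 0 and map_slots >= 1")
    # All mappers that fit into the available slots run concurrently;
    # the rest wait for the next wave.
    waves = math.ceil(num_mappers / map_slots)
    return waves * seconds_per_task


# Doubling the input doubles the mappers. With enough slots the time
# stays flat; without them a second wave is needed.
print(estimated_map_phase_time(40, 40, 60))
print(estimated_map_phase_time(80, 80, 60))
print(estimated_map_phase_time(80, 40, 60))
```

Under this model, adding machines only helps while the number of mappers
exceeds the number of slots, which is one thing to check when more
slaves make no difference.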
>> >>
>> >>> The basic problem with lanczos in the distributed
>> >>> environment seems to be that a matrix-vector multiply is not enough work to
>> >>> offset any setup costs, also there is not a distributed orthogonalization
>> >>> with lanczos and I'm getting OOM's making it difficult to scale.  I would
>> >>> still like to contribute what results I have found but I'm short on time so
>> >>> nothing besides work directly related to the completion of my thesis will
>> >>> happen until that is done.
>> >>>
>> >>
>> >>> On Fri, Nov 25, 2011 at 5:37 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> >>>
>> >>> > I attached the latex source as well (lyx, actually). I would've used
>> >>> > Wiki if it supported mathjax. So anyone can modify the usage if need
>> >>> > be. (Anyone who has lyx anyway).
>> >>> >
>> >>> > Dev docs were attached to several jira issues (and i had blog
>> >>> > entries), if you want more recent copies of them moved over
>> >>> > to wiki, i'd be happy to. Mainly, so far there are 2 working notes,
>> >>> > one for original method, and another for power iterations, attached to
>> >>> > corresponding jiras.
>> >>> >
>> >>> >
>> >>> > On Fri, Nov 25, 2011 at 4:26 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>> >>> > > I hooked it into the Algorithms page.
>> >>> > >
>> >>> > > How do you intend to keep the PDF up to date?  I like the focus more on
>> >>> > > the user, but it would also be good to have some dev docs.
>> >>> > >
>> >>> > > Also, with both Lanczos and this it would be good if we could hook them
>> >>> > > into some real examples.
>> >>> > >
>> >>> > > On Nov 25, 2011, at 5:42 PM, Dmitriy Lyubimov wrote:
>> >>> > >
>> >>> > >> Hi,
>> >>> > >>
>> >>> > >> I put a usage and overview doc for SSVD onto wiki. I'd appreciate if
>> >>> > >> somebody else could look thru it, to scan for completeness and
>> >>> > >> suggestions.
>> >>> > >>
>> >>> > >> I tried to approach it as a user-facing documentation, i.e. I tried to
>> >>> > >> avoid discussing any implementation specifics.
>> >>> > >>
>> >>> > >> I had several users and Nathan Halko trying it out and actually
>> >>> > >> favorably commenting on its scalability vs. Lanczos but i don't know
>> >>> > >> first hand of any production use (even our own use is fairly limited
>> >>> > >> (in terms of input volume we ever processed) and actually somewhat
>> >>> > >> diverged from this Mahout implementation). Perhaps putting it more in
>> >>> > >> front of users will help to receive more feedback.
>> >>> > >>
>> >>> > >> Thanks.
>> >>> > >> -Dmitriy
>> >>> > >
>> >>> > > --------------------------------------------
>> >>> > > Grant Ingersoll
>> >>> > > http://www.lucidimagination.com


