mahout-dev mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Sequential access to VectorWritable content proposal.
Date Mon, 13 Dec 2010 19:27:37 GMT
Ted, there are some vectors where it certainly matters: the eigenvectors of
a really big matrix are dense, and thus take O(8*num_vertices) bytes to hold
just one of them in memory.  Doing something sequential with these can
certainly make sense, and in some cases is actually necessary, esp. if done
in the mappers or reducers where there is less memory than you usually
have...

  -jake
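
A minimal Java sketch of the element-by-element idea discussed in this
thread. It assumes a made-up layout of one int length followed by that many
doubles, which is not VectorWritable's real wire format, and the names
StreamingVectorSketch, ElementHandler, stream, encode and squaredNorm are
hypothetical, not Mahout API. The point is only that a single pass over the
stream needs O(1) extra memory instead of the 8 bytes per element of a
materialized dense vector.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class StreamingVectorSketch {

    /** Hypothetical callback invoked once per element; not part of Mahout. */
    public interface ElementHandler {
        void onElement(int index, double value);
    }

    /**
     * Reads a dense vector stored as [int size][double] * size (an assumed
     * layout) and passes each element to the handler without ever
     * allocating a double[size]. Returns the number of elements read.
     */
    public static int stream(DataInput in, ElementHandler handler) throws IOException {
        int size = in.readInt();
        for (int i = 0; i < size; i++) {
            handler.onElement(i, in.readDouble());
        }
        return size;
    }

    /** Serializes values in the assumed layout; used only to build demo input. */
    public static byte[] encode(double[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(values.length);
        for (double v : values) {
            out.writeDouble(v);
        }
        out.flush();
        return bytes.toByteArray();
    }

    /** Sum of squares computed in one sequential pass over the stream. */
    public static double squaredNorm(DataInput in) throws IOException {
        double[] acc = new double[1]; // single-slot accumulator for the lambda
        stream(in, (index, value) -> acc[0] += value * value);
        return acc[0];
    }

    public static void main(String[] args) throws IOException {
        byte[] data = encode(new double[] {3.0, 0.0, 4.0});
        DataInput in = new DataInputStream(new ByteArrayInputStream(data));
        System.out.println(squaredNorm(in)); // prints 25.0
    }
}
```

A handler like this can only see each element once and in order, which is
the trade-off Ted's question about a single-pass vector type points at.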

On Mon, Dec 13, 2010 at 11:12 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> I really don't see this as a big deal even with crazy big vectors.
>
> Looking at web scale, for instance, the most linked Wikipedia article only
> has 10 million in-links or so.  On the web, the most massive web site is
> unlikely to have >100 million in-links.  Both of these fit in very modest
> amounts of memory.
>
> Where's the rub?
>
> On Mon, Dec 13, 2010 at 11:05 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> wrote:
>
> > Jake,
> > No, I was trying exactly what you proposed some time ago on the list:
> > I am trying to keep long vectors from occupying a lot of memory.
> >
> > E.g. a dense vector with 1 million elements requires 8 MB just to load.
> > My point is that there are a lot of sequential techniques that could
> > instead supply a handler that inspects the vector element by element
> > without having to preallocate those 8 MB.
> >
> > For vectors of 1 million elements this isn't too scary, but with default
> > Hadoop memory settings it starts to be so at around 50-100 MB (or 5-10
> > million non-zero elements). Stochastic SVD will survive that, but it
> > means less memory for blocking, and the more blocks you have, the more
> > CPU it requires. (CPU demand is only linear in the number of blocks, and
> > only in a significantly smaller part of the computation, so only an
> > insignificant part of the total flops depends on the number of blocks;
> > still, some part does.)
> >
> > Like I said, it would also address the case where rows don't fit in
> > memory (hence no memory bound on n of A), but the most immediate benefit
> > is to the speed, scalability, and memory requirements of SSVD in most
> > practical LSI cases.
> >
> > -Dmitriy
> >
> > On Mon, Dec 13, 2010 at 10:24 AM, Jake Mannix <jake.mannix@gmail.com>
> > wrote:
> >
> > > Hey Dmitriy,
> > >
> > >  I've also been playing around with a VectorWritable format which is
> > > backed by a SequenceFile, but I've been focussed on the case where
> > > it's essentially the entire matrix, and the rows don't fit into
> > > memory.  This seems different than your current use case, however -
> > > you just want (relatively) small vectors to load faster, right?
> > >
> > >  -jake
> > >
> > > On Mon, Dec 13, 2010 at 10:18 AM, Ted Dunning <ted.dunning@gmail.com>
> > > wrote:
> > >
> > > > Interesting idea.
> > > >
> > > > Would this introduce a new vector type that only allows iterating
> > > > through the elements once?
> > > >
> > > > On Mon, Dec 13, 2010 at 9:49 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I would like to submit a patch to VectorWritable that allows
> > > > > streaming access to vector elements without having to prebuffer
> > > > > all of them first. (The current code allows only the latter.)
> > > > >
> > > > > That patch would eliminate one of the memory usage issues in the
> > > > > current Stochastic SVD implementation and effectively remove the
> > > > > memory bound on n for the SVD work. (The value I see is not so
> > > > > much in lifting the bound as in using memory more efficiently,
> > > > > which essentially speeds up the computation.)
> > > > >
> > > > > If it's OK, I would like to create a JIRA issue and provide a
> > > > > patch for it.
> > > > >
> > > > > A separate issue would be to provide an SSVD patch that depends on
> > > > > that VectorWritable patch.
> > > > >
> > > > > Thank you.
> > > > > -Dmitriy
> > > > >
> > > >
> > >
> >
>
