mahout-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Sequential access to VectorWritable content proposal.
Date Mon, 13 Dec 2010 19:05:48 GMT
Jake,
No, I was trying exactly what you were proposing some time ago on the list. I
am trying to keep long vectors from occupying a lot of memory.

E.g. a dense vector of 1 million elements requires 8 MB (1 million doubles at
8 bytes each) just to load. My point is that there are plenty of sequential
techniques that could use a handler to inspect the vector element by element
without having to preallocate those 8 MB.
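
To illustrate the idea, here is a minimal Java sketch. It is not the actual
VectorWritable code or the proposed patch: the class name, the ElementHandler
callback, and the wire format (an int length followed by that many doubles)
are all assumptions made up for this example. It only shows that a reader can
hand each element to a callback as it is decoded, so no 8 MB array is ever
allocated on the reading side.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class StreamingVectorSketch {

    /** Hypothetical callback: receives one element at a time. */
    public interface ElementHandler {
        void onElement(int index, double value);
    }

    /** Assumed wire format: an int length followed by that many doubles. */
    public static void writeDense(DataOutput out, double[] values) throws IOException {
        out.writeInt(values.length);
        for (double v : values) {
            out.writeDouble(v);
        }
    }

    /**
     * Reads a dense vector element by element and passes each value to the
     * handler. No array of the vector's size is allocated here.
     * Returns the number of elements read.
     */
    public static int streamDense(DataInput in, ElementHandler handler) throws IOException {
        int size = in.readInt();
        for (int i = 0; i < size; i++) {
            handler.onElement(i, in.readDouble());
        }
        return size;
    }

    public static void main(String[] args) throws IOException {
        // Serialize a tiny vector into memory, standing in for a record on disk.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        writeDense(new DataOutputStream(bytes), new double[] {1.0, 0.0, 2.5, 4.0});

        // Stream it back, accumulating a sum without materializing the vector.
        double[] sum = {0.0};
        DataInput in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        int n = streamDense(in, (i, v) -> sum[0] += v);
        System.out.println(n + " elements, sum = " + sum[0]);
    }
}
```

The trade-off is that a handler like this sees the elements exactly once and
in order, so it suits single-pass computations such as sums or dot products
against data that is already in memory.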

For vectors of 1 million elements this isn't too scary, but it starts to
become so under default Hadoop memory settings at around 50-100 MB (or 5-10
million non-zero elements). Stochastic SVD will survive that, but it means less
memory for blocking, and the more blocks you have, the more CPU it requires.
(CPU demand is only linear in the number of blocks, and only in a
significantly smaller part of the computation, so only an insignificant share
of the total CPU flops depends on the number of blocks; but some part still
does.)

Like I said, it would also address the case where rows don't fit in memory
(hence no memory bound for n of A), but the most immediate benefit is to the
speed, scalability, and memory requirements of SSVD in most practical LSI cases.

-Dmitriy

On Mon, Dec 13, 2010 at 10:24 AM, Jake Mannix <jake.mannix@gmail.com> wrote:

> Hey Dmitriy,
>
>  I've also been playing around with a VectorWritable format which is backed
> by a SequenceFile, but I've been focussed on the case where it's essentially
> the entire matrix, and the rows don't fit into memory.  This seems different
> than your current use case, however - you just want (relatively) small
> vectors to load faster, right?
>
>  -jake
>
> On Mon, Dec 13, 2010 at 10:18 AM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > Interesting idea.
> >
> > Would this introduce a new vector type that only allows iterating through
> > the elements once?
> >
> > On Mon, Dec 13, 2010 at 9:49 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > I would like to submit a patch to VectorWritable that allows for
> > > streaming access to vector elements without having to prebuffer all of
> > > them first. (Current code allows for the latter only.)
> > >
> > > That patch would strike down one of the memory usage issues in the
> > > current Stochastic SVD implementation and effectively open up the memory
> > > bound for n of the SVD work. (The value I see is not in opening up the
> > > bound, though, but in being more efficient in memory use, thus
> > > essentially speeding up the computation.)
> > >
> > > If it's ok, I would like to create a JIRA issue and provide a patch for
> > > it.
> > >
> > > Another issue is to provide an SSVD patch that depends on that patch for
> > > VectorWritable.
> > >
> > > Thank you.
> > > -Dmitriy
> > >
> >
>
