mahout-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Sequential access to VectorWritable content proposal.
Date Mon, 13 Dec 2010 19:11:29 GMT
Ted,

Which Iterable are you talking about? The one on Vector? I don't think it
helps, because you have to have the Vector loaded (prebuffered) already, so
you have already taken the memory for it.
I don't think VectorWritable has an Iterable, though.

An Iterator is a pull technique, and in MR it would be possible to do
something like this, but Iterable suggests you could do multiple passes over
the data, which is not possible in the context of MR jobs without prebuffering.

A push technique explicitly suggests that one would not be able to do multiple
passes (as with DocumentHandler in the SAX standard). So even if we implemented
an iterator without prebuffering the vector data, in the context of an MR job
it means you can't have more than one iterator, which I think is conceptually
incoherent (i.e. misleading) with respect to the Iterable contract.
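
To make the push idea concrete, here is a minimal sketch of what I have in
mind. The VectorElementHandler interface, the StreamingVectorReader class and
the toy wire format (an int count followed by index/value pairs) are all
hypothetical names and assumptions made up for illustration; none of this is
the existing VectorWritable API or its actual serialized format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical callback, in the spirit of a SAX handler: elements are pushed
// to it one at a time, so a second pass is simply not expressible.
interface VectorElementHandler {
    void onElement(int index, double value) throws IOException;
}

final class StreamingVectorReader {
    // Assumed toy format: int count, then count pairs of (int index, double value).
    // Reads straight from the input without building a Vector in memory.
    static void read(DataInput in, VectorElementHandler handler) throws IOException {
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            int index = in.readInt();
            double value = in.readDouble();
            handler.onElement(index, value);
        }
    }

    // Helper that writes the same toy format, only so the sketch can be exercised.
    static byte[] encode(int[] indices, double[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(indices.length);
        for (int i = 0; i < indices.length; i++) {
            out.writeInt(indices[i]);
            out.writeDouble(values[i]);
        }
        out.flush();
        return bytes.toByteArray();
    }
}

// Example consumer: accumulates a running sum while holding only O(1) state.
final class SumHandler implements VectorElementHandler {
    double sum;
    int seen;

    @Override
    public void onElement(int index, double value) {
        sum += value;
        seen++;
    }
}
```

A caller would wrap the record's bytes in a DataInputStream and pass it to
StreamingVectorReader.read together with a handler. The point of the design is
that the handler sees each element exactly once, which matches what the
underlying stream can actually deliver.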


Thanks.
-Dmitriy

On Mon, Dec 13, 2010 at 11:05 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> How common is it that a row won't fit in memory?  My experience is that
> essentially all rows that I am interested in will fit in very modest
> amounts of memory, but that row by row handling is imperative.
>
> Is this just gilding the lily?
>
> On Mon, Dec 13, 2010 at 10:24 AM, Jake Mannix <jake.mannix@gmail.com>
> wrote:
>
> > Hey Dmitriy,
> >
> >  I've also been playing around with a VectorWritable format which is
> > backed by a SequenceFile, but I've been focussed on the case where it's
> > essentially the entire matrix, and the rows don't fit into memory.  This
> > seems different than your current use case, however - you just want
> > (relatively) small vectors to load faster, right?
> >
> >  -jake
> >
> > On Mon, Dec 13, 2010 at 10:18 AM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> > > Interesting idea.
> > >
> > > Would this introduce a new vector type that only allows iterating
> > > through the elements once?
> > >
> > > On Mon, Dec 13, 2010 at 9:49 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I would like to submit a patch to VectorWritable that allows for
> > > > streaming access to vector elements without having to prebuffer all
> > > > of them first. (The current code allows for the latter only.)
> > > >
> > > > That patch would allow us to strike down one of the memory usage
> > > > issues in the current Stochastic SVD implementation and effectively
> > > > open up the memory bound for n of the SVD work. (The value I see is
> > > > not to open up the bound, though, but just to be more efficient in
> > > > memory use, thus essentially speeding up the computation.)
> > > >
> > > > If it's ok, i would like to create a JIRA issue and provide a patch
> > > > for it.
> > > >
> > > > Another issue is to provide an SSVD patch that depends on that patch
> > > > for VectorWritable.
> > > >
> > > > Thank you.
> > > > -Dmitriy
> > > >
> > >
> >
>
