mahout-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Sequential access to VectorWritable content proposal.
Date Tue, 14 Dec 2010 08:25:16 GMT
Jake, yes, I just looked at the VectorWritable (VW) code again. I don't see
any problem equipping it with an element handler, even when the vector is a
sparse random-access one.
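
A minimal sketch of what a push-style element handler could look like,
assuming a made-up wire format (an element count followed by index/value
pairs). The names ElementHandler, readElements, encode and projectRow are
hypothetical and are not part of Mahout or of the patch discussed in this
thread. The handler accumulates one row of Y = A * Omega, a simplified
stand-in for the SSVD projection step, to show why arrival order does not
matter for that kind of computation (up to floating-point rounding).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class PushVectorSketch {

    /** Hypothetical callback; not part of Mahout's API. */
    public interface ElementHandler {
        void onElement(int index, double value);
    }

    /**
     * Reads the made-up format (element count, then index/value pairs) and
     * pushes each element to the handler instead of building a vector.
     */
    public static void readElements(DataInput in, ElementHandler handler)
            throws IOException {
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            int index = in.readInt();
            double value = in.readDouble();
            handler.onElement(index, value);
        }
    }

    /** Encodes index/value pairs in the made-up format, in the order given. */
    public static byte[] encode(int[] indices, double[] values) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(indices.length);
            for (int i = 0; i < indices.length; i++) {
                out.writeInt(indices[i]);
                out.writeDouble(values[i]);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Computes one row of Y = A * Omega by adding value * Omega[index][*]
     * as each element arrives. Only the output row is held in memory.
     */
    public static double[] projectRow(byte[] serializedRow, double[][] omega) {
        final double[] y = new double[omega[0].length];
        DataInput in = new DataInputStream(new ByteArrayInputStream(serializedRow));
        try {
            readElements(in, (index, value) -> {
                for (int k = 0; k < y.length; k++) {
                    y[k] += value * omega[index][k];
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return y;
    }

    public static void main(String[] args) {
        double[][] omega = {{1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0}};
        // Elements are deliberately written out of index order.
        byte[] row = encode(new int[] {2, 0}, new double[] {2.0, 1.0});
        double[] y = projectRow(row, omega);
        System.out.println(y[0] + " " + y[1]); // prints "11.0 14.0"
    }
}
```

The input row is never materialized here: memory use is bounded by the
width of Omega rather than by the length of the row. An algorithm that does
depend on increasing index order would additionally need the serialized
form to be sequential.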



So, driven by practical delivery goals, I'd have to push on with hacking
VectorWritable for now. It's not clear whether this will significantly
decrease SSVD running time, but I think it will help me get through the
current RAM starvation nuisances. I remain fairly unconvinced that there
has been a really strong argument so far against providing vector
preprocessing capability of some sort, or that the push preprocessor
technique is somehow not-so-elegant (or should I say ugly?) in itself. It's
a pattern that is well understood and widely used in the community,
especially as long as the javadoc is very explicit about it. I gather
there's little interest in this sort of memory allocation improvement, so
I'll keep this hack private, no problem, until some sort of spliced record
or blocked matrix format has matured. The existing patch is certainly very
usable as is, especially if one can afford some extra memory in child
processes.


I fully support the notion that eventually some sort of blocking format is
the most promising for 1Gb vectors, but I think for my purposes I can get
away with just this and some MR optimizations for split sizes and associated
I/O. But the more I think about it, the more convinced I become that SSVD on
wider matrices is preferable to SSVD on tall matrices, as it would reduce
the amount of data circulating through the shuffle-and-sort circuitry as
well as the number of blocks in SSVD computations. So it kind of makes wider
matrices more scalable. Although I don't think that a billion-row-tall
matrix is a problem for anything but CPU in the context of SSVD either.


thanks.
-Dmitriy

On Mon, Dec 13, 2010 at 4:37 PM, Jake Mannix <jake.mannix@gmail.com> wrote:

> Check the source for VectorWritable, I'm pretty sure it serializes
> in the order of the nonDefaultIterator(), which for SASVectors is in order,
> so while these are indeed non-optimal for random access and mutating
> operations, that is indeed the tradeoff you have to make when picking
> your vector impl.
>
>  -jake
>
> On Mon, Dec 13, 2010 at 4:30 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> wrote:
>
> > Yes, it should be. I thought Ted implied VectorWritable does it only this
> > way and no other.
> >
> > If we can differentiate, I'd rather do it. That implies that if you save
> > in one format (non-sequential), we'd support it with the caveat that it's
> > subpar in certain cases, whereas if you format the input sequentially,
> > we'd eliminate the vector prebuffering stage. Yes, that will work. Thank
> > you, Jake.
> >
> > -d
> >
> >
> > On Mon, Dec 13, 2010 at 4:26 PM, Jake Mannix <jake.mannix@gmail.com>
> > wrote:
> >
> > > Dmitriy,
> > >
> > >  You should be able to specify that your matrices be stored in
> > > SequentialAccessSparseVector format if you need to.  This is
> > > almost always the right thing for HDFS-backed matrices, because
> > > HDFS is write-once, and SASVectors are optimized for read-only
> > > sequential access, which is your exact use case, right?
> > >
> > >  -jake
> > >
> > > On Mon, Dec 13, 2010 at 4:21 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > > wrote:
> > >
> > > > I don't think sequentiality is a requirement in the case I am working
> > > > on. However, let me peek at the code first. I am guessing it is some
> > > > form of a near-perfect hash, in which case it may not be possible to
> > > > read it in parts at all. Which would be bad, indeed. I would then need
> > > > to find a completely different input format to handle my case.
> > > >
> > > > On Mon, Dec 13, 2010 at 4:01 PM, Ted Dunning <ted.dunning@gmail.com>
> > > > wrote:
> > > >
> > > > > I don't think that sequentiality is part of the contract.
> > > > > RandomAccessSparseVectors are likely to produce disordered values
> > > > > when serialized, I think.
> > > > >
> > > > > On Mon, Dec 13, 2010 at 1:48 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > I will have to look at the details of VectorWritable to make sure
> > > > > > all cases are covered (I only took a very brief look so far). But
> > > > > > as long as it is able to produce elements in order of increasing
> > > > > > index, the push technique will certainly work for most algorithms
> > > > > > (and in some cases, notably with SSVD, it would work even if it
> > > > > > produces the data in a non-sequential way).
> > > > > >
> > > > >
> > > >
> > >
> >
>
