mahout-dev mailing list archives

From Sean Owen <>
Subject Re: Sequential access to VectorWritable content proposal.
Date Tue, 14 Dec 2010 08:17:53 GMT
Yeah, I have the same sort of problem with recommenders. Once in a while you
have a very, very popular item. In those cases I use some smart sampling to
simply throw out some data. For my use cases, that has virtually no effect
on the result, so it's valid. Not sure if you can use the same approach.
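
The message does not say which sampling scheme is used. As one possible illustration, here is a minimal sketch that caps the number of preferences kept per item with reservoir sampling, so a very popular item cannot dominate. The function name `cap_preferences`, the `(user_id, item_id)` input shape, and the `max_per_item` parameter are all hypothetical and are not taken from Mahout.

```python
import random


def cap_preferences(prefs, max_per_item, seed=0):
    """Keep at most max_per_item users per item (hypothetical helper).

    prefs is an iterable of (user_id, item_id) pairs. Each item gets its own
    reservoir, so every user of an over-popular item has the same chance of
    being kept, while rare items are left untouched.
    """
    rng = random.Random(seed)
    kept = {}  # item_id -> sampled list of user_ids
    seen = {}  # item_id -> number of preferences seen so far
    for user_id, item_id in prefs:
        count = seen.get(item_id, 0) + 1
        seen[item_id] = count
        bucket = kept.setdefault(item_id, [])
        if len(bucket) < max_per_item:
            # Reservoir not full yet: always keep.
            bucket.append(user_id)
        else:
            # Replace a random slot with probability max_per_item / count.
            slot = rng.randrange(count)
            if slot < max_per_item:
                bucket[slot] = user_id
    return kept
```

In a streaming setting this needs only one pass over the data and memory bounded by `max_per_item` per item, which is why reservoir sampling was picked for the sketch; other down-sampling schemes would work as well.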

On Tue, Dec 14, 2010 at 1:21 AM, Dmitriy Lyubimov <> wrote:

> Sean,
> Absolutely agree. There's no such thing as a 1 GB HDFS block. The max I ever
> saw is 128 MB, but most commonly 64 MB.
> Imagine for a second loading 1 GB records into a mapper. Assuming a
> sufficiently large number of nodes, the expected amount of collocated
> data in this case is approximately 32 MB; everything else comes from some
> other node.
> So we start our job with moving ~97% of the data around just to make it
> available to the mappers.
> There are two considerations, however, that alleviated that concern
> somewhat, at least in my particular case:
> 1) Not every vector is 1 GB. In fact, only 0.1% of vectors are perhaps
> 100-150 MB at most. But I still have to yield 50% of mapper RAM to the VW
> just to make sure that 0.1% of cases goes through. So the case I am making
> is not for 1 GB vectors; I am saying that the problem is already quite
> detrimental to algorithm running time at much smaller sizes.
> 2) Stochastic SVD is CPU bound. I did not actually run it with 1 GB vectors,
> nor do I plan to. But if I did, I strongly suspect it is not I/O that would
> be the bottleneck.
> On Mon, Dec 13, 2010 at 5:01 PM, Sean Owen <> wrote:
> > This may be a late or ill-informed comment but --
> >
> > I don't believe there's any issue with VectorWritable per se, no. Hadoop
> > most certainly assumes that one Writable can fit into RAM. A 1GB Writable
> > is just completely incompatible with how Hadoop works. The algorithm would
> > have to be parallelized more then.
> >
> > Yes that may mean re-writing the code to deal with small Vectors and such.
> > That probably doesn't imply a change to VectorWritable but to the jobs
> > using it.
> >
> >

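
The "~97%" figure in Dmitriy's message follows from his own numbers. A small sketch of that arithmetic, assuming 1 GB is counted as 1024 MB and taking the 32 MB local estimate as given in the message (it is not derived here); the function name `remote_fraction` is hypothetical:

```python
def remote_fraction(record_mb, local_mb):
    """Fraction of a record that must be fetched from other nodes."""
    return 1.0 - local_mb / record_mb


# 1 GB record, ~32 MB expected to be local to the mapper's node.
frac = remote_fraction(record_mb=1024, local_mb=32)
print(f"{frac:.1%} of the record is read over the network")
```

With these inputs the remote share is 0.96875, which matches the roughly 97% quoted in the thread.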