mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Dunning <>
Subject Re: Out-of-core random forest implementation
Date Fri, 08 Mar 2013 22:35:17 GMT
The big cost in map-reduce iteration isn't just startup.  It is that the
input has to be read from disk and the output written to same.  Were it to
stay in memory, things would be vastly faster.

Also, startup costs are still pretty significant.  Even on MapR, one of the
major problems in setting the recent minute-sort record was getting things
to start quickly.  Just setting the heartbeat faster doesn't work on
ordinary Hadoop because there is a global lock that begins to starve the
system.  We (our guys, not me) had to seriously hack the job tracker to
move things out of that critical section.  At that point, we were able to
shorten the heartbeat interval to 800 ms (which on 2000 nodes means >2400
heartbeats per second).  The startup and cleanup tasks are also single

It might be plausible to shorten to this degree and further on a small
cluster.  But iteration is still very painful.

On Fri, Mar 8, 2013 at 9:03 AM, Sean Owen <> wrote:

> Weighing in late here -- it's an interesting question whether M/R is a good
> fit for iterative processes. Today you can already optimize away most of
> the startup overhead for a job, by setting Hadoop to reuse JVMs, increasing
> heart-beat rate, etc. I know Ted can add more about how MapR tunes that
> further. So I'm not sure it's the overhead of starting jobs per se.
> Unless your iterations are < 1 minute or something. And they may be in some
> situations. I don't think that's true of, say, ALS or even the RF
> implementation I have in mind. Iterations may be really fast at small scale
> too, but that's not what Hadoop is for. Or unless your iterations have
> significant init overhead, like loading data. That's a good point too.
> I think the problem comes if the process is not naturally iterative -- if
> it's parallel, but, the workers need not stop to sync up, then forcing them
> into an iterative process just wastes time. Most time  you're waiting for a
> straggler worker, needlessly. In this regard, I am not sure that a BSP
> paradigm is any better? But not sure anyone was pushing BSP.
> But I think RF could fit an iterative scheme well. a) I haven't thought it
> through completely or tried it, and b) I can imagine reasons it may work in
> theory but not in practice. But is this not roughly how you'd do it --
> maybe this is what's already being suggested? Roughly:
> Break up the data by feature. (Subdivide features if needed.) Map over all
> the data and distribute the single-feature data. Reducers compute an
> optimal split / decision rule and output their rules. Next iteration:
> mappers read the rules and choose a next random feature for each rule. They
> map the data, apply each rule to each record, apply the next choice of
> feature, and output the single feature again. Reducers will receive, in one
> input group, again all the data they need to compute another split. They
> output that split. Repeat.
> There's a lot of devil in the details there. But this seems like a fine way
> to build a tree level by level without sending all the data all over the
> place. I suppose the problem is uneven data distribution, and that some
> decision trees will go deep. But quickly your number of input groups gets
> quite large as they get smaller, so the it ought to even out over reducers
> (?).
> To answer Marty, I don't this project will never change much from what it
> is now. It's not even properly on Hadoop 0.20.x, much less 2.x. An
> MRv2-based project is a different project, as it would properly be a total
> rewrite. Something to start thinking about and start thinking about drawing
> a line under what's here IMHO.
> On Fri, Mar 8, 2013 at 1:36 PM, Marty Kube <
>> wrote:
> > What about using one map reduce job per iteration?  The models you load
> > into distributed cache are the model from the last round and the reducer
> > can emit the expanded model.  We are presumably working with large data
> > sets so I would not expect start-up latency to be an issue.
> >
> >

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message