mahout-dev mailing list archives

From Sean Owen
Subject Re: Out-of-core random forest implementation
Date Fri, 08 Mar 2013 23:03:35 GMT
Oh, certainly. I was thinking in the realm of distributed systems only.
Surely serialization across a network is a necessary step in anything like
that. Serializing to local disk first, or a distributed file system, may
not be. The local writes may not matter. But wouldn't YARN-type setups
still be writing to distributed storage?

My broad hunch is that communicating the same amount of data faster
probably doesn't yield an order-of-magnitude speedup, but a different
paradigm that lets you transmit less data does. I was musing about whether
M/R forces an RF implementation into a hopelessly huge amount of I/O, and
I'm not yet sure it does.

These days I continue to want a better sense not just of whether entire
paradigms are more/less suitable and how and when and why, but when two
different concepts in the same paradigm are qualitatively different or just
a different point on a tradeoff curve, optimizing for a different type of

On Fri, Mar 8, 2013 at 10:35 PM, Ted Dunning wrote:

> The big cost in map-reduce iteration isn't just startup.  It is that the
> input has to be read from disk and the output written to same.  Were it to
> stay in memory, things would be vastly faster.
> Also, startup costs are still pretty significant.  Even on MapR, one of the
> major problems in setting the recent minute-sort record was getting things
> to start quickly.  Just setting the heartbeat faster doesn't work on
> ordinary Hadoop because there is a global lock that begins to starve the
> system.  We (our guys, not me) had to seriously hack the job tracker to
> move things out of that critical section.  At that point, we were able to
> shorten the heartbeat interval to 800 ms (which on 2000 nodes means >2400
> heartbeats per second).  The startup and cleanup tasks are also single
> threaded.
> It might be plausible to shorten to this degree and further on a small
> cluster.  But iteration is still very painful.
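
As a rough check on the heartbeat figure quoted above, here is a minimal sketch of the arithmetic. It assumes every node sends exactly one heartbeat per interval; the function name is hypothetical and not part of Hadoop or MapR.

```python
def heartbeats_per_second(nodes: int, interval_ms: float) -> float:
    """Aggregate heartbeat rate arriving at the job tracker.

    Assumption: each node sends exactly one heartbeat per interval.
    """
    return nodes * 1000 / interval_ms


# Figures from the quoted message: 2000 nodes, 800 ms heartbeat interval.
rate = heartbeats_per_second(2000, 800)
print(rate)  # 2500.0, consistent with the ">2400" in the message
```

Under that assumption, the load on the job tracker grows linearly with cluster size and inversely with the interval, which fits the point that a global lock becomes the bottleneck as the interval is shortened.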
