Oh, certainly. I was thinking in the realm of distributed systems only.
Surely serialization across a network is a necessary step in anything like
that. Serializing to local disk first, or to a distributed file system, may
not be; the local writes may not matter. But wouldn't YARN-type setups
still be writing to distributed storage?
My broad hunch is that communicating the same amount of data faster
probably doesn't get you an order of magnitude, but a different paradigm
that lets you transmit less data does. I was musing about whether M/R
forces a hopelessly huge amount of I/O on an implementation of RF, and
I'm not sure it does, not yet.
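
To make the I/O point concrete, here is the kind of driver loop I have in
mind: a minimal sketch, assuming Hadoop 2.x's Job API, with made-up paths
and the mapper/reducer setup omitted. Each pass re-reads its entire input
from the DFS and writes its entire output back before the next pass can
start.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IterativeDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path current = new Path("/data/iter-0");  // hypothetical path
        for (int i = 1; i <= 10; i++) {
          Job job = Job.getInstance(conf, "iteration-" + i);
          job.setJarByClass(IterativeDriver.class);
          // (mapper/reducer/key/value classes would be set here)
          // Every pass re-reads the previous pass's output from the DFS...
          FileInputFormat.addInputPath(job, current);
          Path next = new Path("/data/iter-" + i);
          // ...and writes its own output back before the next pass starts.
          FileOutputFormat.setOutputPath(job, next);
          if (!job.waitForCompletion(true)) {
            System.exit(1);
          }
          current = next;  // nothing stays in memory between iterations
        }
      }
    }
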
These days I continue to want a better sense not just of whether entire
paradigms are more or less suitable, and how and when and why, but of when
two different concepts in the same paradigm are qualitatively different
versus just different points on a tradeoff curve, each optimizing for a
different type of problem.
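
A tiny example of the same-paradigm case: Hadoop's combiner hook transmits
less data over the network without changing the programming model at all,
by pre-aggregating map output before the shuffle. A sketch, with a made-up
counting job, using the standard Reducer-as-combiner API:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sums the counts for each key on the map side, so a mapper emits one
    // record per key instead of one record per occurrence.
    public class SumCombiner
        extends Reducer<Text, LongWritable, Text, LongWritable> {
      private final LongWritable total = new LongWritable();

      @Override
      protected void reduce(Text key, Iterable<LongWritable> values,
          Context ctx) throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
          sum += v.get();
        }
        total.set(sum);
        ctx.write(key, total);
      }
    }

    // In the driver: job.setCombinerClass(SumCombiner.class);

Whether that sort of tweak ever buys an order of magnitude, rather than a
constant factor, is exactly the question.
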
On Fri, Mar 8, 2013 at 10:35 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> The big cost in map-reduce iteration isn't just startup. It is that the
> input has to be read from disk and the output written to same. Were it to
> stay in memory, things would be vastly faster.
>
> Also, startup costs are still pretty significant. Even on MapR, one of the
> major problems in setting the recent minute-sort record was getting things
> to start quickly. Just setting the heartbeat faster doesn't work on
> ordinary Hadoop because there is a global lock that begins to starve the
> system. We (our guys, not me) had to seriously hack the job tracker to
> move things out of that critical section. At that point, we were able to
> shorten the heartbeat interval to 800 ms (which on 2000 nodes means >2400
> heartbeats per second). The startup and cleanup tasks are also single
> threaded.
>
> It might be plausible to shorten to this degree and further on a small
> cluster. But iteration is still very painful.
>
>