hadoop-mapreduce-user mailing list archives

From Eric Sammer <esam...@cloudera.com>
Subject Re: Hadoop over the internet
Date Tue, 20 Apr 2010 21:13:30 GMT
On Tue, Apr 20, 2010 at 12:53 PM,  <altanis@ceid.upatras.gr> wrote:
> Thank you for your answers. I also thought bandwidth would be the main
> problem. However the thought wasn't so much a SETI-type approach, but a
> cooperation between large datacenters. Do you think things would be
> different if you assume that the bandwidth of the participants is
> abundant?
> @Eric Sammer
> Could you elaborate on pipeline replication a bit more? The way I
> understood it, the input is copied to one DataNode from the client, and
> then to another from the first DataNode and so on.

Yes, that is what is meant by pipeline replication: the client writes to
A, A writes to B, B writes to C. The way I understand it, the data is
streamed through the pipeline (unless a datanode dies) rather than
transferred a full block at a time. But otherwise, your description is
correct.
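To make the streaming point concrete, here's a toy model (names and structure are mine, not HDFS's actual implementation): each packet is forwarded down the chain as it arrives, so no node waits for the whole block before the next copy starts.

```python
# Toy model of pipeline replication: the client splits a block into
# small packets and each datanode in the chain receives a packet and
# forwards it downstream, rather than waiting for the full block.

def pipeline_write(block, datanodes, packet_size=4):
    """Push `block` through the datanode chain one packet at a time."""
    stores = {dn: bytearray() for dn in datanodes}
    for i in range(0, len(block), packet_size):
        packet = block[i:i + packet_size]
        for dn in datanodes:  # A -> B -> C for every packet
            stores[dn].extend(packet)
    return {dn: bytes(buf) for dn, buf in stores.items()}

replicas = pipeline_write(b"some block data", ["A", "B", "C"])
assert all(r == b"some block data" for r in replicas.values())
```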

> This looks like it can
> be easily amended, though.

Not sure what you mean by amended. There's no user-facing API to alter
this behavior of which I'm aware.

> One could also increase the heartbeat timeout value, I suppose. That might
> lead to undetected failures though.

That's a bad idea, yes.

> Just to clarify, the scenario I have in mind is this: a large company
> serving a cloud service has many datacenters, and when one of them is
> "full" computation-power-wise, they might want to spread new computational
> tasks to many datacenters.

Again, this won't work the way you want. The reason is that Hadoop
doesn't understand "data center local" data. It only knows about
node-local (same machine), rack-local (same rack), and everything else.
If you were to configure 3 data centers with 10 racks each, Hadoop
would see all 30 racks as equally viable destinations for block
placement. It wouldn't fill one data center and then spill over to the
next, because it can't tell which racks belong to which data centers.
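For context, Hadoop's rack awareness comes from a topology script you point at via `net.topology.script.file.name`: given node addresses, it prints a rack path for each. A minimal sketch (the subnet-to-rack mapping here is made up): even if you encode a data center in the path, the default block placement policy just sees each distinct path as another rack, with no higher-level grouping.

```python
#!/usr/bin/env python3
# Sketch of a Hadoop topology script: given datanode IPs/hostnames as
# arguments, print a rack path for each. The /dc1 and /dc2 prefixes
# are cosmetic as far as default block placement is concerned:
# /dc1/rack1 and /dc2/rack1 are simply two distinct racks, with no
# notion that they live in different data centers.
import sys

# Hypothetical subnet-to-rack mapping, purely for illustration.
RACKS = {
    "10.1.1.": "/dc1/rack1",
    "10.1.2.": "/dc1/rack2",
    "10.2.1.": "/dc2/rack1",
}

def rack_of(host):
    for prefix, rack in RACKS.items():
        if host.startswith(prefix):
            return rack
    return "/default-rack"

if __name__ == "__main__":
    print(" ".join(rack_of(h) for h in sys.argv[1:]))
```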

It *could* support this at some point, and there does seem to be some
interest in it, but today it doesn't. It's just not as simple as
plugging in off-network IPs and hoping Hadoop does the right thing;
there are tons of corner cases that don't work all that well when you
start doing this. Data locality is one problem, but there's also the
fact that there's now a significant choke point between any two nodes
in different data centers: a chain of routers. Even if it's a
high-bandwidth connection, the failure semantics are very different.
Without making Hadoop aware of the multi-datacenter case, the failure
of a single router could easily lose all replicas of a large number of
blocks, creating a huge hole in the data.
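A rough simulation makes the exposure visible. This models the default placement pattern (first replica on the writer's rack, second and third together on one other randomly chosen rack) across 3 hypothetical data centers of 10 racks each; since placement is rack-aware but not data-center-aware, a sizable fraction of blocks end up with every replica inside one data center, all behind one router.

```python
# Rough model of default HDFS placement: replica 1 on the writer's
# rack, replicas 2 and 3 sharing one other randomly chosen rack.
# Racks are spread over 3 "data centers" that Hadoop can't see, so
# some blocks land entirely inside a single data center: cut that
# data center's router and those blocks vanish from the rest of the
# cluster's view.
import random

random.seed(42)
DCS, RACKS_PER_DC = 3, 10
racks = [(dc, r) for dc in range(DCS) for r in range(RACKS_PER_DC)]

def place_block(writer_rack):
    remote = random.choice([r for r in racks if r != writer_rack])
    return [writer_rack, remote, remote]  # replicas 2 and 3 share a rack

blocks = 10_000
single_dc = 0
for _ in range(blocks):
    writer = random.choice(racks)
    if len({dc for dc, _ in place_block(writer)}) == 1:
        single_dc += 1

print(f"{single_dc}/{blocks} blocks have every replica in one data center")
```

With 9 of the 29 other racks sharing the writer's data center, roughly 30% of blocks end up single-data-center in this model.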

Again, it's about more than just performance here.
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com
