hadoop-mapreduce-user mailing list archives

From alta...@ceid.upatras.gr
Subject Re: Hadoop over the internet
Date Tue, 20 Apr 2010 16:53:56 GMT
Thank you for your answers. I also thought bandwidth would be the main
problem. However, what I had in mind wasn't so much a SETI-type approach
as a cooperation between large datacenters. Do you think things would be
different if you assume that the bandwidth of the participants is

@Eric Sammer
Could you elaborate on pipeline replication a bit more? The way I
understood it, the input is copied from the client to one DataNode, then
from that DataNode to a second one, and so on. This looks like it can
be easily amended, though.
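
The difference between the chained write and a direct fan-out from the
client can be sketched with a toy model. This is not HDFS code: the
function names and the fan-out alternative are made up for illustration,
and the 64 MB figure is the default block size as far as I know.

```python
# Toy model (not HDFS code): bytes each node has to SEND to write one block.

def pipeline_traffic(block_bytes, replicas):
    """Chained write: client -> DN1 -> DN2 -> ...; every sender forwards once."""
    # The last DataNode in the chain only receives, so it is not a sender.
    senders = ["client"] + [f"DN{i}" for i in range(1, replicas)]
    return {name: block_bytes for name in senders}

def fanout_traffic(block_bytes, replicas):
    """Hypothetical alternative: the client uploads to every replica itself."""
    return {"client": block_bytes * replicas}

block = 64 * 1024 * 1024  # assumed default block size
print(pipeline_traffic(block, 3))
print(fanout_traffic(block, 3))
```

With the chain, the client uploads each block once, but the write only
completes after it has crossed every hop in sequence. With fan-out the
client's uplink carries the block once per replica.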

One could also increase the heartbeat timeout value, I suppose. That might
lead to undetected failures though.
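
For a sense of scale: as far as I can tell, the NameNode marks a DataNode
dead after roughly 2 * heartbeat.recheck.interval + 10 *
dfs.heartbeat.interval. The sketch below only evaluates that formula. The
property names and defaults (3 s and 5 min) are my reading of the
0.20-era code and should be checked against the version in use.

```python
# Assumed formula for how long the NameNode waits before declaring a
# DataNode dead; defaults are assumptions taken from 0.20-era Hadoop.

def datanode_expiry_seconds(heartbeat_interval_s=3, recheck_interval_ms=300_000):
    expiry_ms = 2 * recheck_interval_ms + 10 * heartbeat_interval_s * 1000
    return expiry_ms / 1000

print(datanode_expiry_seconds())                         # 630.0
print(datanode_expiry_seconds(heartbeat_interval_s=30))  # 900.0
```

So a dead node already goes unnoticed for about ten minutes with the
assumed defaults, and raising either value stretches that window further.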

Just to clarify, the scenario I have in mind is this: a large company
offering a cloud service has many datacenters, and when one of them is
"full" in terms of computing power, they might want to spread new
computational tasks across several datacenters.


> I think the biggest issue would be upstream bandwidth and latency.  If the
> thought was to use a SETI-type approach, most users wouldn't have the
> necessary upstream bandwidth to support the DFS.  It would be likely that
> a few local desktop machines would significantly outpace a much larger
> DSL/cable/etc. based "cluster."
> Nick
> On Sat, Apr 17, 2010 at 12:43 PM, Eric Sammer <esammer@cloudera.com>
> wrote:
>> This is likely to fail, yes. The reason why is because you'll almost
>> certainly encounter timeouts in the heartbeats between data nodes and
>> the name node and the task trackers and job tracker. Also, Hadoop uses
>> pipe line replication between data nodes (client -> DN1 -> DN2 -> ...)
>> which will also encounter timeouts or very poor performance. On the
>> processing side, Hadoop doesn't understand the difference between data
>> centers, only racks, and is likely to make bad decisions about
>> spreading work around such that a minimal amount of data is passed
>> over public connections. Then there's the security component (i.e.
>> there isn't any, really)...
>> There are a lot of reasons not to do this right now.
>> On Sat, Apr 17, 2010 at 4:29 AM,  <altanis@ceid.upatras.gr> wrote:
>> > Hello,
>> >
>> > I want to investigate the matter of running hadoop MapReduce jobs
>> > over the Internet. I don't mean in private computers, all of them in
>> > different places, rather a collection of datacenters, connected to
>> > each other over the Internet.
>> >
>> > Would that fail? If yes, how and why? What issues would arise?
>> >
>> --
>> Eric Sammer
>> phone: +1-917-287-2675
>> twitter: esammer
>> data: www.cloudera.com
