hadoop-mapreduce-user mailing list archives

From Nick Jones <darel...@gmail.com>
Subject Re: Hadoop over the internet
Date Sat, 17 Apr 2010 17:53:13 GMT
I think the biggest issues would be upstream bandwidth and latency.  If the
idea is a SETI@home-style approach, most users wouldn't have the upstream
bandwidth needed to support the DFS.  A few local desktop machines would
likely significantly outpace a much larger DSL/cable/etc. based "cluster."
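
A rough back-of-the-envelope sketch of the bandwidth point, in Python. All
figures here are illustrative assumptions rather than measurements: a 64 MB
block, a 1 Mbit/s home upstream link, and a 1 Gbit/s LAN. The function name
is made up for the example.

```python
def transfer_seconds(size_mb: float, link_mbit_per_s: float) -> float:
    """Seconds to push size_mb megabytes over a link.

    Ignores latency and protocol overhead, so real transfers are slower.
    """
    return size_mb * 8 / link_mbit_per_s


BLOCK_MB = 64            # assumed block size
HOME_UPSTREAM = 1.0      # Mbit/s, assumed DSL/cable upstream
LAN = 1000.0             # Mbit/s, assumed gigabit LAN

home = transfer_seconds(BLOCK_MB, HOME_UPSTREAM)
lan = transfer_seconds(BLOCK_MB, LAN)

print(f"home upstream: {home:.1f} s per block")
print(f"LAN:           {lan:.3f} s per block")
print(f"slowdown:      {home / lan:.0f}x")
```

Under these assumptions a single block takes minutes to leave a home
connection but well under a second on the LAN, before replication is even
considered.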


On Sat, Apr 17, 2010 at 12:43 PM, Eric Sammer <esammer@cloudera.com> wrote:

> This is likely to fail, yes. You'll almost certainly encounter timeouts
> in the heartbeats between the data nodes and the name node, and between
> the task trackers and the job tracker. Also, Hadoop uses pipeline
> replication between data nodes (client -> DN1 -> DN2 -> ...), which
> will also encounter timeouts or very poor performance. On the
> processing side, Hadoop doesn't understand the difference between data
> centers, only racks, so it is likely to make bad decisions about how to
> spread work around so that a minimal amount of data is passed over
> public connections. Then there's the security component (i.e.
> there isn't any, really)...
> There are a lot of reasons not to do this right now.
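
To illustrate the pipeline replication point, here is a toy model in Python.
It is only a sketch under stated assumptions, not how HDFS actually schedules
packets: it assumes the pipeline streams at the speed of its slowest hop and
pays each hop's one-way latency once. The function name and all link figures
are hypothetical.

```python
def pipeline_write_seconds(block_mb, hop_mbit_per_s, hop_latency_s):
    """Estimate the time to write one block through client -> DN1 -> DN2 -> ...

    hop_mbit_per_s: bandwidth of each hop in the pipeline, in Mbit/s
    hop_latency_s:  one-way latency of each hop, in seconds
    """
    bottleneck = min(hop_mbit_per_s)        # slowest hop limits the whole pipeline
    streaming = block_mb * 8 / bottleneck   # time to stream the block
    fill = sum(hop_latency_s)               # time for the first bytes to reach the last node
    return streaming + fill


# Assumed: all three hops inside one rack (gigabit, 0.5 ms each).
in_rack = pipeline_write_seconds(64, [1000, 1000, 1000], [0.0005, 0.0005, 0.0005])

# Assumed: the middle hop crosses the Internet (50 Mbit/s, 40 ms).
cross_dc = pipeline_write_seconds(64, [1000, 50, 1000], [0.0005, 0.04, 0.0005])

print(f"in-rack pipeline:  {in_rack:.3f} s")
print(f"cross-datacenter:  {cross_dc:.3f} s")
```

Even in this optimistic model, one slow hop drags the whole write down to
that hop's speed; it leaves out retransmissions, acknowledgements, and the
timeouts mentioned above, which would make the real picture worse.
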
> On Sat, Apr 17, 2010 at 4:29 AM,  <altanis@ceid.upatras.gr> wrote:
> > Hello,
> >
> > I want to investigate running Hadoop MapReduce jobs over the
> > Internet. I don't mean on private computers, all of them in different
> > places, but rather on a collection of datacenters connected to each
> > other over the Internet.
> >
> > Would that fail? If yes, how and why? What issues would arise?
> >
> --
> Eric Sammer
> phone: +1-917-287-2675
> twitter: esammer
> data: www.cloudera.com
