hadoop-common-user mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: Stretched HDFS cluster
Date Wed, 16 Sep 2009 11:27:26 GMT
Touretsky, Gregory wrote:
> Hi,
> 
> Does anyone have an experience running HDFS cluster stretched over
> high-latency WAN connections?
> Any specific concerns/options/recommendations?
> I'm trying to setup the HDFS cluster with the nodes located in the US,
> Israel and India - considering it as a potential solution for
> cross-site data sharing...
> 

I would back up Todd here and say "don't do it yet". I think there are 
some minor placeholders in the rack hierarchy for an explicit notion 
of different sites, but nobody has done the work yet. Cross-datacentre 
data balancing and work scheduling are complex, and all the code in 
Hadoop, ZooKeeper, etc. is built on the assumption that latency is low, 
all machines' clocks are going forward at roughly the same rate, the 
network is fairly reliable, routers are unlikely to corrupt data, etc.
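
For what it's worth, the rack hierarchy is fed by a topology script: 
Hadoop passes it one or more host addresses as arguments and reads back 
one network path per address. Below is a minimal sketch of such a script 
with a site level placed in front of the rack. The subnets, site names 
and default path are hypothetical, and whether the rest of Hadoop does 
anything useful with the extra level is exactly the work that has not 
been done yet.

```python
#!/usr/bin/env python3
"""Sketch of a topology script that reports /site/rack paths.

The subnets, site names and default path below are illustrative
assumptions, not values taken from any real cluster.
"""
import ipaddress
import sys

# Hypothetical mapping from subnet to "/site/rack" location.
LOCATIONS = {
    "10.1.0.0/16": "/us/rack1",
    "10.2.0.0/16": "/israel/rack1",
    "10.3.0.0/16": "/india/rack1",
}

# Fallback for unknown hosts; kept at the same depth as the other paths.
DEFAULT_LOCATION = "/default-site/default-rack"


def resolve(host):
    """Return the location path for one address, or the default."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP address (e.g. a hostname): this sketch does no DNS lookup.
        return DEFAULT_LOCATION
    for subnet, location in LOCATIONS.items():
        if addr in ipaddress.ip_network(subnet):
            return location
    return DEFAULT_LOCATION


if __name__ == "__main__":
    # One output line per argument, in the same order as the arguments.
    for arg in sys.argv[1:]:
        print(resolve(arg))
```

Every path has the same depth here, so unknown hosts still land 
somewhere consistent in the tree.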

Now, if you do want to do >1 site, it would be a profound and useful 
development. I'd expect the MR scheduler, or even the Pig/Hive code 
generators, to take datacentre locality into account, doing as much 
work per site as possible. The problem of block distribution changes 
too, as you would want 1 copy of each block in the other datacentre. 
Even then, I'd start with sites in a single city, on a MAE or other link 
where bandwidth matters less. Note that (as discussed in the links 
below) on the MAN scale things can start to go wrong in ways that are 
unlikely in a datacentre, and it's those failures that will burn you.
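
To make the block distribution point concrete, here is a rough sketch 
of a site-aware choice of replica targets: one copy goes to a node in 
another site and the rest stay in the writer's site. This is not 
Hadoop's placement policy; the function name, the node-to-site mapping 
and the node names are all made up for illustration.

```python
import random


def choose_targets(nodes, local_site, replication=3, rng=random):
    """Pick replica targets for one block.

    nodes: dict mapping node name -> site name (hypothetical inventory).
    local_site: site of the client writing the block.
    Returns a list of distinct node names, at most `replication` long.
    """
    local = [n for n, site in nodes.items() if site == local_site]
    remote = [n for n, site in nodes.items() if site != local_site]

    targets = []
    # Reserve one copy for another site, if there is one and we keep >1 copy.
    if remote and replication > 1:
        targets.append(rng.choice(remote))

    # Keep the remaining copies in the local site to avoid WAN traffic.
    need = replication - len(targets)
    targets.extend(rng.sample(local, min(need, len(local))))

    # If the local site is too small, top up from the other remote nodes.
    leftover = [n for n in remote if n not in targets]
    need = replication - len(targets)
    targets.extend(rng.sample(leftover, min(need, len(leftover))))
    return targets


# Usage with a made-up five-node cluster spread over three sites.
cluster = {"us1": "us", "us2": "us", "us3": "us", "il1": "israel", "in1": "india"}
picked = choose_targets(cluster, "us")
```

A real policy would also have to weigh rack placement within each site, 
free space and the cost of the cross-site write; the sketch ignores all 
of that.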

Worth reading:
http://status.aws.amazon.com/s3-20080720.html
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html

-Steve
