accumulo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Steve Loughran <>
Subject Re: Rack and Datacenter Awareness
Date Fri, 24 Jan 2014 10:25:22 GMT
there isn't any multi-DC support in hadoop core right now -block placement,
work scheduling is done on the assumption that there is a bandwidth cost
for working across that backbone, but it is still gigabit rate all the way
up its tree. Even if your cross site bandwidth is that high, other
assumptions in the Hadoop code surface

   - latency of cross-rack communications is low and not significantly
   different from intra-rack comms
   - protocols between machines can assume that packet failures are the
   kind that surface on a LAN, not a WAN (lower failure rate, less
   buffered/bursty traffic)
   - external infrastructure (like DNS, NTP) are consistent everywhere
   - Network partitions are rare and significant enough to react to by
   re-replicating data -an expensive operation and overkill for a transient
   WAN outage.

Zookeeper will be extra-brittle here, as group membership protocols tend to
have aggressive timeouts to detect loss of members fast. I'd be really
surprised if it worked well across >1 site.

Wandisco do provide multi-DC hadoop support for Hadoop platforms: - I don't know what that
does about applications running on the cluster, or depend on ZK.

I suspect that anyone working across sites is going to have to run ZK &
Accumulo on each, and treat operations that span sites as separate queries
where the results need to be merged in -something that would need to go
into the query engines, though for now you could build a workflow that did
an MR on each site, then an final reduce on one of the sites


On 22 January 2014 19:01, Jeff N <> wrote:

> @Adam
> I am currently interested with the latter half of your second question. My
> main interest lies in determining how to optimize data processing. If I
> have
> two data centers that are geographically far apart and I am working on a
> local machines but I need data from the second data center, how do I have
> the processing occur on the second data center? The constraints to this
> problem include a lack of empirical knowledge of the HDFS node that the
> data
> contains, but is within the network topology I currently reside in.
> Furthermore, it pertains to Map/Reduce jobs that utilize the
> AccumuloInputFormat. Is it possible to have the distant data center process
> my Mapper and send me the resulting data set instead of processing the
> Mapper locally and making numerous network queries?
> -----
> --
> View this message in context:
> Sent from the Developers mailing list archive at

NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message