hadoop-common-user mailing list archives

From feedly team <feedly...@gmail.com>
Subject re-replication after data node failure
Date Wed, 26 Mar 2014 14:51:41 GMT
We recently had a node die in our HBase cluster. Afterwards, we saw a huge
increase in network traffic and disk I/O as HDFS re-replicated data from the
dead node. This negatively affected our application, and we are trying to see
if there is a way to slow down this process so the app can still run (if a
bit slower).

Is the balancer job responsible for re-replication? This was our first
thought but the docs mostly mention balancing disk utilization rather than
restoring the replication factor, so we aren't sure if it's responsible or
if it's some other process.

If it is indeed the balancer, we saw there is a dfs.balance.bandwidthPerSec
setting that we could change. The default is 1MB/s. Does this mean that each
node sends and receives at most 1MB/s during balancing? We saw much, much
higher sustained traffic than this. The levels we saw would be roughly
correct if this is the in + out limit per data node pair, i.e. if you have
a 5 node cluster, node1 would be limited to 1MB/s to each of the other 4
nodes, meaning the node would experience 4MB/s of traffic.
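
To make the arithmetic concrete, here is a rough sketch of the two readings
of the limit we have in mind. The function names are our own, purely for
illustration, and neither reading is confirmed as how HDFS actually applies
the setting; it only reproduces the 5 node example above.

```python
# Illustrative only: two hypothetical readings of a 1MB/s bandwidth limit.
# Neither function reflects confirmed HDFS behavior.
MB = 1024 * 1024


def per_node_cap(limit_bytes_per_sec: int, num_nodes: int) -> int:
    """Reading A: the limit caps each data node's traffic as a whole."""
    return limit_bytes_per_sec


def per_pair_cap(limit_bytes_per_sec: int, num_nodes: int) -> int:
    """Reading B: the limit applies separately to each pair of data nodes,
    so one node can carry up to (num_nodes - 1) times the limit."""
    return limit_bytes_per_sec * (num_nodes - 1)


limit = 1 * MB   # the 1MB/s default mentioned above
nodes = 5        # the 5 node cluster from the example

print(per_node_cap(limit, nodes) / MB)  # 1.0 (MB/s per node)
print(per_pair_cap(limit, nodes) / MB)  # 4.0 (MB/s per node)
```

The traffic we observed was far above what reading A would allow, which is
why we suspect either reading B or a different process altogether.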
