hadoop-common-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: To retrieve data on dead node
Date Wed, 29 Jul 2009 18:28:13 GMT
On Wed, Jul 29, 2009 at 8:51 AM, bhushan_mahale <bhushan_mahale@persistent.co.in> wrote:

> Hi,
> What are the possible ways to retrieve the data if a node goes down in a
> Hadoop cluster?
> Assuming replication factor as 3, and 3 nodes goes down in a 10 node
> cluster, how do we retrieve the data?

Hi Bhushan,

If 3 nodes go down at the same time, any block whose three replicas all sat
on those nodes will become inaccessible. If you cannot recover at least one
of those nodes, you will have no way to recover those blocks. If you can
recover at least one, then the blocks will become available at replication
count 1. The NN will notice the underreplicated blocks and trigger
rereplication to get them back up to 3.
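
To put a rough number on the 3-of-10 case, here is a minimal sketch. It
assumes each block's replicas land on distinct nodes chosen uniformly at
random, which is a simplification and not HDFS's actual placement policy;
the function name is made up for illustration.

```python
from math import comb


def fraction_blocks_lost(num_nodes: int, replication: int, failed: int) -> float:
    """Chance that every replica of a single block sits on a failed node.

    Assumption (not HDFS's real policy): replicas are placed on distinct
    nodes chosen uniformly at random.
    """
    if failed < replication:
        # Fewer failed nodes than replicas: at least one replica survives.
        return 0.0
    # Ways to put all replicas on failed nodes / ways to place them anywhere.
    return comb(failed, replication) / comb(num_nodes, replication)


if __name__ == "__main__":
    # 10-node cluster, replication factor 3, 3 nodes down at once.
    print(fraction_blocks_lost(10, 3, 3))
```

Under that assumption roughly 1 block in 120 loses all of its replicas, so
a cluster holding many thousands of blocks should expect some loss.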

If your nodes fail one-by-one with some time in between, the NN should have
time to trigger rereplication between them and the blocks will never be
unavailable.

In general, simultaneous failures occur in two ways in the datacenter. One
is that the entire datacenter loses power (or is forced to shut down due to
lost cooling). In this case, no amount of replication within the DC will help.
The other failure is that power (or network) is lost to an entire rack,
either due to a switch failure or a failed PDU. If you've configured
Hadoop's rack-awareness, it will ensure that each block is replicated on at
least two racks to mitigate the downside of a rack loss.

Depending on your particular setup, it may be worth spreading your 10-node
cluster across separate power circuits and configuring each circuit as a
separate "rack" in Hadoop, if you're concerned about flaky rack PDUs.

Hope that helps

> Thanks,
> - Bhushan
