hadoop-hdfs-dev mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: corrupted edits log after power failure
Date Mon, 26 Sep 2011 14:34:22 GMT
On 22/09/11 20:15, Brian Bockelman wrote:
> Hi Gabi,
>
> I'd be a bit scared of that backup strategy; what happens if the TCP
> connection gets cut suddenly during curl?  What happens if there's a TCP
> corruption?  Such things have happened before.

Curl might work for long-haul backups, but I'd use HTTPS for its better
checksums, and have alternate in-cluster strategies, such as shared HA
filesystems.
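
One way to guard against a truncated or corrupted transfer is to compare the fetched copy against a digest obtained separately from the source host. The following is only a minimal sketch under that assumption: `file_md5` and `copy_is_intact` are hypothetical helper names, and how the expected digest reaches the backup host is left open.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_intact(path, expected_md5):
    """True if the local copy matches the digest computed on the source host.

    expected_md5 is assumed to arrive out of band (hypothetical); a short or
    corrupted download produces a different digest and is rejected.
    """
    return file_md5(path) == expected_md5.strip().lower()
```

A backup job would keep the previous good copy until `copy_is_intact` returns True for the new one, so a cut connection never overwrites a usable backup.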

>
> Personally, we have the SNN merge the edits every 15 minutes.  If it
> hasn't happened in 30 minutes, people get emailed.  If it doesn't happen
> in 45 minutes, people get paged.

That's a good technique for verifying the SNN is actually working.
Thinking it is working when it isn't is dangerous.
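
The escalation Brian describes (email at 30 minutes, page at 45) could be sketched roughly as below. This assumes the age of the last merge can be read from the modification time of the checkpointed image file; the function names and the path argument are hypothetical, and the actual notification hooks are left out.

```python
import os
import time

# Thresholds taken from the message above: email at 30 minutes, page at 45.
EMAIL_AFTER_S = 30 * 60
PAGE_AFTER_S = 45 * 60


def alert_level(age_seconds):
    """Map the age of the last successful merge to an escalation level."""
    if age_seconds >= PAGE_AFTER_S:
        return "page"
    if age_seconds >= EMAIL_AFTER_S:
        return "email"
    return "ok"


def checkpoint_age(image_path, now=None):
    """Seconds since the image file was last written.

    Assumption: the file's mtime reflects the last completed merge.
    """
    now = time.time() if now is None else now
    return now - os.path.getmtime(image_path)
```

Run from cron every few minutes, `alert_level(checkpoint_age(path))` gives the monitoring system one of three states to act on.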

> In addition to writing out copies to a few disks and to NFS, we also have
> a versioned backup of the checkpoint.prev.
>
> The worst case scenario would be if the SNN corrupts the image and uploads
> the corrupt image (it's a theoretical situation so far...); this would be
> caught at the next merge, meaning we trash up to 30 minutes of work.  This
> would ruin someone's day, but not someone's week.
>
> The NN is a SPOF, and should be treated with an appropriate level of
> paranoia (and, because it is a SPOF, assume that it will fail anyway and
> make sure you can accept the consequences).

That is: test your handling of the outage on a regular basis.

