hadoop-common-user mailing list archives

From: Edward Capriolo <edlinuxg...@gmail.com>
Subject: Re: Is it safe to set default/minimum replication to 2?
Date: Thu, 22 Jul 2010 04:28:58 GMT
On Wed, Jul 21, 2010 at 11:45 PM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:
> Hi Bobby,
> We keep 2 or so replicas here at Nebraska.  We have about 800TB of raw space.
> As a rule of thumb, we:
> 1) Increase the replication of extremely important files. We are a site
> for the LHC, so a large part of our data is stored on tape elsewhere (but
> not everything!). It's an operational pain to re-download a few tens of
> TB, but not the end of the world. [see the sketch after this list]
> 2) Estimate that we will lose 1 file per month due to disk corruption and loss.
> 3) Make sure our management understands the risks due to (2) and what
> would occur during a double-node failure.
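
As a concrete sketch of item 1 (the paths here are hypothetical): raising
replication on selected files uses the stock HDFS shell, where -w waits for
the re-replication to complete before returning:

    # Raise a hypothetical critical file from the cluster default to 3 copies
    hadoop fs -setrep -w 3 /data/lhc/critical-dataset

    # Or recursively for a whole tree
    hadoop fs -setrep -R -w 3 /data/lhc/critical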
> There are 2 other similarly sized sites that roughly follow the same rules.
> We haven't discovered any fatal software bugs that cause data loss since
> the various ones in 0.19 were ironed out.
> Brian
> On Jul 21, 2010, at 8:29 PM, Bobby Dennett wrote:
>> The team that manages our Hadoop clusters is currently being pressured
>> to reduce block replication from 3 to 2 in our production cluster. This
>> request is for various reasons -- particularly the reduction of used
>> space in the cluster and the potential for fewer write operations -- but
>> from what I've read previously, it seems to be strongly discouraged.
>> Of course I can't find it now, but I recall a post that Doug Cutting
>> was involved in which stated that replication 3 is something like 100
>> times "safer" than replication 2. If I remember correctly, there was
>> mention of potential NameNode bugs that could introduce undetected
>> corrupted/missing replicas, so the idea was that with more replicas,
>> the chance of such a bug causing actual data loss is much lower. On a
>> related note, it seems that the companies using a reduced replication
>> factor (e.g. Facebook) have also built an application layer on top of
>> Hadoop to handle exceptions, corruption issues, etc. Unfortunately, we
>> do not currently have the resources to do something similar.
>> For anyone currently using a replication of 2 in production, can you
>> please share your experience and any issues you may have encountered?
>> Also, I would appreciate any thoughts about whether a replication factor
>> of 2 can be considered "safe".
>> Thanks in advance,
>> -Bobby
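
One back-of-envelope way to arrive at the "100 times safer" figure Bobby
recalls (an illustration, not something from the thread): assume each disk
holding a replica fails independently with probability p before the NameNode
can re-replicate. A block is lost only if all r replicas die, i.e. with
probability p^r, so going from r=2 to r=3 buys a factor of 1/p, which is
100x at p = 1%:

    # Back-of-envelope only: p is an assumed, illustrative per-disk
    # failure probability within one re-replication window.
    p = 0.01
    for r in (2, 3):
        print("replicas=%d  P(block lost)=%g" % (r, p ** r))
    # replicas=2  P(block lost)=0.0001
    # replicas=3  P(block lost)=1e-06   (100x smaller)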

+1 on Brian's points above.
We use a replication of 2 and I have not seen a specific problem from it.
But when you get a Nagios alert at the beach on a Saturday saying that you
lost a node, you sweat a little more, and it is not from the sun.
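
For reference, the knobs in the subject line map to two hdfs-site.xml
properties; a minimal sketch, assuming a 0.20-era cluster (the second
property was later renamed dfs.namenode.replication.min):

    <!-- Default number of copies per block, applied at file-creation time -->
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>

    <!-- Minimum replicas a write must reach to succeed; commonly left at 1 -->
    <property>
      <name>dfs.replication.min</name>
      <value>1</value>
    </property>

Note that dfs.replication is only a client-side default: changing it does
not touch existing files, which need an explicit hadoop fs -setrep as in the
sketch above.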
