hadoop-hdfs-dev mailing list archives

From Colin McCabe <cmcc...@alumni.cmu.edu>
Subject Re: data loss after cluster wide power loss
Date Tue, 09 Jul 2013 01:53:40 GMT
Thanks.  Suresh and Kihwal are right: renames are journalled, but the
journal entry is not necessarily durable (stored to disk) by the time
rename returns.  I was getting mixed up with HDFS semantics, in which
we actually do make the journal durable before returning success to
the client.

It might be a good idea for HDFS to fsync the file descriptor of the
directories involved in the rename operation, before assuming that the
operation is durable.
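
To make the idea concrete, here is a minimal sketch in Python (not the
actual HDFS code, which is Java) of what "fsync the directories involved
in the rename" could look like on Linux.  The helper names fsync_dir and
durable_rename are made up for illustration, and the sketch assumes the
file's own data was already fsynced before the rename.

```python
import os

def fsync_dir(path):
    """Flush a directory's entries to disk by fsyncing a descriptor for it."""
    fd = os.open(path, os.O_RDONLY)  # on Linux a directory can be opened read-only
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

def durable_rename(src, dst):
    """Rename src to dst, then fsync the parent directories involved.

    Hypothetical helper: assumes the file contents were fsynced earlier.
    """
    os.rename(src, dst)
    src_dir = os.path.dirname(os.path.abspath(src))
    dst_dir = os.path.dirname(os.path.abspath(dst))
    fsync_dir(dst_dir)           # new directory entry
    if src_dir != dst_dir:
        fsync_dir(src_dir)       # removal of the old entry
```

Only after durable_rename returns would the caller treat the rename as
having survived a power loss.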

If you're using ext{2,3,4}, a quick fix would be to use mount -o
dirsync.  I haven't tested it out, but it's supposed to make these
operations synchronous.

From the man page:
              All directory updates within the filesystem should be done
              synchronously.  This affects the following system calls: creat,
              link, unlink, symlink, mkdir, rmdir, mknod and rename.
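
If you go the dirsync route, it may be worth verifying that the option
actually took effect on each data volume.  A rough sketch, assuming Linux
and that /proc/mounts lists dirsync among the mount options when it is
set; the function names and the /data/1 mount point are hypothetical:

```python
def mount_options(mount_point, mounts_text):
    """Return the option list for mount_point from /proc/mounts-style text,
    or None if the mount point is not listed."""
    opts = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            opts = fields[3].split(",")  # a later entry overrides an earlier one
    return opts

def has_dirsync(mount_point, mounts_path="/proc/mounts"):
    """True if mount_point is currently mounted with the dirsync option."""
    with open(mounts_path) as f:
        opts = mount_options(mount_point, f.read())
    return opts is not None and "dirsync" in opts
```

For example, has_dirsync("/data/1") could be called once per configured
data directory at startup.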


On Wed, Jul 3, 2013 at 10:19 AM, Suresh Srinivas <suresh@hortonworks.com> wrote:
> On Wed, Jul 3, 2013 at 8:12 AM, Colin McCabe <cmccabe@alumni.cmu.edu> wrote:
>> On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <suresh@hortonworks.com> wrote:
>> > Dave,
>> >
>> > Thanks for the detailed email. Sorry I did not read all the details you
>> > had sent earlier completely (on my phone). As you said, this is not
>> > related to data loss related to HBase log and hsync. I think you are
>> > right; the rename operation itself might not have hit the disk. I think
>> > we should either ensure metadata operation is synced on the datanode or
>> > handle it being reported as blockBeingWritten. Let me spend sometime to
>> > debug this issue.
>>
>> In theory, ext3 is journaled, so all metadata operations should be
>> durable in the case of a power outage.  It is only data operations
>> that should be possible to lose.  It is the same for ext4.  (Assuming
>> you are not using nonstandard mount options.)
> ext3 journal may not hit the disk right. From what I read, if you do not
> specifically call sync, even the metadata operations do not hit disk.
> See - https://www.kernel.org/doc/Documentation/filesystems/ext3.txt
> commit=nrsec    (*)     Ext3 can be told to sync all its data and metadata
>                         every 'nrsec' seconds. The default value is 5 seconds.
>                         This means that if you lose your power, you will lose
>                         as much as the latest 5 seconds of work (your
>                         filesystem will not be damaged though, thanks to the
>                         journaling).  This default value (or any low value)
>                         will hurt performance, but it's good for data-safety.
>                         Setting it to 0 will have the same effect as leaving
>                         it at the default (5 seconds).
>                         Setting it to very large values will improve
>                         performance.
