hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-970) FSImage writing should always fsync before close
Date Fri, 14 May 2010 06:30:45 GMT

    [ https://issues.apache.org/jira/browse/HDFS-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867423#action_12867423 ]

Todd Lipcon commented on HDFS-970:
----------------------------------

To prove that the fsync is indeed necessary, I performed the following test on my desktop (Linux 2.6.31):

1) Create a loop device with 1GB of backing storage:
{code}# dd if=/dev/zero of=myloop bs=1M count=1000
# losetup -f myloop{code}
(losetup -f attaches the backing file to the first free loop device; here that was /dev/loop1, which the next step uses.)
2) Make a "faulty" type md array:
{code}# mdadm --create /dev/md0 --level=faulty --raid-devices=1  /dev/loop1{code}
3) Format it as ext4:
{code}# mkfs.ext4 /dev/md0{code}
4) Mount it at /mnt:
{code}# mount -t ext4 /dev/md0 /mnt{code}
5) Run the following Python script from within /mnt, so the test files land on the faulty device:
{code}
#!/usr/bin/env python
import os

for idx in xrange(1, 100000):
  f = file("file_%d_ckpt" % idx, "w")   # write a checkpoint file first...
  for line in xrange(0, 1000000):
    print >>f, "hello world! this is line %d " % line
  f.close()                             # note: no flush/fsync before close
  os.rename("file_%d_ckpt" % idx, "file_%d" % idx)   # ...then rename into place
  print "Saved file %d" % idx
{code}

6) While the script is running, block all writes to the disk (this essentially freezes the disk, as if a power outage had occurred):
{code}# mdadm --grow /dev/md0 -l faulty -p write-all{code}
Script output:
{code}
Saved file 1
Saved file 2
Saved file 3
Saved file 4
Saved file 5
Traceback (most recent call last):
  File "/home/todd/disk-fault/test.py", line 7, in <module>
    print >>f, "hello world! this is line %d " % line
IOError: [Errno 30] Read-only file system
{code}
[ext4 automatically remounts itself read-only]
7) umount /mnt, clear the fault with -p clear, remount /mnt
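Reconstructed, those commands would be as follows (the exact invocations weren't captured; the -p clear form mirrors the -p write-all syntax from step 6):
{code}# umount /mnt
# mdadm --grow /dev/md0 -l faulty -p clear
# mount -t ext4 /dev/md0 /mnt{code}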
8) results of ls -l:
{code}
root@todd-desktop:/mnt# ls -l
total 16
-rw-r--r-- 1 root root     0 2010-05-13 23:11 file_1
-rw-r--r-- 1 root root     0 2010-05-13 23:11 file_2
-rw-r--r-- 1 root root     0 2010-05-13 23:11 file_3_ckpt
drwx------ 2 root root 16384 2010-05-13 22:45 lost+found
{code}

I then modified the test script to add f.flush() and os.fsync(f.fileno()) right before the close(), and ran the exact same test.
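The modified loop presumably looked like this (a reconstruction from the description above, not a copy of the actual script; note that os.fsync lands on line 9, matching the traceback below):
{code}
#!/usr/bin/env python
import os

for idx in xrange(1, 100000):
  f = file("file_%d_ckpt" % idx, "w")
  for line in xrange(0, 1000000):
    print >>f, "hello world! this is line %d " % line
  f.flush()               # flush Python's userspace buffer to the kernel
  os.fsync(f.fileno())    # force the kernel to write the data blocks to disk
  f.close()
  os.rename("file_%d_ckpt" % idx, "file_%d" % idx)
  print "Saved file %d" % idx
{code}

Results: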
{code}
root@todd-desktop:/mnt# ~/disk-fault/test.py 
Saved file 1
Saved file 2
Traceback (most recent call last):
  File "/home/todd/disk-fault/test.py", line 9, in <module>
    os.fsync(f.fileno())

[umount, clear fault, remount]

root@todd-desktop:/mnt# ls -l
total 66208
-rw-r--r-- 1 root root 33888890 2010-05-13 23:20 file_1
-rw-r--r-- 1 root root 33888890 2010-05-13 23:20 file_2_ckpt
drwx------ 2 root root    16384 2010-05-13 22:45 lost+found
{code}

I tried the same test on ext3, and without the fsync the files disappeared entirely. The same was true of XFS. Adding the fsync before close fixed the issue in all cases.
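As the issue description below notes, the Java-side equivalent of this fsync is FileChannel.force before closing the image file. A minimal sketch of that pattern (illustrative only; see the attached hdfs-970.txt for the actual patch):
{code}
import java.io.FileOutputStream;
import java.io.IOException;

public class DurableWrite {
  static void writeDurably(String path, byte[] data) throws IOException {
    FileOutputStream out = new FileOutputStream(path);
    try {
      out.write(data);
      out.flush();
      // fsync: force data (and metadata, with force(true)) to disk before close
      out.getChannel().force(true);
    } finally {
      out.close();
    }
  }
}
{code}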


> FSImage writing should always fsync before close
> ------------------------------------------------
>
>                 Key: HDFS-970
>                 URL: https://issues.apache.org/jira/browse/HDFS-970
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.1, 0.21.0, 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>         Attachments: hdfs-970.txt
>
>
> Without an fsync, it's common for filesystems to delay writing metadata
> to the journal until all of the data blocks have been flushed. If the
> system crashes while the dirty pages haven't been flushed, the file is
> left in an indeterminate state. In some filesystems (e.g. ext4) this
> results in a 0-length file. In others (e.g. XFS) it results in the
> correct length but any number of data blocks getting zeroed. Calling
> FileChannel.force before closing the FSImage prevents this issue.
