hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5728) [Diskfull] Block recovery will fail if the metafile not having crc for all chunks of the block
Date Wed, 22 Jan 2014 21:05:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879179#comment-13879179 ]

Kihwal Lee commented on HDFS-5728:

The approach seems okay. It is actually what I did manually to recover. The new test case
seems to be adequate.
There are unnecessary lines of code added, though.

+          // truncate blockFile
+          blockRAF.setLength(validFileLength);
+          // read last chunk
+          blockRAF.seek(lastChunkStartPos);
+          blockRAF.readFully(b, 0, lastChunkSize);

In the above, the last chunk of the block doesn't have to be read. In {{truncateBlock()}},
which is called during {{recoverRbw()}}, the read is needed in order to recompute the checksum
and write it out to the meta file. It is done this way because simply truncating the meta file
would cause a checksum mismatch if the new block size doesn't align with the chunk size. In this
jira, the read is not necessary since meta files are not truncated.
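
To illustrate why a truncation that also shortens the meta file has to re-read the last chunk,
here is a minimal sketch. It is not the HDFS {{truncateBlock()}} code: the class and method names,
the {{headerLen}} parameter, and the use of {{java.util.zip.CRC32}} with a 4-byte checksum are
assumptions made for illustration (HDFS uses its own checksum classes and meta file header).

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.zip.CRC32;

// Hypothetical sketch, not HDFS code: truncate a block file and keep the
// meta file consistent by recomputing the checksum of the last chunk.
class TruncateSketch {
    static void truncate(RandomAccessFile block, RandomAccessFile meta,
                         long newLen, int bytesPerChecksum, int headerLen)
            throws IOException {
        final int checksumSize = 4; // assumption: one 4-byte CRC per chunk
        if (newLen > block.length()) {
            throw new IllegalArgumentException("newLen exceeds block file length");
        }
        // Number of chunks needed to cover newLen; the last one may be partial.
        long numChunks = (newLen + bytesPerChecksum - 1) / bytesPerChecksum;

        // Truncate the block file.
        block.setLength(newLen);
        if (numChunks == 0) {
            meta.setLength(headerLen); // nothing left to checksum
            return;
        }

        // Read the last chunk so its checksum can be recomputed. If newLen is
        // not chunk-aligned, the checksum stored for this chunk covered more
        // bytes than remain, so it no longer matches the data.
        long lastChunkStart = (numChunks - 1) * bytesPerChecksum;
        int lastChunkSize = (int) (newLen - lastChunkStart);
        byte[] b = new byte[lastChunkSize];
        block.seek(lastChunkStart);
        block.readFully(b, 0, lastChunkSize);

        CRC32 crc = new CRC32();
        crc.update(b, 0, lastChunkSize);

        // Truncate the meta file and overwrite the last checksum entry.
        meta.setLength(headerLen + numChunks * checksumSize);
        meta.seek(headerLen + (numChunks - 1) * checksumSize);
        meta.writeInt((int) crc.getValue());
    }
}
```

When only the block file is truncated down to a length the meta file already covers on a chunk
boundary, the existing checksums stay valid and the read step can be dropped.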

It made me think about the case where a block file is smaller than expected. With the current
code, 0 will be returned as the size. Instead, we could truncate the meta file if the block
file length is non-zero. But this should be rare since a block file is written before the
corresponding meta file.
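
The two length cases can be sketched as simple arithmetic. This is a hypothetical helper, not
code from the patch: it assumes a 7-byte meta file header, 512-byte chunks, and 4-byte checksums,
which is consistent with the lengths in the report (both are multiples of 512) but is not
confirmed there. It returns the smaller of the two lengths, which corresponds to the
alternative suggested above rather than to the patch, which returns 0 when the block file is
the shorter one.

```java
// Hypothetical sketch: how many block bytes are usable given the lengths of
// the block file and its meta file. All sizes are illustrative assumptions.
class ValidLengthSketch {
    // Bytes of block data that the checksums in the meta file can cover.
    static long lengthCoveredByMeta(long metaFileLen, int headerLen,
                                    int bytesPerChecksum, int checksumSize) {
        if (metaFileLen < headerLen) {
            return 0;
        }
        long numChunks = (metaFileLen - headerLen) / checksumSize;
        return numChunks * bytesPerChecksum;
    }

    // Usable length: limited by whichever file is shorter.
    static long validLength(long blockFileLen, long metaFileLen, int headerLen,
                            int bytesPerChecksum, int checksumSize) {
        return Math.min(blockFileLen,
                lengthCoveredByMeta(metaFileLen, headerLen,
                        bytesPerChecksum, checksumSize));
    }

    public static void main(String[] args) {
        // Meta file holding checksums for 121984 chunks of 512 bytes.
        long metaLen = 7 + 121984L * 4;
        // Block file is longer than what the meta file covers.
        System.out.println(validLength(62484480L, metaLen, 7, 512, 4)); // prints 62455808
    }
}
```

With these assumed sizes, the 28672-byte gap between the two lengths in the report corresponds
to 56 chunks of data without a checksum.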

> [Diskfull] Block recovery will fail if the metafile not having crc for all chunks of the block
> ----------------------------------------------------------------------------------------------
>                 Key: HDFS-5728
>                 URL: https://issues.apache.org/jira/browse/HDFS-5728
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 0.23.10, 2.2.0
>            Reporter: Vinay
>            Assignee: Vinay
>            Priority: Critical
>         Attachments: HDFS-5728.patch, HDFS-5728.patch
> 1. A client (regionserver) has opened a stream to write its WAL to HDFS. This is not a one-time upload; data is written slowly.
> 2. One of the DataNodes got a full disk (some other data filled up the disks).
> 3. Unfortunately, the block was being written to only this DataNode in the cluster, so the client write also failed.
> 4. After some time, disk space was freed and all processes were restarted.
> 5. Now HMaster tries to recover the file by calling recoverLease.
> At this time, recovery was failing with a file length mismatch.
> When checked,
>  actual block file length: 62484480
>  Calculated block length: 62455808
> This was because the metafile had crc for only 62455808 bytes, and it considered 62455808 as the block size.
> No matter how many times it was retried, recovery kept failing.

This message was sent by Atlassian JIRA
