hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1264) 0.20: OOME in HDFS client made an unrecoverable HDFS block
Date Wed, 23 Jun 2010 19:54:50 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881843#action_12881843 ]

Todd Lipcon commented on HDFS-1264:

The first OOME happened with this trace, which apparently borked the checksum buffer in some
interesting way:

Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$Packet.<init>(DFSClient.java:2204)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:3085)
        at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:150)
        at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:132)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3168)
        at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)

Will upload the logs for this block as well.

> 0.20: OOME in HDFS client made an unrecoverable HDFS block
> ----------------------------------------------------------
>                 Key: HDFS-1264
>                 URL: https://issues.apache.org/jira/browse/HDFS-1264
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node, hdfs client
>    Affects Versions: 0.20-append
>            Reporter: Todd Lipcon
>             Fix For: 0.20-append
>         Attachments: blk_logs_sorted.txt
> Ran into a bad issue in testing overnight. One of the writers experienced an OOME in
> the middle of writing a checksum chunk to the stream inside a sync() call. It then proceeded
> to retry recovery on each DN in the pipeline, but each recovery failed because its internal
> checksum buffer was borked in some way - on the DNs I see "Unexpected checksum mismatch"
> errors after each recovery attempt.
> When another client tried to recover the file using appendFile, it got the "Partial CRC
> 3766269197 does not match value computed the last time file was closed" error (plus there
> was only one replica left in targets). It thus failed to set up the append pipeline, and ran
> into HDFS-1262.
> This was on 0.20-append, though it may happen on trunk as well.
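The failure mode described above — internal state drained before the downstream hand-off succeeds — can be sketched as follows. This is a hypothetical illustration, not the real DFSClient/FSOutputSummer code; the class name SummerSketch and its methods are invented for the example:

```java
import java.util.zip.CRC32;

// Hypothetical sketch of the bug pattern: the summer treats its buffer
// as consumed *before* the packet allocation succeeds, so an
// OutOfMemoryError thrown in between silently loses the buffered chunk.
class SummerSketch {
    private final byte[] buf = new byte[512];
    private int count = 0;
    private boolean failNextAllocation = true; // simulate one OOME

    void write(byte[] data) {
        System.arraycopy(data, 0, buf, count, data.length);
        count += data.length;
    }

    long flushBuffer() {
        CRC32 crc = new CRC32();
        crc.update(buf, 0, count);
        count = 0;          // buffer treated as consumed...
        allocatePacket();   // ...but this step can still throw
        return crc.getValue();
    }

    private void allocatePacket() {
        if (failNextAllocation) {
            failNextAllocation = false;
            throw new OutOfMemoryError("simulated: Packet.<init>");
        }
    }
}
```

After the simulated OOME, a retrying flush checksums an empty buffer rather than the data the datanode already saw — analogous to the "Unexpected checksum mismatch" errors reported on each recovery attempt.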

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
