From: "Todd Lipcon (JIRA)"
To: hdfs-issues@hadoop.apache.org
Date: Wed, 23 Jun 2010 15:54:50 -0400 (EDT)
Subject: [jira] Commented: (HDFS-1264) 0.20: OOME in HDFS client made an unrecoverable HDFS block
Message-ID: <32707500.22831277322890110.JavaMail.jira@thor>
In-Reply-To: <12062439.22801277322651551.JavaMail.jira@thor>

    [ https://issues.apache.org/jira/browse/HDFS-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881843#action_12881843 ]

Todd Lipcon commented on HDFS-1264:
-----------------------------------

The first OOME happened with this trace, which apparently borked the checksum buffer in some interesting way:

Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$Packet.<init>(DFSClient.java:2204)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:3085)
        at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:150)
        at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:132)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3168)
        at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)

Will upload the logs for this block as well.

> 0.20: OOME in HDFS client made an unrecoverable HDFS block
> ----------------------------------------------------------
>
>                 Key: HDFS-1264
>                 URL: https://issues.apache.org/jira/browse/HDFS-1264
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node, hdfs client
>    Affects Versions: 0.20-append
>            Reporter: Todd Lipcon
>             Fix For: 0.20-append
>
>         Attachments: blk_logs_sorted.txt
>
>
> Ran into a bad issue in testing overnight. One of the writers experienced an OOME in the middle of writing a checksum chunk to the stream inside a sync() call. It then proceeded to retry recovery on each DN in the pipeline, but each recovery failed because its internal checksum buffer was borked in some way - on the DNs I see "Unexpected checksum mismatch" errors after each recovery attempt.
> When another client tried to recover the file using appendFile, it got the "Partial CRC 3766269197 does not match value computed the last time file was closed" error (plus there was only one replica left in targets). It thus failed to set up the append pipeline, and ran into HDFS-1262.
> This was on 0.20-append, though it may happen on trunk as well.
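
To illustrate the kind of failure behind the "Partial CRC ... does not match" error quoted in the issue, here is a minimal Java sketch. It is not the HDFS implementation. It assumes a CRC32 checksum (via java.util.zip.CRC32) stored for the last partial chunk and recomputed over the bytes actually on disk; the class and method names (PartialCrcSketch, crcOf, partialCrcMatches) are hypothetical and do not exist in Hadoop.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class PartialCrcSketch {
    // CRC32 over the first `len` bytes of `data`.
    static long crcOf(byte[] data, int len) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, len);
        return crc.getValue();
    }

    // Hypothetical check: recompute the CRC of the partial chunk on disk
    // and compare it with the checksum stored when the data was written.
    static boolean partialCrcMatches(byte[] onDisk, int len, long storedCrc) {
        return crcOf(onDisk, len) == storedCrc;
    }

    public static void main(String[] args) {
        byte[] chunk = "partial chunk written before sync()"
                .getBytes(StandardCharsets.UTF_8);
        long stored = crcOf(chunk, chunk.length);

        // Data and stored checksum agree: the check passes.
        System.out.println(partialCrcMatches(chunk, chunk.length, stored)); // true

        // Simulate data and checksum getting out of step: flip a single bit.
        byte[] corrupted = chunk.clone();
        corrupted[0] ^= 0x01;
        System.out.println(partialCrcMatches(corrupted, corrupted.length, stored)); // false
    }
}
```

Under these assumptions, any disagreement between the stored checksum and the bytes it is supposed to cover makes the comparison fail, which is consistent with the mismatch errors reported after the OOME.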