Message-ID: <1777052070.1225837004374.JavaMail.jira@brutus>
Date: Tue, 4 Nov 2008 14:16:44 -0800 (PST)
From: "Jason (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-3914) checksumOk implementation in DFSClient can break applications
In-Reply-To: <438292809.1218087944241.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

    [ https://issues.apache.org/jira/browse/HADOOP-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645091#action_12645091 ]

Jason commented on HADOOP-3914:
-------------------------------
We applied this patch on a machine that was failing to read a directory of files (reads of the individual files were fine):

hadoop dfs -text path_to_directory/'*'

08/11/04 14:08:32 [main] INFO fs.FSInputChecker: java.io.IOException: Checksum ok was sent and should not be sent again
	at org.apache.hadoop.dfs.DFSClient$BlockReader.read(DFSClient.java:863)
	at org.apache.hadoop.dfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1392)
	at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1428)
	at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1377)
	at java.io.DataInputStream.readInt(DataInputStream.java:370)
	at org.apache.hadoop.io.SequenceFile$Metadata.readFields(SequenceFile.java:725)
	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1511)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
	at org.apache.hadoop.fs.FsShell$TextRecordInputStream.<init>(FsShell.java:365)
	at org.apache.hadoop.fs.FsShell.forMagic(FsShell.java:403)
	at org.apache.hadoop.fs.FsShell.access$200(FsShell.java:49)
	at org.apache.hadoop.fs.FsShell$2.process(FsShell.java:419)
	at org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1865)
	at org.apache.hadoop.fs.FsShell.text(FsShell.java:413)
	at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1532)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:1730)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847)

> checksumOk implementation in DFSClient can break applications
> -------------------------------------------------------------
>
> Key: HADOOP-3914
> URL: https://issues.apache.org/jira/browse/HADOOP-3914
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.17.1
> Reporter: Christian Kunz
> Assignee: Christian Kunz
> Priority: Blocker
> Fix For: 0.18.2
>
> Attachments: checksumOk.patch, checksumOk1-br18.patch, checksumOk1.patch, patch.HADOOP-3914
>
>
> One of our non-map-reduce applications (written in C and using libhdfs to access dfs) stopped working after switching from 0.16 to 0.17.
> The problem was finally traced down to failures in checksumOk.
> I would assume the purpose of checksumOk is for a DFSClient to indicate to a sending datanode that the checksum of the received block is okay. This must be useful in the replication pipeline.
> checksumOk is implemented so that any IOException is caught and ignored, probably because it is not essential for the client that the message succeed.
> But it proved fatal for our application, because this application links in a 3rd-party library which seems to catch socket exceptions before libhdfs does.
> Why was there an exception? In our case the application reads a file into the local buffer of the DFSInputStream, which is large enough to hold all the data; the application reads to the end, and checksumOk is sent successfully at that time. But then the application does some other work and comes back to re-read the file (still open). It is then that it calls checksumOk again and crashes.
> This can easily be avoided by adding a boolean that makes sure checksumOk is called exactly once when end-of-stream is encountered. Redundant calls to checksumOk do not seem to make sense anyhow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
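The fix the reporter describes, a boolean guard so the checksumOk acknowledgment is sent at most once per block, can be sketched as below. This is a minimal illustration, not the actual DFSClient code: the class name BlockReaderSketch, the field sentChecksumOk, and the counter are hypothetical, and the real implementation would write a status code back to the datanode where the comment indicates.

```java
// Hypothetical sketch of the guard described in the issue: once the
// end of the block has been acknowledged, later calls (e.g. when the
// application re-reads a still-open file) become no-ops instead of
// attempting a second write on the socket and raising an IOException.
public class BlockReaderSketch {
    private boolean sentChecksumOk = false; // assumed new field per the suggested fix
    private int timesSent = 0;              // for demonstration only

    // Invoked by read() when it detects end-of-stream on the block.
    void checksumOk() {
        if (sentChecksumOk) {
            return; // redundant call: this block was already acknowledged
        }
        sentChecksumOk = true;
        timesSent++;
        // Real code would send the checksum-ok status to the datanode here,
        // still swallowing any IOException since the ack is best-effort.
    }

    int timesSent() {
        return timesSent;
    }

    public static void main(String[] args) {
        BlockReaderSketch reader = new BlockReaderSketch();
        reader.checksumOk(); // first read reaches end-of-stream
        reader.checksumOk(); // application re-reads the still-open file
        System.out.println(reader.timesSent()); // prints 1
    }
}
```

Under this sketch, the second read path from the stack trace above would hit the early return rather than the "Checksum ok was sent and should not be sent again" exception.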