Date: Fri, 18 Jul 2014 04:52:04 +0000 (UTC)
From: "Allen Wittenauer (JIRA)"
To: common-dev@hadoop.apache.org
Subject: [jira] [Resolved] (HADOOP-145) io.skip.checksum.errors property clashes with LocalFileSystem#reportChecksumFailure

[ https://issues.apache.org/jira/browse/HADOOP-145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer resolved HADOOP-145.
-------------------------------------
    Resolution: Fixed

Long ago.
> io.skip.checksum.errors property clashes with LocalFileSystem#reportChecksumFailure
> -----------------------------------------------------------------------------------
>
>                 Key: HADOOP-145
>                 URL: https://issues.apache.org/jira/browse/HADOOP-145
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>            Reporter: stack
>            Assignee: Owen O'Malley
>
> Below is from email to the dev list on Tue, 11 Apr 2006 14:46:09 -0700.
> Checksum errors seem to be a fact of life given the hardware we use. They often cause my jobs to fail, so I have been trying to figure out how to skip just the bad records and files. At the end is a note where Stefan pointed me at 'io.skip.checksum.errors'. This property, when set, triggers special handling of checksum errors inside SequenceFile$Reader: on a checksum error, try to skip to the next record. However, this behavior can conflict with another checksum handler that moves the problematic file aside whenever a checksum error is found. Below is from a recent log.
> 060411 202203 task_r_22esh3 Moving bad file /2/hadoop/tmp/task_r_22esh3/task_m_e3chga.out to /2/bad_files/task_m_e3chga.out.1707416716
> 060411 202203 task_r_22esh3 Bad checksum at 3578152. Skipping entries.
> 060411 202203 task_r_22esh3 Error running child
> 060411 202203 task_r_22esh3 java.nio.channels.ClosedChannelException
> 060411 202203 task_r_22esh3 at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:89)
> 060411 202203 task_r_22esh3 at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:276)
> 060411 202203 task_r_22esh3 at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileInputStream.seek(LocalFileSystem.java:79)
> 060411 202203 task_r_22esh3 at org.apache.hadoop.fs.FSDataInputStream$Checker.seek(FSDataInputStream.java:67)
> 060411 202203 task_r_22esh3 at org.apache.hadoop.fs.FSDataInputStream$PositionCache.seek(FSDataInputStream.java:164)
> 060411 202203 task_r_22esh3 at org.apache.hadoop.fs.FSDataInputStream$Buffer.seek(FSDataInputStream.java:193)
> 060411 202203 task_r_22esh3 at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:243)
> 060411 202203 task_r_22esh3 at org.apache.hadoop.io.SequenceFile$Reader.seek(SequenceFile.java:420)
> 060411 202203 task_r_22esh3 at org.apache.hadoop.io.SequenceFile$Reader.sync(SequenceFile.java:431)
> 060411 202203 task_r_22esh3 at org.apache.hadoop.io.SequenceFile$Reader.handleChecksumException(SequenceFile.java:412)
> 060411 202203 task_r_22esh3 at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:389)
> 060411 202203 task_r_22esh3 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:209)
> 060411 202203 task_r_22esh3 at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:709)
> (Ignore the line numbers: my code differs a little from main because I have other debugging code inside SequenceFile. Otherwise I'm running with the head of hadoop.)
> SequenceFile$Reader#handleChecksumException is trying to skip to the next record, but the file has already been closed by the move-aside.
> On the list there has been some discussion on the merit of moving the file aside when a bad checksum is found.
> I've been trying to test what happens if we leave the file in place, but I haven't had a checksum error in a while.
> Opening this issue so there is a place to record experience as we go.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
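[Editor's note: the crux of the clash can be reproduced in miniature with plain java.nio, independent of Hadoop. Once a FileChannel is closed, any later position() call throws ClosedChannelException, which is exactly what handleChecksumException runs into after reportChecksumFailure's move-aside has closed the underlying stream. A standalone sketch, not Hadoop code; the class name, file name, and offset are illustrative.]

```java
import java.io.IOException;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ClosedChannelDemo {
    /** Returns true if seeking on a closed channel throws ClosedChannelException. */
    static boolean seekAfterClose() throws IOException {
        Path tmp = Files.createTempFile("task_m_demo", ".out");
        FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ);
        // reportChecksumFailure's move-aside closes the underlying stream...
        ch.close();
        try {
            // ...then handleChecksumException tries to seek past the bad record
            // (3578152 is the offset from the log above, replayed here).
            ch.position(3578152L);
            return false;
        } catch (ClosedChannelException expected) {
            return true; // the same failure seen in the stack trace
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("seek after close throws: " + seekAfterClose());
    }
}
```

Keeping the file in place (as the reporter suggests testing) would avoid the close, letting the sync-to-next-record path run to completion.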