From: "Chris Douglas (JIRA)"
To: core-dev@hadoop.apache.org
Date: Sun, 27 Apr 2008 17:04:57 -0700 (PDT)
Subject: [jira] Commented: (HADOOP-3144) better fault tolerance for corrupted text files

    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592703#action_12592703 ]

Chris Douglas commented on HADOOP-3144:
---------------------------------------

bq. one of the founding principles of map-reduce as described in the google paper (and perhaps one of the most remarkable differences with general database systems) was the notion of being tolerant of bad data. if u see a few rows of bad data - skip it.

It's been a while since I read the paper, but isn't recovery effected by skipping the record that caused a failure on the map (HADOOP-153)? Recovery from corrupted data without re-executing the map sounds like a solution for a less generic format than LineRecordReader. Detecting and failing/discarding a map because its output is corrupt is application code, I agree, and it looks like Zheng has a very reasonable, general workaround (more below).

Given the re-execution model, the "correct" and more general fix would be to fail the map (with an OOM exception) and skip the range that had already been read. If the map read into the following split, that range need not be rescheduled, because we know that another task has already scanned up to the next record boundary (or failed trying). If one wants to fail the task earlier, specifying a "SafeTextInputFormat" isn't a terrible burden, but you have a point: a property that controls special cases for TextInputFormat is more usable. Without HADOOP-153, the point is moot, and perhaps this fix is more pressing as a consequence.

bq. Zheng's fix does skip to the next available record (if it falls within the split). Otherwise an EOF is returned.

That's not a full description of what it does, though. I took a closer look, and it doesn't do what I had assumed, i.e., define both a maximum line length and force a hard limit on reading into the following split (which is why the archive format didn't seem like a non sequitur). It introduces a single new property that sets the maximum line length, which prevents the situation in this JIRA by terminating the record reader once it is past the end of the split, having consumed the maximum line length. Since it takes the maximum of what remains in the split and the configured length as the limit, the situation I asked about (i.e., returning the trailing part of a record as a single record) doesn't occur. Since it defaults to Long.MAX_VALUE, there's no issue with existing code. That's all I was trying to determine. The API change (changing the return type of readLine from {{int}} to {{long}}) makes more sense in this context, but it still seems unnecessary.
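To make sure we're describing the same mechanism, here is a rough sketch of the bounding logic as I read it. This is not the patch itself; the class and names ({{BoundedLineReader}}, {{maxLineLength}}, {{splitEnd}}) are simplified stand-ins for the LineRecordReader changes:

{code:java}
import java.io.IOException;
import java.io.InputStream;

// Illustrative sketch only, not the actual patch; names are simplified
// stand-ins for the LineRecordReader changes discussed in this issue.
public class BoundedLineReader {
  private final InputStream in;
  private final long maxLineLength; // defaults to Long.MAX_VALUE, so existing jobs are unaffected
  private long pos;                 // absolute position in the file

  public BoundedLineReader(InputStream in, long start, long maxLineLength) {
    this.in = in;
    this.pos = start;
    this.maxLineLength = maxLineLength;
  }

  // Reads one line, consuming at most 'limit' bytes; returns the number of
  // bytes consumed. The count can exceed Integer.MAX_VALUE, hence long, not int.
  public long readLine(StringBuilder line, long limit) throws IOException {
    long consumed = 0;
    int b;
    while (consumed < limit && (b = in.read()) != -1) {
      consumed++;
      if (b == '\n') {
        break; // found the record boundary
      }
      line.append((char) b);
    }
    pos += consumed;
    return consumed;
  }

  // One step of the record-reader loop: terminate once we are past the end
  // of the split, having consumed up to the maximum line length.
  public boolean next(StringBuilder value, long splitEnd) throws IOException {
    if (pos >= splitEnd) {
      return false; // past the split; the next task owns this range
    }
    // The limit is the larger of what remains in the split and maxLineLength,
    // so a record straddling the split boundary is still read whole (up to
    // the cap) and the trailing part of a record is never returned alone.
    long limit = Math.max(splitEnd - pos, maxLineLength);
    value.setLength(0);
    return readLine(value, limit) > 0;
  }
}
{code}

Under this reading, a corrupt "line" longer than the cap simply exhausts its limit; the position moves past the end of the split, and the reader terminates on the next call rather than buffering an unbounded line in memory.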
> better fault tolerance for corrupted text files
> -----------------------------------------------
>
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
>
>
> Every once in a while, we encounter corrupted text files (corrupted at the source, prior to being copied into Hadoop). Inevitably, some of the data looks like a really, really long line, and Hadoop trips over trying to stuff it into an in-memory object and gets an out-of-memory error. The code looks the same in trunk as well.
> So we are looking for an option to TextInputFormat (and the like) to ignore long lines; ideally, we would just skip errant lines above a certain size limit.
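For reference, the job-level knob being asked for would look something like the sketch below. This is a hypothetical usage example; the property name {{mapred.linerecordreader.maxlength}} is assumed from the patch discussion, not a confirmed part of the API:

{code:java}
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical usage sketch: cap the bytes read per line so corrupt
// "lines" are bounded. The property name is assumed, not confirmed.
public class CappedLineJob {
  public static JobConf configure(Class<?> jobClass) {
    JobConf conf = new JobConf(jobClass);
    conf.setInputFormat(TextInputFormat.class);
    // Consume at most 1 MB per line; longer lines terminate the reader
    // as described in the comment above instead of triggering an OOM.
    conf.setLong("mapred.linerecordreader.maxlength", 1024 * 1024);
    return conf;
  }
}
{code}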