hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3144) better fault tolerance for corrupted text files
Date Mon, 28 Apr 2008 00:04:57 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592703#action_12592703 ]

Chris Douglas commented on HADOOP-3144:

bq. one of the founding principles of map-reduce as described in the google paper (and perhaps
one of the most remarkable differences with general database systems) was the notion of being
tolerant of bad data. if u see a few rows of bad data - skip it.

It's been a while since I read the paper, but isn't recovery effected by skipping the record
that caused a failure on the map (HADOOP-153)? Recovery from corrupted data without re-executing
the map sounds like a solution for a less generic format than LineRecordReader; detecting
and failing/discarding a map because its output is corrupt is application code, I agree, and
it looks like Zheng has a very reasonable, general workaround (more below).

Given the re-execution model, the "correct" and more general fix would be to fail the map
(with an OOM exception) and skip the range that had already been read. If it read into the
following split, then it need not be rescheduled, because we know that another task had already
scanned up to the next record boundary (or failed trying). If one wants to fail the task earlier,
then specifying a "SafeTextInputFormat" isn't a terrible burden, but you have a point: a property
that controls special cases for TextInputFormat is more usable. Without HADOOP-153, the point
is moot, and perhaps this fix is more pressing as a consequence.

bq. Zheng's fix does skip to the next available record (if it falls within the split). Otherwise
an EOF is returned.

That's not a full description of what it does, though. I took a closer look, and it doesn't
do what I had assumed, i.e., define both a maximum line length and enforce a hard limit on reading
into the following split (which is why the archive format didn't seem like a non sequitur).
It defines a single new property that defines the maximum line length, which prevents the
situation in this JIRA by terminating the record reader if it's past the end of the split,
having consumed the maximum line length. Since it takes the maximum of what remains in the
split and the aforementioned length as the limit, the situation I asked after (i.e. returning
the trailing part of a record as a single record) doesn't occur. Since it defaults to Long.MAX_VALUE,
there's no issue with existing code. That's all I was trying to determine. The API change
(changing the return type of readLine from {{int}} to {{long}}) makes more sense in this context,
but it still seems unnecessary.
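To illustrate the idea (this is a simplified sketch, not the actual patch; the class and method names below are my own, and the byte-at-a-time stream reading is a deliberate simplification of LineRecordReader's buffered I/O): a reader that caps line length consumes, but does not buffer, bytes past the cap, so a corrupted "really, really long line" can't exhaust memory:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a length-capped line reader.
public class CappedLineReader {
    // Reads one '\n'-terminated line, keeping at most maxLineLength bytes.
    // Bytes beyond the cap are consumed and discarded rather than buffered.
    // totalConsumed[0] reports how many bytes were read, including the
    // discarded tail, so a caller can decide to drop over-long records.
    // Returns null at end of stream.
    static String readLine(InputStream in, int maxLineLength, long[] totalConsumed)
            throws IOException {
        StringBuilder sb = new StringBuilder();
        long consumed = 0;
        int b = -1;
        while ((b = in.read()) != -1) {
            consumed++;
            if (b == '\n') break;
            if (sb.length() < maxLineLength) {
                sb.append((char) b);   // keep bytes only up to the cap
            }                          // past the cap: consume and discard
        }
        totalConsumed[0] = consumed;
        return (consumed == 0 && b == -1) ? null : sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "short\nAAAAAAAAAAAAAAAAAAAA\nok\n"
                .getBytes(StandardCharsets.US_ASCII);
        InputStream in = new ByteArrayInputStream(data);
        long[] n = new long[1];
        System.out.println(readLine(in, 10, n)); // "short"
        System.out.println(readLine(in, 10, n)); // "AAAAAAAAAA" (20 A's, truncated)
        System.out.println(readLine(in, 10, n)); // "ok"
    }
}
```

A record reader built on this could skip any record whose consumed byte count exceeds the cap, which is the "skip errant lines above a certain size limit" behavior the issue asks for, and a Long.MAX_VALUE default would leave existing code untouched.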

> better fault tolerance for corrupted text files
> -----------------------------------------------
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
> every once in a while we encounter corrupted text files (corrupted at the source, prior
to copying into Hadoop). Inevitably, some of the data looks like a really, really long line,
and Hadoop trips over trying to stuff it into an in-memory object and gets an out-of-memory error.
The code looks the same in trunk as well.
> so we're looking for an option on TextInputFormat (and the like) to ignore long lines; ideally,
we would just skip errant lines above a certain size limit.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
