hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3144) better fault tolerance for corrupted text files
Date Sat, 26 Apr 2008 09:52:55 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592560#action_12592560 ]

Chris Douglas commented on HADOOP-3144:

I was going by Zheng's last comment, i.e. "This is actually what was happening on our cluster
- for binary file that a user mistakenly treats as a text file." I didn't mean anything by it.

Assuming we're discussing the third bullet in the last two comments:

If one is loading corrupted data into HDFS, then I don't think it's fair to assume that the
most generic of text readers can do anything with it. I mention the archive format because
it seems unavoidable that opening a file in an archive will return a bounded stream within a
large, composite file, i.e. be agnostic to the particular InputFormat employed, but act a
lot like a bounded FileSplit. If that's the sort of thing you could use to deal with sketchy
data, then it seemed a useful issue to monitor. Alternatively, a new InputFormat that
generated bounded splits for Text files to recover from this condition might work for your
case, and probably for others', if you felt like contributing it.

The insurance analogy doesn't seem to describe this error. It's not like a car accident; it's
like filling one's gas tank with creamed corn. Though the driver had every reason to believe
it was gasoline, and is understandably angry that his engine is full of creamed corn, anger
at the car for failing to run on the creamed corn is misspent. Though I like the idea in
general, i.e. skipping unexpectedly long lines, or even just truncating records, my original
question was trying to determine whether it skipped to the next record, continued reading
bytes into the next record from wherever it stopped, or quit entirely on extremely long
lines. At a glance, it looked like it continued reading from wherever it left off in the
stream, but I haven't looked at it as closely as the contributor has, and wanted to ask
after its behavior. I'm still curious how, exactly, this patch effects its solution.
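For concreteness, the first of those three behaviors (discard the over-long line and resume at the next record boundary) could be sketched roughly as below. This is plain Java, not Hadoop's actual LineRecordReader API, and it is not what the attached patch necessarily does; the class name BoundedLineReader and the maxLineLength parameter are illustrative assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Illustrative sketch: a length-capped line reader. When a line exceeds
 * maxLineLength, it keeps consuming bytes up to the next '\n' and drops the
 * line, so the NEXT read starts at a real record boundary rather than
 * continuing mid-line and producing a garbage record.
 */
public class BoundedLineReader {
    private final InputStream in;
    private final int maxLineLength;

    public BoundedLineReader(InputStream in, int maxLineLength) {
        this.in = in;
        this.maxLineLength = maxLineLength;
    }

    /** Returns the next line, or null at EOF. Over-long lines are skipped entirely. */
    public String readLine() throws IOException {
        while (true) {
            StringBuilder sb = new StringBuilder();
            boolean tooLong = false;
            int b;
            while ((b = in.read()) != -1 && b != '\n') {
                if (sb.length() < maxLineLength) {
                    sb.append((char) b);
                } else {
                    tooLong = true; // stop buffering, but keep consuming to '\n'
                }
            }
            if (b == -1 && sb.length() == 0) {
                return null; // clean EOF
            }
            if (!tooLong) {
                return sb.toString();
            }
            if (b == -1) {
                return null; // over-long final line with no terminator; nothing left
            }
            // Line exceeded the cap: it has been consumed up to '\n'; try the next one.
        }
    }
}
```

The alternative behaviors the question distinguishes would differ only in the over-long branch: truncating would return the first maxLineLength bytes instead of looping, and continuing mid-stream would return without consuming to the newline.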

> better fault tolerance for corrupted text files
> -----------------------------------------------
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
> every once in a while - we encounter corrupted text files (corrupted at source prior
> to copying into hadoop). inevitably - some of the data looks like a really, really long
> line, and hadoop trips over trying to stuff it into an in-memory object and gets an
> out-of-memory error. The code looks the same way in trunk as well.
> so looking for an option to the textinputformat (and the like) to ignore long lines.
> ideally - we would just skip errant lines above a certain size limit.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
