hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3144) better fault tolerance for corrupted text files
Date Sat, 26 Apr 2008 02:35:57 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592550#action_12592550 ]

Chris Douglas commented on HADOOP-3144:

bq. That it used to be sufficient does not mean it will be sufficient in the future - that's
why we have open64. The cost of using a long instead of an int is minimal, while we avoid
potential overflow problems.

True, but it's accumulating bytes read from a text file into memory for a single record. It's
not at all obvious to me that this requires a long. Future-proofing a case that will be a
total disaster for the rest of the framework seems premature, particularly when the change
is to a generic text parser. If someone truly needs to slurp >2GB of text data _per record_,
surely their requirements justify a less general RecordReader. It's not the cost of the int
that concerns me, but the API change to support a case that is not only degenerate but
implausible.
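
To make that concrete, here is a hypothetical sketch (not the code under review; LineScanner and its readLine are invented for illustration): the byte count for a single line fits in an int, and only the position within the file needs a long.

{code}
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.io.Text;

public class LineScanner {

  /** Appends one line to 'line' and returns the number of bytes consumed. */
  static int readLine(InputStream in, Text line) throws IOException {
    line.clear();
    int bytesConsumed = 0;        // per-record byte count: an int is enough
    byte[] one = new byte[1];
    int b;
    while ((b = in.read()) != -1) {
      bytesConsumed++;
      if (b == '\n') {
        break;
      }
      one[0] = (byte) b;
      line.append(one, 0, 1);
    }
    return bytesConsumed;
  }

  // The absolute position within the file, by contrast, genuinely needs a long:
  //   long pos = start;
  //   pos += readLine(in, value);
}
{code}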

bq. The reason for "maxBytesToConsume" is to tell readLine the end of this block - there is
no reason for readLine to go through tens of gigs of data searching for an end of line when
the current block is only 128MB.

A far more portable solution for what this expresses would be an InputFormat generating a
subclass of FileSplit annotated with a hard limit enforced by the RecordReader (i.e., one that
returns EOF at some position within the file). Some of this will inevitably be done as part of
the Hadoop archive work (HADOOP-3307). As a workaround, don't point text readers at binary data.
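
Very roughly, such a split could look like the sketch below. BoundedFileSplit, getHardLimit, and the particular FileSplit constructor called here are assumptions for illustration, not existing Hadoop classes or APIs.

{code}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileSplit;

// Rough sketch only: a FileSplit subclass carrying a hard byte limit that the
// RecordReader would treat as EOF. Writable serialization of the extra field
// is omitted, and the FileSplit constructor signature is assumed here.
public class BoundedFileSplit extends FileSplit {
  private final long hardLimit;   // absolute offset at which the reader reports EOF

  public BoundedFileSplit(Path file, long start, long length,
                          String[] hosts, long hardLimit) {
    super(file, start, length, hosts);
    this.hardLimit = hardLimit;
  }

  public long getHardLimit() {
    return hardLimit;
  }
}
{code}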

> better fault tolerance for corrupted text files
> -----------------------------------------------
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
> Every once in a while we encounter corrupted text files (corrupted at the source, prior
to copying into Hadoop). Inevitably, some of the data looks like a really, really long line,
and Hadoop trips over trying to stuff it into an in-memory object and gets an out-of-memory
error. The code looks the same way in trunk as well.
> So we are looking for an option to TextInputFormat (and the like) to ignore long lines.
Ideally, we would just skip errant lines above a certain size limit.
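
For what it's worth, the requested behaviour can be sketched without any framework support at all (illustrative only, not the attached patch; names like maxLineLength are invented, and the sketch treats input as single-byte characters): consume and discard any line over a configured cap instead of buffering it.

{code}
import java.io.IOException;
import java.io.InputStream;

// Illustrative sketch only: returns one "record" per call, discarding any
// line longer than maxLineLength instead of buffering it (the buffering is
// what causes the OutOfMemoryError on corrupt data).
public class LongLineSkippingReader {
  private final InputStream in;
  private final int maxLineLength;     // assumed, configurable cap

  public LongLineSkippingReader(InputStream in, int maxLineLength) {
    this.in = in;
    this.maxLineLength = maxLineLength;
  }

  /** Returns the next line no longer than the cap, or null at EOF. */
  public String next() throws IOException {
    StringBuilder line = new StringBuilder();
    boolean tooLong = false;
    int b;
    while ((b = in.read()) != -1) {
      if (b == '\n') {
        if (!tooLong) {
          return line.toString();
        }
        // Finished skipping an oversized line; start over on the next one.
        line.setLength(0);
        tooLong = false;
        continue;
      }
      if (!tooLong) {
        line.append((char) b);
        if (line.length() > maxLineLength) {
          tooLong = true;              // stop buffering, keep consuming bytes
          line.setLength(0);
        }
      }
    }
    return (line.length() > 0 && !tooLong) ? line.toString() : null;
  }
}
{code}

The point is that an oversized line costs I/O but never memory, since it is never materialized.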

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
