hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3144) better fault tolerance for corrupted text files
Date Fri, 25 Apr 2008 23:01:55 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592526#action_12592526 ]

Chris Douglas commented on HADOOP-3144:

* Do these limits really need to be longs? Changing the public API of readLine seems unnecessary
when an int should be, and has been, sufficient.
* There is some odd spacing around LineRecordReader::157,268 that makes it difficult to tell
which block the closing brace belongs to.
* I'm not sure I understand the skip logic. For the case where a line is larger than 64k (the
buffer size), it looks like this reads up to a threshold, then discards input that exceeds
what was requested, then returns the next record as the segment between the threshold point
and the following newline (i.e. the trailing bytes of the too-long record). Is this accurate?
Instead of returning an arbitrary segment of a record, wouldn't it be preferable to discard input
until the next record boundary is found?
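
To make the last suggestion concrete, here is a minimal sketch of "discard until the next record boundary". It is not the code in the attached patches and does not use Hadoop's LineRecordReader; the class SkipLongLines, the method readLines, and the parameter maxLineLength are hypothetical names chosen for illustration, and it assumes '\n' is the only record delimiter and that the limit is counted in bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Hadoop or of the HADOOP-3144 patches.
final class SkipLongLines {

    /**
     * Returns every line of at most maxLineLength bytes. A longer line is
     * dropped as a whole: once the limit is exceeded, input is discarded up to
     * and including the next '\n', so no trailing fragment of the long record
     * is ever returned as a record of its own.
     */
    static List<String> readLines(InputStream in, int maxLineLength) throws IOException {
        List<String> lines = new ArrayList<>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        boolean tooLong = false;  // true while discarding the rest of an overlong line
        boolean sawByte = false;  // true if the current line has any content
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                // Record boundary: emit the line only if it stayed within the limit.
                if (!tooLong) {
                    lines.add(new String(current.toByteArray(), StandardCharsets.UTF_8));
                }
                current.reset();
                tooLong = false;
                sawByte = false;
            } else {
                sawByte = true;
                if (!tooLong) {
                    if (current.size() < maxLineLength) {
                        current.write(b);
                    } else {
                        // Limit exceeded: free the buffer and skip to the next boundary.
                        tooLong = true;
                        current.reset();
                    }
                }
            }
        }
        // A final line without a trailing newline is still a record.
        if (sawByte && !tooLong) {
            lines.add(new String(current.toByteArray(), StandardCharsets.UTF_8));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "ab\nxxxxxxxxxx\ncd\nlast".getBytes(StandardCharsets.UTF_8);
        System.out.println(readLines(new ByteArrayInputStream(data), 4));
    }
}
```

In this sketch the buffer never holds more than maxLineLength bytes, so a corrupted, very long line cannot exhaust memory, and an int limit is enough.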

> better fault tolerance for corrupted text files
> -----------------------------------------------
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
> every once in a while - we encounter corrupted text files (corrupted at source prior
> to copying into hadoop). inevitably - some of the data looks like a really really long line
> and hadoop trips over trying to stuff it into an in memory object and gets outofmem error.
> Code looks same way in trunk as well ..
> so looking for an option to the textinputformat (and like) to ignore long lines. ideally
> - we would just skip errant lines above a certain size limit.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
