hadoop-common-dev mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3144) better fault tolerance for corrupted text files
Date Sat, 26 Apr 2008 01:43:56 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592539#action_12592539 ]

Zheng Shao commented on HADOOP-3144:

* That it used to be sufficient does not mean it will be sufficient in the future - that's
why we have open64. The cost of using a long instead of an int is minimal, while we avoid
potential overflow problems. The only interesting use of this return value is accumulating
the number of bytes read, which definitely should be stored in a long. So I don't see a problem.
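The overflow concern can be made concrete with a small example (the numbers here are hypothetical, just to show the wraparound, not taken from the issue):

```java
public class OverflowDemo {
    public static void main(String[] args) {
        // Suppose each readLine call consumes ~64KB and we accumulate
        // the total bytes read across a multi-gigabyte file.
        int bytesPerLine = 64 * 1024;       // 64KB per call
        long lines = 40_000L;               // ~2.6GB in total

        int intTotal = 0;
        long longTotal = 0L;
        for (long i = 0; i < lines; i++) {
            intTotal += bytesPerLine;       // silently wraps past 2^31 - 1
            longTotal += bytesPerLine;      // stays correct
        }
        System.out.println(intTotal);       // -1673527296: the int wrapped
        System.out.println(longTotal);      // 2621440000
    }
}
```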

* I will fix the spacing problem when we get a consensus on other problems.

* The skip logic is to skip the whole long line - not just "maxLineLength" bytes.
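A minimal sketch of that skipping behaviour (the class and method names below are illustrative, not the actual LineRecordReader code): read the line in bounded chunks and keep discarding until the newline itself has been consumed, so the reader ends up at the start of the next record.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class SkipLongLine {

    /** Reads at most maxChunk bytes of the current line into out.
     *  Sets eol[0] when the terminating newline was consumed.
     *  Returns the number of bytes consumed from the stream. */
    static int readChunk(InputStream in, StringBuilder out, int maxChunk,
                         boolean[] eol) throws IOException {
        eol[0] = false;
        int consumed = 0;
        int b;
        while (consumed < maxChunk && (b = in.read()) != -1) {
            consumed++;
            if (b == '\n') {              // newline consumed: line finished
                eol[0] = true;
                break;
            }
            out.append((char) b);
        }
        return consumed;
    }

    public static void main(String[] args) throws IOException {
        // A corrupted 50-byte "line" followed by a normal record.
        String data = "X".repeat(50) + "\n" + "ok\n";
        InputStream in = new ByteArrayInputStream(
                data.getBytes(StandardCharsets.US_ASCII));
        int maxLineLength = 10;

        // Skip the WHOLE long line, not just maxLineLength bytes: keep
        // consuming chunks until the newline itself has been read.
        boolean[] eol = new boolean[1];
        StringBuilder scratch = new StringBuilder();
        int n;
        do {
            scratch.setLength(0);
            n = readChunk(in, scratch, maxLineLength, eol);
        } while (!eol[0] && n > 0);

        // The reader now sits at the start of the next record.
        StringBuilder next = new StringBuilder();
        readChunk(in, next, maxLineLength, eol);
        System.out.println(next);         // prints "ok"
    }
}
```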

The reason for "maxBytesToConsume" is to tell readLine where this block ends - there is
no reason for readLine to go through tens of gigabytes of data searching for an end-of-line
when the current block is only 128MB. This is actually what happened on our cluster with a
binary file that a user mistakenly treated as a text file: all the map tasks swamped the cluster.
The only use of maxBytesToConsume is to let readLine know when to stop. What would be the
best way to fix this?
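A sketch of what the bound buys us (the names here are illustrative, not the actual Hadoop LineReader API): readLine stops after maxBytesToConsume bytes even when no newline has been found, so a newline-free binary file cannot drag the scan far past the split boundary.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BoundedReadLine {

    /** Reads one line but never consumes more than maxBytesToConsume
     *  bytes from the stream, even if no newline has been found.
     *  Without the bound, a binary file containing no '\n' bytes
     *  would be scanned all the way to the end of the file. */
    static long readLine(InputStream in, StringBuilder out,
                         long maxBytesToConsume) throws IOException {
        long consumed = 0;
        int b;
        while (consumed < maxBytesToConsume && (b = in.read()) != -1) {
            consumed++;
            if (b == '\n') break;         // found the record boundary
            out.append((char) b);
        }
        return consumed;                  // bounded by the split size
    }

    public static void main(String[] args) throws IOException {
        // Binary-ish data with no newline at all (stand-in for a 128MB block).
        byte[] junk = new byte[1_000_000];
        InputStream in = new ByteArrayInputStream(junk);

        long maxBytesToConsume = 4096;    // e.g. bytes left in the split
        StringBuilder line = new StringBuilder();
        long n = readLine(in, line, maxBytesToConsume);
        System.out.println(n);            // 4096: the scan stopped at the bound
    }
}
```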

> better fault tolerance for corrupted text files
> -----------------------------------------------
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
> every once in a while we encounter corrupted text files (corrupted at the source, prior
> to copying into hadoop). inevitably, some of the data looks like a really, really long line,
> and hadoop trips over trying to stuff it into an in-memory object and gets an out-of-memory error.
> The code looks the same in trunk as well ...
> so we are looking for an option to the textinputformat (and the like) to ignore long lines. ideally,
> we would just skip errant lines above a certain size limit.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
