hadoop-common-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3144) better fault tolerance for corrupted text files
Date Sat, 26 Apr 2008 06:55:55 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592554#action_12592554 ]

Joydeep Sen Sarma commented on HADOOP-3144:

We did not point the text reader at a binary file; we had a corrupted text file filled with
a long section of junk.

Given that this is a problem that can happen to anyone (we just happen to be the lucky first),
and everyone uses TextInputFormat to read text files, why shouldn't the safeguard be built
into TextInputFormat itself? What's the downside? Does it make sense to buy insurance only
after an accident? Do we wait for people to hit such a problem and then say, "oh, but you
should have used 'SafeTextInputFormat'"?
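
To make the argument concrete, here is a minimal sketch of the safeguard being proposed. It
assumes plain java.io streams, '\n' delimiters, and a hypothetical maxLineLength cap; the
class name and parameter are made up for illustration, not taken from Hadoop:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;

    // Illustration only: bound how many bytes a single "line" may occupy in
    // memory, and skip (rather than buffer) anything longer.
    public class BoundedLineReader {
      private final InputStream in;
      private final int maxLineLength; // hypothetical cap, e.g. 1 MB

      public BoundedLineReader(InputStream in, int maxLineLength) {
        this.in = in;
        this.maxLineLength = maxLineLength;
      }

      // Returns the next line of at most maxLineLength bytes, or null at EOF.
      public String readLine() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
          if (b == '\n') {
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
          }
          if (buf.size() >= maxLineLength) {
            // Over-long (presumably corrupted) line: drain to the next newline
            // instead of buffering it, so one bad record cannot take down the
            // whole task with an out-of-memory error.
            while ((b = in.read()) != -1 && b != '\n') { /* discard junk */ }
            buf.reset(); // drop the partial line and move on
          } else {
            buf.write(b);
          }
        }
        return buf.size() > 0 ? new String(buf.toByteArray(), StandardCharsets.UTF_8) : null;
      }
    }

The real change would presumably live inside the record reader behind TextInputFormat and bump
a counter for each skipped record, but the core loop is the same: cap, drain, continue.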

Portable across what? I looked at 3307 earlier today, and I don't see how it is remotely
related. Enlighten us.

I am sorry, but I am mildly irritated by the comments here. We are aware of the concept of
subclassing, and we can write our own InputFormat, thank you very much. The whole point of
going through this procedure is to contribute back to the community something that is of
general benefit. Either the argument is that this is not of general benefit, or that the cost
outweighs the benefit. Neither argument has been made.

> better fault tolerance for corrupted text files
> -----------------------------------------------
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
> Every once in a while we encounter corrupted text files (corrupted at the source, prior to
> copying into Hadoop). Inevitably, some of the data looks like a really, really long line,
> and Hadoop trips up trying to stuff it into an in-memory object and gets an out-of-memory
> error. The code looks the same in trunk as well.
> So we are looking for an option for TextInputFormat (and the like) to ignore long lines;
> ideally, we would just skip errant lines above a certain size limit.
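
For what it's worth, the option the description asks for could be wired into a job roughly as
below. This is a sketch against the org.apache.hadoop.mapred API; the property name
"mapred.linerecordreader.maxlength" is an assumption for illustration, not necessarily what
the attached patches use:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class LongLineJobSetup {
      // Sketch only: wire a hypothetical max-line-length limit into a job so
      // that over-long (corrupted) records are skipped instead of buffered.
      public static JobConf configure(Class<?> jobClass) {
        JobConf conf = new JobConf(jobClass);
        conf.setInputFormat(TextInputFormat.class);
        // Assumed property name: records longer than 1 MB would be skipped
        // (and ideally counted) rather than loaded into memory.
        conf.setInt("mapred.linerecordreader.maxlength", 1024 * 1024);
        return conf;
      }
    }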

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
