hadoop-common-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3144) better fault tolerance for corrupted text files
Date Sat, 26 Apr 2008 16:17:57 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592583#action_12592583 ]

Joydeep Sen Sarma commented on HADOOP-3144:
-------------------------------------------

One of the founding principles of map-reduce as described in the Google paper (and perhaps
one of its most remarkable differences from general database systems) was tolerance of bad
data: if you see a few rows of bad data, skip them.

We try to do this at the application level as much as possible. However, there is nothing
the application can do if Hadoop itself throws an out-of-memory error, so this fix belongs in
Hadoop core. Zheng's fix does skip to the next available record (if it falls within the split);
otherwise an EOF is returned.
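
To make the skipping behaviour concrete, here is a minimal sketch using plain java.io (the
class and method names are hypothetical, and this is not the actual patch): once a line exceeds
a configured maximum, its bytes are drained up to the next newline and the record is dropped,
so a corrupted, arbitrarily long "line" never has to be materialized in memory.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only: bound the per-line buffer and skip to the
// next newline when a line is too long, instead of buffering it entirely.
public class BoundedLineReader {

    private final InputStream in;
    private final int maxLineLength;

    public BoundedLineReader(InputStream in, int maxLineLength) {
        this.in = in;
        this.maxLineLength = maxLineLength;
    }

    /**
     * Returns the next line, or null at end of stream. Lines longer than
     * maxLineLength are discarded entirely and reading resumes after the
     * following newline, mirroring the "skip errant lines" idea.
     */
    public String readLine() throws IOException {
        StringBuilder buf = new StringBuilder();
        boolean overLimit = false;
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                if (overLimit) {
                    // Oversized record consumed; start over on the next line.
                    buf.setLength(0);
                    overLimit = false;
                    continue;
                }
                return buf.toString();
            }
            if (!overLimit) {
                if (buf.length() >= maxLineLength) {
                    overLimit = true;  // stop buffering, keep draining bytes
                    buf.setLength(0);
                } else {
                    buf.append((char) b);
                }
            }
        }
        // End of stream: return the last partial line unless it was oversized.
        return (!overLimit && buf.length() > 0) ? buf.toString() : null;
    }

    public static void main(String[] args) throws IOException {
        StringBuilder data = new StringBuilder("good line\n");
        for (int i = 0; i < 1000; i++) {
            data.append('x');                       // simulate a corrupted, very long line
        }
        data.append("\nanother good line\n");
        BoundedLineReader reader =
            new BoundedLineReader(new ByteArrayInputStream(data.toString().getBytes()), 100);
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);               // prints only the two good lines
        }
    }
}

Presumably the real change lives in or near the record reader used by TextInputFormat; the
sketch is only meant to show the bound-and-skip control flow (including hitting end of input
once the split is exhausted).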

3307 is beside the point here: it's a solution for small files. If the file were small, we
wouldn't have a problem to begin with (as you say, the input is bounded); this problem only
affects large files. If you read the description of 3307 carefully, you will notice it says it
has no impact on map-reduce. The problem we are trying to solve is a map-reduce problem, and it
applies whether the file comes from an archive (3307), from the local file system, or from HDFS
(or any other file system for that matter).

> better fault tolerance for corrupted text files
> -----------------------------------------------
>
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
>
>
> Every once in a while we encounter corrupted text files (corrupted at the source, prior to
> being copied into Hadoop). Inevitably, some of the data looks like a really, really long line,
> and Hadoop trips over trying to stuff it into an in-memory object and gets an out-of-memory
> error. The code looks the same in trunk as well.
> So we are looking for an option to TextInputFormat (and the like) to ignore long lines.
> Ideally, we would just skip errant lines above a certain size limit.
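
As an illustration of the kind of option being asked for, job-side usage might look something
like this; the configuration key "mapred.linerecordreader.maxlength" and the 10 MB limit are
assumptions for illustration, not necessarily what the patch exposes.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical usage sketch: configure a cap on line length so that anything
// longer is skipped instead of being buffered in memory.
public class SkipLongLinesJob {
    public static void main(String[] args) {
        JobConf conf = new JobConf(SkipLongLinesJob.class);
        conf.setInputFormat(TextInputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path("/data/possibly-corrupted-logs"));

        // Assumed property name; cap a single line at 10 MB.
        conf.setInt("mapred.linerecordreader.maxlength", 10 * 1024 * 1024);

        // ... set mapper/reducer and output path, then submit as usual.
    }
}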

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

