hadoop-common-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3144) better fault tolerance for corrupted text files
Date Mon, 28 Apr 2008 04:00:55 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592732#action_12592732 ]

Joydeep Sen Sarma commented on HADOOP-3144:
-------------------------------------------

> isn't recovery effected by skipping the record that caused a failure on the map (HADOOP-153)?

thanks for pointing this out. that jira is not fixed, and it looks like there's still a debate
about what the right approach is .. it seems that even if it were fixed, the LineRecordReader
would have to implement an additional API to skip to the next record boundary (to skip the
bad record on map re-try) - so it looks like we would need similar code, albeit under a
different API.

that said - i am not sure i agree with the design of 153. it's not clear to me why it doesn't
suffice to let the record readers skip bad records themselves (as they must be able to do
even with 153's additional APIs). but that's a separate discussion ..

what's the status of 153? it seems that, depending on where it goes, these changes may
conflict or overlap ..
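
a minimal sketch of the skip-to-the-next-record-boundary behavior discussed above
(illustrative only - this is not the attached patch, and BoundedLineReader is an invented
name): read into a bounded buffer, and once the cap is hit, discard bytes until the next
newline so the reader resynchronizes on the following record.

    import java.io.IOException;
    import java.io.InputStream;

    // illustrative sketch - not LineRecordReader's actual implementation
    class BoundedLineReader {
      private final InputStream in;
      private final int maxLineLength;  // size limit, assumed configurable

      BoundedLineReader(InputStream in, int maxLineLength) {
        this.in = in;
        this.maxLineLength = maxLineLength;
      }

      // returns the next line no longer than maxLineLength, or null at EOF;
      // overlong lines are dropped instead of being buffered in memory
      String readLine() throws IOException {
        StringBuilder buf = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
          if (b == '\n') {
            return buf.toString();                 // complete record
          }
          if (buf.length() < maxLineLength) {
            buf.append((char) b);                  // bytes-as-chars: fine for a sketch
          } else {
            // line too long: discard bytes until the next record boundary
            while ((b = in.read()) != -1 && b != '\n') { }
            if (b == -1) {
              return null;                         // hit EOF inside the bad record
            }
            buf.setLength(0);                      // resync and read the next line
          }
        }
        return buf.length() > 0 ? buf.toString() : null;
      }
    }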

> better fault tolerance for corrupted text files
> -----------------------------------------------
>
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
>
>
> every once in a while we encounter corrupted text files (corrupted at source, prior to
> copying into hadoop). inevitably, some of the data looks like a really, really long line,
> and hadoop trips over trying to stuff it into an in-memory object and gets an out-of-memory
> error. the code looks the same in trunk as well ..
> so we are looking for an option to the TextInputFormat (and the like) to ignore long lines.
> ideally we would just skip errant lines above a certain size limit.
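
as for the option the description asks for, a hypothetical job setup against the old mapred
API might look like the sketch below. the driver class and the property name are assumptions
made up for illustration (this thread does not confirm what the attached patches call it);
the intent is simply that lines above the cap get skipped instead of buffered.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class SkipLongLinesExample {  // hypothetical driver class
      public static void main(String[] args) {
        JobConf conf = new JobConf(SkipLongLinesExample.class);
        conf.setInputFormat(TextInputFormat.class);
        // property name is an assumption for illustration only; the intent
        // is that lines longer than 1 MB are skipped rather than buffered
        conf.setInt("mapred.linerecordreader.maxlength", 1024 * 1024);
        // ... set mapper, input/output paths, and submit as usual
      }
    }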

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

