hadoop-common-dev mailing list archives

From "Devaraj Das (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-153) skip records that throw exceptions
Date Mon, 03 Mar 2008 09:58:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574404#action_12574404 ]

Devaraj Das commented on HADOOP-153:
------------------------------------

Here is a proposal that tries to address the scenarios discussed in this JIRA:
0) Define the concept of a _failed record number_ that is set by Tasks and propagated to the
JobTracker on task failures. This becomes part of the TIP object at the JobTracker.
1) Define an API in the RecordReader for finding the record boundary. On getting an exception
in RecordReader.next, the task scans from the beginning of the last successfully read record
to the next record boundary and reads the next record from that point (ignoring the boundary
bytes). This applies to maps.
2) Define an API in the RecordWriter for writing a record boundary along with every write(k, v).
The record boundary can default to the sync bytes. Tasks fail when they get an exception while
writing a bad record; with (0), the record can be skipped on subsequent retries. This applies
to the outputs of both maps and reduces.
3) Define an API in the RecordReader for indicating whether recovery should be attempted while
reading records (useful, for example, if the RecordReader's next() method has side effects
that would affect the reading of subsequent records after an exception on the current record).
A rough sketch of these proposed APIs follows this list.
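
To make (1)-(3) concrete, here is a rough sketch of what the new methods might look like.
Everything below is illustrative only: the interface and method names are invented for this
comment; only RecordReader and RecordWriter are real org.apache.hadoop.mapred interfaces.

    import java.io.IOException;

    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.RecordWriter;

    // Illustrative only -- names invented for this comment.
    interface SkippingRecordReader<K, V> extends RecordReader<K, V> {
      // (1) Skip past the bytes of the bad record: scan from the start of
      // the last successfully read record to the next record boundary and
      // position the reader just after it (the boundary bytes are ignored).
      void skipToNextRecordBoundary() throws IOException;

      // (3) Whether recovery is safe for this reader; false if next() has
      // side effects that would corrupt the reading of subsequent records.
      boolean isRecoverySupported();
    }

    // Illustrative only -- names invented for this comment.
    interface SkippingRecordWriter<K, V> extends RecordWriter<K, V> {
      // (2) Write a record boundary (defaulting to the sync bytes) along
      // with every write(k, v), so a bad record can be skipped on retry.
      void writeRecordBoundary() throws IOException;
    }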

When an application throws an exception in a map/reduce method, the exception is caught by
the Task code that invoked the method. The task attempt is killed after noting the record
number of the failed invocation. With point (0) above, these records are not fed to the
map/reduce methods on subsequent retries.
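
A simplified sketch of what that task-side loop could look like. This is not the real MapTask
code; it assumes the set of record numbers that failed in earlier attempts has been shipped to
this attempt via the TIP (point (0)), and the names here are hypothetical.

    import java.io.IOException;
    import java.util.Set;

    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    // Illustrative only -- not the real MapTask code.
    public class SkippingMapRunner<K1, V1, K2, V2> {

      private final Set<Long> failedRecords;  // hypothetical: comes from the TIP

      public SkippingMapRunner(Set<Long> failedRecords) {
        this.failedRecords = failedRecords;
      }

      public void run(RecordReader<K1, V1> in, Mapper<K1, V1, K2, V2> mapper,
                      OutputCollector<K2, V2> output, Reporter reporter)
          throws IOException {
        K1 key = in.createKey();
        V1 value = in.createValue();
        long recordNo = 0;
        while (in.next(key, value)) {
          recordNo++;
          if (failedRecords.contains(recordNo)) {
            continue;  // known-bad record from an earlier attempt: skip it
          }
          try {
            mapper.map(key, value, output, reporter);
          } catch (Exception e) {
            // The record number would be noted here (so the JobTracker can
            // record it in the TIP) before the attempt is failed.
            throw new IOException("map failed on record " + recordNo, e);
          }
        }
      }
    }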

The recovery/skip above is done on a best-effort basis; in the worst case, tasks still fail.

The above strategies should at least allow configuring the maximum percentage of records that
may keep the recovery/skip cycle going before the job is declared failed. Also, "skip records
on re-execution" should probably be configurable per job by the user, since the exceptions
could have been caused by something other than incorrect input.
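
For instance, per-job knobs might look like the following. The property names here are made up
for this comment and do not exist in Hadoop; only JobConf itself is real.

    import org.apache.hadoop.mapred.JobConf;

    public class SkipConfigExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf(SkipConfigExample.class);

        // Hypothetical per-job switch: turn skip-on-retry off when the
        // exceptions are not caused by bad input.
        conf.setBoolean("mapred.skip.records.on.retry", true);

        // Hypothetical threshold: declare the job failed once more than
        // this percentage of records has gone through the skip cycle.
        conf.set("mapred.skip.max.record.failures.percent", "1.0");
      }
    }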

Thoughts?

> skip records that throw exceptions
> ----------------------------------
>
>                 Key: HADOOP-153
>                 URL: https://issues.apache.org/jira/browse/HADOOP-153
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.2.0
>            Reporter: Doug Cutting
>            Assignee: Devaraj Das
>             Fix For: 0.17.0
>
>
> MapReduce should skip records that throw exceptions.
> If the exception is thrown under RecordReader.next() then RecordReader implementations
> should automatically skip to the start of a subsequent record.
> Exceptions in map and reduce implementations can simply be logged, unless they happen
> under RecordWriter.write().  Cancelling partial output could be hard.  So such output errors
> will still result in task failure.
> This behaviour should be optional, but enabled by default.  A count of errors per task
> and job should be maintained and displayed in the web ui.  Perhaps if some percentage of records
> (>50%?) result in exceptions then the task should fail.  This would stop jobs early that
> are misconfigured or have buggy code.
> Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

