hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (MAPREDUCE-7022) Fast fail rogue jobs based on task scratch dir size
Date Fri, 12 Jan 2018 23:04:03 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe reassigned MAPREDUCE-7022:

    Assignee: Johan Gustavsson  (was: Jason Lowe)

Thanks for updating the patch!

The common code between the listener's fatalError and fatalErrorFailFast should be factored
out; otherwise someone is going to come along and update one without updating the other.
Right now they are almost complete copies of each other.
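
As a rough sketch of what I mean (the helper name and signatures below are illustrative,
not the patch's actual code):

{code:java}
import org.apache.hadoop.mapred.TaskAttemptID;

// Illustrative only: both public entry points delegate to one private helper
// so the shared logic cannot drift between them.
public class FatalErrorListenerSketch {

  public void fatalError(TaskAttemptID taskAttemptId, String msg) {
    reportFatalError(taskAttemptId, msg, false);
  }

  public void fatalErrorFailFast(TaskAttemptID taskAttemptId, String msg) {
    reportFatalError(taskAttemptId, msg, true);
  }

  // The common diagnostic/unregister/event-dispatch logic would live here;
  // the failFast flag is the only difference between the two callers.
  private void reportFatalError(TaskAttemptID taskAttemptId, String msg,
      boolean failFast) {
    System.out.println("Fatal error for " + taskAttemptId
        + " (failFast=" + failFast + "): " + msg);
  }
}
{code}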

There are many places in the fast-fail code that refer to "job" when it really should be
"task".  A failing task does not necessarily mean the job fails.  I think it would be clearer
if FastFail and FailFast were replaced with FailTask in method names and fields.

It looks like TestTaskImpl is sending T_ATTEMPT_FAILED messages without them being the proper
event type, so event casting in the task transition will fail.
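
To illustrate the failure mode with a self-contained toy (these are stand-in classes, not
the real MR AM event types):

{code:java}
// Toy illustration: a transition that downcasts its event throws
// ClassCastException when handed the generic event type, which is why the
// test must send the specific subtype the transition expects.
public class EventCastSketch {
  static class TaskEvent { }
  static class TaskAttemptFailedEvent extends TaskEvent {
    final boolean failFast;
    TaskAttemptFailedEvent(boolean failFast) { this.failFast = failFast; }
  }

  // Stand-in for the task transition handling T_ATTEMPT_FAILED.
  static void onAttemptFailed(TaskEvent event) {
    TaskAttemptFailedEvent failed = (TaskAttemptFailedEvent) event;
    System.out.println("failFast = " + failed.failFast);
  }

  public static void main(String[] args) {
    onAttemptFailed(new TaskAttemptFailedEvent(true)); // works
    onAttemptFailed(new TaskEvent());                  // ClassCastException
  }
}
{code}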

The new doc change in TaskUmbilicalProtocol refers to failing the job, but it actually fails the task.

Nit: I think it would be cleaner if confs were rooted at mapreduce.job.local-fs.single-disk-limit,
e.g.: mapreduce.job.local-fs.single-disk-limit.bytes.
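
For example, a job could then set something like the following (the property names below
just show how the suggested rooting might look, not necessarily what the patch defines):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class DiskLimitConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Hypothetical per-disk byte limit for task-local scratch space.
    conf.setLong("mapreduce.job.local-fs.single-disk-limit.bytes",
        100L * 1024 * 1024 * 1024); // 100 GB
    // Hypothetical switch controlling whether exceeding the limit kills
    // the task or only logs it.
    conf.setBoolean(
        "mapreduce.job.local-fs.single-disk-limit.check.kill-limit-exceed",
        true);
    System.out.println(
        conf.get("mapreduce.job.local-fs.single-disk-limit.bytes"));
  }
}
{code}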

The default value for the boolean kill setting has a comment stating that negative values disable the limit.

The disk checker should always log rather than only logging when it is not killing.  That
way important info relevant to the task attempt is logged whether the task is killed or not.
It should arguably be logged as a WARN if not killing the task and FATAL if we do.
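
Something along these lines (the method and fields are illustrative, not the patch's checker):

{code:java}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class DiskLimitCheckSketch {
  private static final Log LOG = LogFactory.getLog(DiskLimitCheckSketch.class);

  private final long limitBytes;
  private final boolean killOnLimitExceeded;

  public DiskLimitCheckSketch(long limitBytes, boolean killOnLimitExceeded) {
    this.limitBytes = limitBytes;
    this.killOnLimitExceeded = killOnLimitExceeded;
  }

  /** Returns true if the task attempt should be killed. */
  public boolean checkScratchDirSize(String attemptId, long usedBytes) {
    if (usedBytes <= limitBytes) {
      return false;
    }
    String msg = "Task " + attemptId + " local scratch dir usage of "
        + usedBytes + " bytes exceeds the limit of " + limitBytes + " bytes";
    if (killOnLimitExceeded) {
      LOG.fatal(msg);  // killing the task: log at FATAL
      return true;
    }
    LOG.warn(msg);     // not killing the task: still log, but at WARN
    return false;
  }
}
{code}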

> Fast fail rogue jobs based on task scratch dir size
> ---------------------------------------------------
>                 Key: MAPREDUCE-7022
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7022
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 2.7.0, 2.8.0, 2.9.0
>            Reporter: Johan Gustavsson
>            Assignee: Johan Gustavsson
>         Attachments: MAPREDUCE-7022.001.patch, MAPREDUCE-7022.002.patch, MAPREDUCE-7022.003.patch,
> MAPREDUCE-7022.004.patch, MAPREDUCE-7022.005.patch, MAPREDUCE-7022.006.patch
> With the introduction of MAPREDUCE-6489 there are some options to kill rogue tasks based
> on writes to local disk. In our environment, where we mainly run Hive-based jobs, we noticed
> that this counter and the size of the local scratch dirs were very different. We had tasks
> where the BYTES_WRITTEN counter was at 300GB and others where it was at 10TB, both producing
> around 200GB on local disk, so it didn't help us much. To extend this feature, tasks should
> monitor local scratch dir size and fail if they pass the limit. In these cases the tasks
> should not be retried either; instead the job should fast fail.
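
(For context, a minimal self-contained sketch of the kind of scratch-dir size check the
description asks for; the directory and limit below are made up, and the real patch wires
this into the task-side disk checker.)

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class ScratchDirSizeCheck {
  static long directorySizeBytes(Path dir) throws IOException {
    try (Stream<Path> files = Files.walk(dir)) {
      return files.filter(Files::isRegularFile)
          .mapToLong(p -> p.toFile().length())
          .sum();
    }
  }

  public static void main(String[] args) throws IOException {
    Path scratchDir = Paths.get(args.length > 0 ? args[0] : "/tmp");
    long limitBytes = 200L * 1024 * 1024 * 1024; // e.g. 200 GB
    long used = directorySizeBytes(scratchDir);
    if (used > limitBytes) {
      System.err.println("Scratch dir " + scratchDir + " uses " + used
          + " bytes, over the " + limitBytes + "-byte limit; the task should"
          + " fail without retries.");
    }
  }
}
{code}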
