hadoop-mapreduce-issues mailing list archives

From "Johan Gustavsson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-7022) Fast fail rogue jobs based on task scratch dir size
Date Thu, 14 Dec 2017 03:23:00 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290279#comment-16290279 ]

Johan Gustavsson commented on MAPREDUCE-7022:
---------------------------------------------

Thanks for taking the time to review and give detailed feedback, [~jlowe].
bq. It's pretty confusing to have both mapreduce.task.local-fs.limit.bytes and mapreduce.task.local-fs.write-limit.bytes.

As you pointed out, this is not meant as a single-task monitor, but rather a per-job,
single-disk usage monitor. Most of the naming around it probably came subconsciously, since
this patch was heavily inspired by MAPREDUCE-6489. I'll rename it to something like
mapreduce.job.single-disk.limit.bytes as you suggested and make the description clearer.
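
For context, a minimal sketch of how a job could opt in once the rename lands. The property
name is only my proposal above and may still change during review:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class SingleDiskLimitExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Proposed (not yet committed) property name: cap per-job usage of a
    // single local disk, here at 200 GB.
    conf.setLong("mapreduce.job.single-disk.limit.bytes",
        200L * 1024 * 1024 * 1024);
  }
}
{code}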
bq. This is going to add a disk I/O dependency to every task heartbeat where the task attempt
needs to touch every disk. 
Good point. I like your idea of putting it into a background thread, so I'll try to rewrite
it accordingly.
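
Roughly what I have in mind, as a sketch only (class and method names here are illustrative,
not the actual patch): a daemon thread periodically sums the scratch dir size and caches it,
so the heartbeat path never has to touch the disks:

{code:java}
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

public class ScratchDirMonitor {
  private final AtomicLong lastMeasuredBytes = new AtomicLong();
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "scratch-dir-monitor");
        t.setDaemon(true);
        return t;
      });

  public void start(Path scratchDir, long intervalSeconds) {
    scheduler.scheduleWithFixedDelay(() -> {
      long total = 0L;
      try (Stream<Path> files = Files.walk(scratchDir)) {
        total = files.filter(Files::isRegularFile)
                     .mapToLong(p -> p.toFile().length())
                     .sum();
      } catch (Exception e) {
        // Ignore transient errors (e.g. files deleted mid-walk); the next
        // pass will measure again.
      }
      lastMeasuredBytes.set(total);
    }, 0L, intervalSeconds, TimeUnit.SECONDS);
  }

  // Called from the heartbeat path: reads the cached value, no disk I/O.
  public long getLastMeasuredBytes() {
    return lastMeasuredBytes.get();
  }
}
{code}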
bq. Comments on the code changes
I'll try to fix them all. The main reason I introduced the FF key all over the place was to
avoid having to touch the actual state machine, but I see your point about how to avoid doing
both and clean it up at the same time. Also a good point that most people probably don't know
what "ff" stands for out of context, so I'll try to make it less cryptic.

Thanks once again, I'll try to have something ready in the next couple of days.

> Fast fail rogue jobs based on task scratch dir size
> ---------------------------------------------------
>
>                 Key: MAPREDUCE-7022
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7022
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 2.7.0, 2.8.0, 2.9.0
>            Reporter: Johan Gustavsson
>            Assignee: Johan Gustavsson
>         Attachments: MAPREDUCE-7022.001.patch, MAPREDUCE-7022.002.patch
>
>
> With the introduction of MAPREDUCE-6489 there are now options to kill rogue tasks based
> on their writes to local disk. In our environment, where we mainly run Hive-based jobs, we
> noticed that this counter and the size of the local scratch dirs can differ widely: we had
> tasks where the BYTES_WRITTEN counter was at 300 GB and others where it was at 10 TB, both
> producing around 200 GB on local disk, so the counter didn't help us much. To extend this
> feature, tasks should monitor local scratch dir size and fail once they pass the limit. In
> these cases the tasks should not be retried either; instead the job should fast fail.
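
For illustration, a minimal, self-contained sketch of the proposed check (the helper and the
exit path here are assumptions for this example; the real patch would wire the decision into
the task attempt and the AM so the job fails without task retries):

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class ScratchDirLimitCheck {

  // Sum the size of all regular files under the scratch dir.
  static long dirSizeBytes(Path dir) throws IOException {
    try (Stream<Path> files = Files.walk(dir)) {
      return files.filter(Files::isRegularFile)
                  .mapToLong(p -> p.toFile().length())
                  .sum();
    }
  }

  public static void main(String[] args) throws IOException {
    Path scratchDir = Paths.get(args[0]);
    long limitBytes = Long.parseLong(args[1]);

    if (dirSizeBytes(scratchDir) > limitBytes) {
      // Rogue job: fail the whole job immediately instead of retrying the
      // task, since a retry would just fill the disk again.
      System.err.println("Scratch dir exceeded " + limitBytes
          + " bytes; job should fast fail without task retries.");
      System.exit(1);
    }
  }
}
{code}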



