hadoop-yarn-dev mailing list archives

From "Hari Sekhon (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-3680) Graceful queue capacity reclaim without KilledTaskAttempts
Date Tue, 19 May 2015 14:07:00 GMT
Hari Sekhon created YARN-3680:
---------------------------------

             Summary: Graceful queue capacity reclaim without KilledTaskAttempts
                 Key: YARN-3680
                 URL: https://issues.apache.org/jira/browse/YARN-3680
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: applications, capacityscheduler, resourcemanager, scheduler
    Affects Versions: 2.6.0
         Environment: HDP 2.2.4
            Reporter: Hari Sekhon


Request to allow graceful reclaim of queue resources by waiting until running containers finish
naturally rather than killing them.

For example, if you dynamically reconfigure YARN queue capacity/maximum-capacity to shrink
one queue, containers in that queue start getting killed (and pre-emption is not configured
on this cluster), instead of being allowed to finish naturally with the freed resources
simply no longer being available for new tasks of that job.

This is relevant if a task makes non-idempotent changes that can cause issues when the
task is half completed, then killed and re-run from the beginning later.
For example, I bulk index to Elasticsearch with uniquely generated IDs since the source data
doesn't have any key or even compound key that is unique. This means that if a task sends half
its data, is killed and then starts again, it introduces a large number of duplicates into
the ES index, with no mechanism to dedupe later other than rebuilding the entire index
from scratch, which is hundreds of millions of docs multiplied by many, many indices.
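
To make the failure mode concrete, here is a minimal sketch in Python. It is a toy model, not Elasticsearch or YARN code: the dict standing in for the index, the `run_task` helper and the `fail_after` parameter are all hypothetical names introduced for illustration.

```python
import uuid

def run_task(index, records, fail_after=None):
    """Index records under freshly generated IDs.

    If fail_after is set, stop after that many records to mimic the
    container being killed mid-task. Returns True if the task completed.
    """
    for i, record in enumerate(records):
        if fail_after is not None and i >= fail_after:
            return False  # container killed part-way through
        # A new random ID per write: the source data has no unique key.
        index[str(uuid.uuid4())] = record
    return True

records = [f"event-{n}" for n in range(10)]
index = {}  # stand-in for an Elasticsearch index: doc ID -> document

# First attempt is killed after sending half its data.
completed = run_task(index, records, fail_after=5)

# The task is then re-run from the beginning.
if not completed:
    run_task(index, records)

# Documents stored more than once under different IDs.
duplicates = len(index) - len(set(index.values()))
print(len(index), duplicates)
```

Because every write gets a new ID, the re-run cannot overwrite what the killed attempt already sent, so the first half of the data ends up in the index twice.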

I appreciate this is a serious request and could cause problems with long-running services
never returning their resources... so there needs to be some kind of interaction of variables
or similar to separate the indefinitely running tasks of long-lived services from the finite-runtime
analytic job tasks, with some sort of time-based safety cut-off.
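
The requested behaviour could look roughly like the following Python sketch. It is a toy simulation under stated assumptions, not the CapacityScheduler API: `Container`, `graceful_reclaim`, `target` and `grace_ticks` are hypothetical names, time is modelled as discrete ticks, and killing the containers with the most remaining work at the cut-off is just one possible policy.

```python
import math
from dataclasses import dataclass

@dataclass
class Container:
    name: str
    remaining: float  # ticks until natural finish; math.inf for a long-lived service

def graceful_reclaim(containers, target, grace_ticks):
    """Shrink a queue to `target` containers without killing where possible.

    Waits up to `grace_ticks` for containers to finish naturally and only
    kills what still exceeds the target after that (the safety cut-off).
    Returns (finished, killed, still_running) as lists of names.
    """
    running = [Container(c.name, c.remaining) for c in containers]  # work on copies
    finished, killed = [], []
    ticks = 0
    while len(running) > target and ticks < grace_ticks:
        ticks += 1
        for c in running:
            c.remaining -= 1
        finished += [c.name for c in running if c.remaining <= 0]
        # Freed slots are not handed back to the queue, so it shrinks over time.
        running = [c for c in running if c.remaining > 0]
    excess = len(running) - target
    if excess > 0:
        # Assumed policy: at the cut-off, kill those furthest from finishing.
        running.sort(key=lambda c: c.remaining)
        killed = [c.name for c in running[-excess:]]
        running = running[:-excess]
    return finished, killed, [c.name for c in running]

queue = [
    Container("map-1", 2),
    Container("map-2", 3),
    Container("map-3", 8),
    Container("service", math.inf),
]

# Shrinking from 4 to 2 containers: two tasks finish in time, nothing is killed.
no_kill = graceful_reclaim(queue, target=2, grace_ticks=5)

# Shrinking to 1 container: the never-ending service hits the cut-off.
with_cutoff = graceful_reclaim(queue, target=1, grace_ticks=5)
```

In the first call the finite tasks drain on their own; in the second, the service that would never return its resources is the only container killed once the grace period expires.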



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
