hadoop-yarn-issues mailing list archives

From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart
Date Thu, 15 Aug 2013 16:44:50 GMT

    [ https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13741165#comment-13741165 ]

Bikas Saha commented on YARN-1055:
----------------------------------

First of all, as folks have already agreed, this is fundamentally an Oozie problem. I don't
want to add an option to YARN that does not make sense for YARN by itself. If YARN needs to
hack a workaround to fix an Oozie problem, I would also like to see what Oozie is doing to
hold up its end of the bargain. What is the Oozie jira that fixes this fundamental problem with Oozie?

With the correct settings, this may be a problem only on rare occasions for an Oozie workflow
when an action-am node crashes. IMO it's an OK compromise for the short term while YARN is
still not GA.

This issue has existed since YARN started and since we started working on RM restart. If it
hasn't been a catastrophic issue until now, then IMO it can wait some more time until we complete
YARN-556. RM restart is work in active progress, and I don't understand why we need to hack
an API together when we are already tracking a proper solution in YARN-556. YARN and Hadoop-1
are different enough that 1-1 regression matching may not always make sense. Even when it
does, it will be a regression only when YARN goes GA. Until then all of this is work in progress,
and users need to be aware of limitations that are known and being fixed. The cornerstone
of the beta release that we have all worked so hard for is making a viable and stable API
that we want to support. Adding a short-term API would go against the basic premise of the
beta release.

Any workaround, stop-gap, etc. requires a code change and maintenance of that code through future
code changes. The request here is for an additional API in AppSubmissionContext that helps Oozie
work around its lack of book-keeping. Once YARN goes out with beta, this API will have
to be maintained forever, since removing an API is backwards incompatible. Given that we are
already committed to fixing this via YARN-556, adding a short-term API that will need to be
maintained forever is a disaster, and I don't see enough value being added to suffer through
it. We are better off not spending more time on this and devoting that energy to things like
YARN-556 that make real improvements for everyone.

I really hope this clarifies my position and assures you that we are committed to solving the
problem in the correct manner.
                
> Handle app recovery differently for AM failures and RM restart
> --------------------------------------------------------------
>
>                 Key: YARN-1055
>                 URL: https://issues.apache.org/jira/browse/YARN-1055
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Karthik Kambatla
>
> Ideally, we would like to tolerate container, AM, RM failures. App recovery for AM and RM
> currently relies on the max-attempts config; tolerating AM failures requires it to be > 1
> and tolerating RM failure/restart requires it to be = 1.
> We should handle these two differently, with two separate configs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
