hadoop-mapreduce-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce
Date Tue, 23 Feb 2016 18:00:22 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159319#comment-15159319 ]

Junping Du commented on MAPREDUCE-6608:
---------------------------------------

Thanks [~srikanth.sampath] for updating the design doc and uploading an outstanding demo patch!
Sorry for the slightly late reply; I just came back from vacation. I finally got a chance to
review the latest document and the demo patch.

+1 on Vinod's proposal of separating the write and read paths. This solution is even better than
my proposal (the HDFS way) above, since having no single point of access means better scalability.
The only problem is that the implementation is more complicated: it involves a new RPC service
in the NM (with the task as the client) and more payload in the NM-RM heartbeat, so we should
split it out into a dedicated YARN JIRA to track the work.

Other quick comments on the design doc:

bq.  The work preserving feature of an MR Application can be set at an application level,
when the application is submitted.
Sounds good. We can introduce a new MR config to switch this feature on/off (off by default).
However, I didn't see any implementation of this in the demo patch, and I think we should add it
from the beginning, as we want to keep the old behavior (code path) unchanged when the feature
is off.
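
As a rough illustration of the switch described above, here is a minimal sketch. The property name and class are made up for this example (they are not existing MapReduce names), and a plain java.util.Properties stands in for the Hadoop configuration object. The point is only that an unset key resolves to "off", so the existing code path stays the default.

```java
import java.util.Properties;

// Hypothetical sketch: neither this class nor the key below exists in MapReduce.
final class WorkPreservingSwitch {
    // Assumed property name, for illustration only.
    static final String WORK_PRESERVING_ENABLED = "mapreduce.am.work-preserving.enabled";
    // Off by default, so jobs that never set the key keep the old behavior.
    static final boolean DEFAULT_WORK_PRESERVING_ENABLED = false;

    static boolean isEnabled(Properties conf) {
        String raw = conf.getProperty(WORK_PRESERVING_ENABLED);
        return raw == null
            ? DEFAULT_WORK_PRESERVING_ENABLED
            : Boolean.parseBoolean(raw.trim());
    }

    // Chooses which code path the AM would take; the labels are placeholders.
    static String selectCodePath(Properties conf) {
        return isEnabled(conf) ? "work-preserving" : "legacy";
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(selectCodePath(conf));
        conf.setProperty(WORK_PRESERVING_ENABLED, "true");
        System.out.println(selectCodePath(conf));
    }
}
```

Guarding every new branch behind one such check would make it easy to verify that nothing changes for existing jobs when the feature is off.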

bq. When the AM starts up, the registry operations is started as a service. An AM creates
a service record id being the JobId and persistence being at the application level. It then
stores the address(host, port) as an internal endpoint.
Besides the need to replace the read path of the registry service, another point is that we don't
necessarily need to keep the first AM attempt's info, which could save most of the overhead we
are adding here, as most applications are expected to finish with a single attempt. Isn't that so?

bq. Currently, YarnChild uses positional arguments as parameters. This will be enhanced to
use named arguments as parameters. For work preserving jobs, the path to the service record
is passed as the parameter to determine the address of the AM.
Agreed that named arguments sound better. However, the current approach has worked for a long
time in the MapReduce project, and we would prefer not to change it unless we find some issue
or bug. For the path to the service record, we need to stay consistent with our decision on
the read path.

bq. Thus UmbilicalWithRetries is a wrapper over Umbilical with retries implemented. Depending
on whether the AM is workpreserving or not, a factory method creates either a vanilla umbilical
or one with retries.
UmbilicalWithRetries should follow the existing practice (for RPC client retry during service
downtime) of creating a RetryProxy with a FailoverProxyProvider (maybe call it MRAMProxy), so
that a task attempt can contact the new AM attempt instance.
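
To make the retry-plus-failover idea concrete, here is a minimal sketch using only the JDK. It does not use Hadoop's RetryProxy or FailoverProxyProvider from org.apache.hadoop.io.retry; a java.lang.reflect.Proxy stands in for them, and the Umbilical interface, the MRAMProxySketch class, and the Supplier-based lookup are all assumptions made for illustration. The lookup is re-run before each attempt, which is where a real provider would resolve the address of the new AM attempt.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.function.Supplier;

// Stand-in for the task-to-AM umbilical protocol (hypothetical, one method only).
interface Umbilical {
    String ping(String taskAttemptId) throws IOException;
}

// Hypothetical sketch of the "MRAMProxy" idea; not Hadoop's RetryProxy.
final class MRAMProxySketch {
    static <T> T create(Class<T> iface, Supplier<T> currentAm, int maxAttempts, long sleepMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        InvocationHandler handler = (proxy, method, args) -> {
            IOException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                // Failover step: re-resolve the AM, which may be a new attempt by now.
                T target = currentAm.get();
                try {
                    return method.invoke(target, args);
                } catch (InvocationTargetException e) {
                    Throwable cause = e.getCause();
                    if (!(cause instanceof IOException)) {
                        throw cause; // not a connectivity problem, so do not retry
                    }
                    last = (IOException) cause;
                }
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(sleepMs); // fixed back-off between attempts
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new InterruptedIOException("interrupted while waiting for the AM");
                    }
                }
            }
            throw last; // every attempt failed with an IOException
        };
        return iface.cast(
            Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] {iface}, handler));
    }

    public static void main(String[] args) throws IOException {
        int[] lookups = {0};
        Umbilical oldAm = id -> { throw new IOException("AM attempt 1 is gone"); };
        Umbilical newAm = id -> "ack:" + id;
        // First lookup returns the dead AM, later lookups return the restarted one.
        Umbilical umbilical =
            create(Umbilical.class, () -> lookups[0]++ == 0 ? oldAm : newAm, 3, 10L);
        System.out.println(umbilical.ping("attempt_1"));
    }
}
```

With the task code written against the interface only, the factory method mentioned in the design doc could hand back either the plain umbilical or such a proxy without the task noticing the difference.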

The TaskManagement part looks good to me.

> Work Preserving AM Restart for MapReduce
> ----------------------------------------
>
>                 Key: MAPREDUCE-6608
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Srikanth Sampath
>            Assignee: Srikanth Sampath
>         Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, WorkPreservingMRAppMaster-2.pdf, WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].
> We would like to take advantage of this for MapReduce(MR) applications.  There are some challenges
> which have been described in the attached document and few options discussed.  We solicit
> feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
