hadoop-yarn-issues mailing list archives

From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2928) Application Timeline Server (ATS) next gen: phase 1
Date Wed, 14 Jan 2015 23:30:37 GMT

    [ https://issues.apache.org/jira/browse/YARN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277880#comment-14277880 ]

Vinod Kumar Vavilapalli commented on YARN-2928:


bq. (1) While it may be faster to allocate with the per-node companions, capacity-wise you
would end up spending more capacity with the per-node approach, since these per-node companions
are always up although they may be idle for a large amount of time. So if capacity is a concern
you may lose out. Under what circumstances would per-node companions be more advantageous
in terms of capacity?
Agreed, we will have to carve out some capacity for the per-node companions. I see some sort
of static allocation, like 1GB, similar to the NodeManager. I've never seen anyone change the NM
capacity, as it usually just forgets things or persists state to a local store. The per-node
agent can take the same approach: a limited heap, and forget or spill over to the Timeline
Storage (e.g. HBase). Capacity will be a concern only when we want to use some memory for
short-term aggregations. The other point is that we have to carve out this capacity anyway
for things like YARN-2965.
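
To illustrate the "limited heap, and forget or spill over" idea, here is a minimal sketch. It is
not the ATS design or any existing YARN API; the class name BoundedEventBuffer, the capacity
parameter, and the spill callback (standing in for a write to the Timeline Storage) are all
hypothetical.

```python
from collections import deque


class BoundedEventBuffer:
    """Keeps at most `capacity` events in memory and spills older ones.

    Hypothetical sketch: `spill` is any callable that accepts a list of
    events, standing in for a write to the Timeline Storage (e.g. HBase).
    """

    def __init__(self, capacity, spill):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.spill = spill
        self.events = deque()

    def add(self, event):
        self.events.append(event)
        # Once the in-memory limit is exceeded, move the oldest events out.
        overflow = []
        while len(self.events) > self.capacity:
            overflow.append(self.events.popleft())
        if overflow:
            self.spill(overflow)


# Usage: a plain list stands in for the storage backend.
stored = []
buf = BoundedEventBuffer(capacity=2, spill=stored.extend)
for i in range(5):
    buf.add(i)
```

Passing a spill callback that discards its argument gives the "forget" variant; either way the
agent's memory footprint stays fixed regardless of how many applications report to it.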

bq. (2) I do have a question about the work-preserving aspect of the per-node ATS companion.
One implication of making this a per-node thing (i.e. long-running) is that we need to handle
the work-preserving restart. What if we need to restart the ATS companion? Since other YARN
daemons (RM and NM) allow for work-preserving restarts, we cannot have the ATS companion break
that. So that seems to be a requirement?
Yes, recoverability is a requirement for ALA. I'd design it such that it is the responsibility
of each app's aggregator (living inside the node agent) instead of the node agent itself.
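
A rough sketch of that split of responsibility, under stated assumptions: the names
AppAggregator and NodeAgent, the counter used as state, and the dict used as a durable store
are all hypothetical and only meant to show where the recovery logic would live.

```python
class AppAggregator:
    """Per-app aggregator that checkpoints and recovers its own state."""

    def __init__(self, app_id, store):
        self.app_id = app_id
        # Hypothetical durable key-value store; a dict stands in for it here.
        self.store = store
        # Recovery happens here, inside the aggregator, not in the agent.
        self.count = store.get(app_id, 0)

    def record(self, n=1):
        self.count += n
        # Checkpoint after every update so a restart loses nothing.
        self.store[self.app_id] = self.count


class NodeAgent:
    """Hosts aggregators but holds no recovery logic of its own."""

    def __init__(self, store):
        self.store = store
        self.aggregators = {}

    def aggregator_for(self, app_id):
        if app_id not in self.aggregators:
            self.aggregators[app_id] = AppAggregator(app_id, self.store)
        return self.aggregators[app_id]


# Usage: the store outlives the agent, as a local state store would.
store = {}
agent = NodeAgent(store)
agent.aggregator_for("app_1").record(3)

restarted = NodeAgent(store)  # simulates restarting the node agent
recovered = restarted.aggregator_for("app_1").count
```

Because the agent only instantiates aggregators, the same recovery code would apply unchanged
if an aggregator were later hosted somewhere other than a per-node agent.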

bq. (3) We still need to handle the lifecycle management aspects of it. Previously we said
that when RM allocates an AM it would tell the NM so the NM could spawn the special container.
With the per-node approach, the RM would still need to tell the NM so that the NM can talk
to the per-node ATS companion to initialize the data structure for the given app.
Yes again. That doesn't change. And it would work exactly the way you said: at no place in
the system will it be assumed that the aggregator is running per node, except for the final
'launcher' that launches the aggregator.
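
One way to picture this is an interface with interchangeable launchers. The sketch below is
an assumption for illustration only; AggregatorLauncher, PerNodeLauncher, PerAppLauncher,
on_am_allocated, and the address strings are hypothetical names, not YARN classes.

```python
from abc import ABC, abstractmethod


class AggregatorLauncher(ABC):
    """The only component that knows where an aggregator runs."""

    @abstractmethod
    def launch(self, app_id, node):
        """Start (or register) an aggregator and return its address."""


class PerNodeLauncher(AggregatorLauncher):
    """Registers the app with a long-running agent shared by the node."""

    def __init__(self):
        self.apps_by_node = {}

    def launch(self, app_id, node):
        self.apps_by_node.setdefault(node, set()).add(app_id)
        return f"{node}/agent"


class PerAppLauncher(AggregatorLauncher):
    """Starts a dedicated aggregator for each application."""

    def launch(self, app_id, node):
        return f"{node}/{app_id}"


def on_am_allocated(launcher, app_id, node):
    # Caller-side code: identical for both deployment models.
    return launcher.launch(app_id, node)


per_node = PerNodeLauncher()
shared_a = on_am_allocated(per_node, "app_1", "node7")
shared_b = on_am_allocated(per_node, "app_2", "node7")
dedicated = on_am_allocated(PerAppLauncher(), "app_1", "node7")
```

Switching between the per-node and per-app models then means swapping the launcher, while the
notification path from the RM to the NM stays the same.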

bq. These are quick observations. While I do see value in the per-node approach, it's not
totally clear how much work it would save over the per-app approach given these observations.
What do you think?
Like I mentioned, it won't save anything. It does two things in my mind: (1) lets us focus on
the wire-up first without thinking about scheduling aspects in the RM, and (2) lets us figure
out how other parallel efforts like YARN-1012, YARN-2965, YARN-2984, and YARN-2141 can be unified
in terms of per-node stats collection.

> Application Timeline Server (ATS) next gen: phase 1
> ---------------------------------------------------
>                 Key: YARN-2928
>                 URL: https://issues.apache.org/jira/browse/YARN-2928
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: timelineserver
>            Reporter: Sangjin Lee
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Critical
>         Attachments: ATSv2.rev1.pdf, ATSv2.rev2.pdf
> We have the application timeline server implemented in yarn per YARN-1530 and YARN-321.
Although it is a great feature, we have recognized several critical issues and features that
need to be addressed.
> This JIRA proposes the design and implementation changes to address those. This is phase
1 of this effort.

