hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3039) [Aggregator wireup] Implement ATS writer service discovery
Date Tue, 24 Feb 2015 15:29:05 GMT

    [ https://issues.apache.org/jira/browse/YARN-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335010#comment-14335010 ]

Junping Du commented on YARN-3039:
----------------------------------

Thanks [~zjshen] for review and comments!
bq. I think so, too. RM has its own builtin aggregator, and RM directly writes through it.
I have a very basic question here: didn't we want a singleton app aggregator for all app-related
events, logs, etc.? Ideally, only that singleton aggregator has the logic to sort out app
info during aggregation. If not, we could even give up the current flow "NM(s) -> app aggregator
(deployed on one NM) -> backend" and let each NM talk to the backend directly, which saves
one hop of traffic. Can you clarify this?
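To make the flow I'm questioning concrete, here is a minimal standalone sketch of the per-app
singleton aggregator idea. All names here (AppAggregatorRegistry, bind, resolve) are made up
for illustration and are not the actual YARN classes:

{code:java}
// Hypothetical sketch (not the real YARN API): each app has exactly one
// aggregator address, and every NM writes through it instead of talking
// to the backend storage directly.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AppAggregatorRegistry {
    // appId -> address of the singleton aggregator for that app
    private final Map<String, String> aggregatorAddrs = new ConcurrentHashMap<>();

    // Called when the aggregator for an app is (re)bound, e.g. on the NM
    // where the app's first container launched.
    public void bind(String appId, String addr) {
        aggregatorAddrs.put(appId, addr);
    }

    // An NM resolves the app's aggregator before posting entities there.
    public String resolve(String appId) {
        return aggregatorAddrs.get(appId); // null -> address unknown or stale
    }

    public static void main(String[] args) {
        AppAggregatorRegistry registry = new AppAggregatorRegistry();
        registry.bind("app_0001", "nm-host-1:port");
        System.out.println("app_0001 writes go via " + registry.resolve("app_0001"));
    }
}
{code}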

bq. In the heartbeat, instead of always sending a snapshot of the aggregator address info,
can we send incremental information only when a change happens to the aggregator address
table? Usually an aggregator will not change its place often, so we can avoid unnecessary
additional traffic in most heartbeats.
That's a very good point for discussion.
The interesting thing here is that only by comparing against the info sent by the client (NM)
can the server (RM) know what has changed since the last heartbeat. Take the token update as
an example (populateKeys() in ResourceTrackerService): in the current implementation, the
master keys known to the NM (ContainerTokenMasterKey and NMTokenMasterKey) are encoded in the
request, and in the response we filter out the old keys the NM already knows. IMO, this approach
(put everything in the request, and put something/nothing in the response) is not an
optimization over putting nothing in the request and everything in the response; it only turns
outbound traffic into inbound and moves the comparison logic to the server side. Isn't it?
Another optimization we can consider is to let the client express the app aggregators it is
interested in on the request (by adding them to a new optional field, e.g. InterestedApps)
whenever it finds that info missing or stale, and have the server loop in only the related app
aggregators' info. The NM can maintain an interested-app-aggregator list that gets updated when
an app's first container is launched or when the app's aggregator info goes stale (as may be
reported by the writer/reader retry logic), and items get removed from the list once they are
received in a heartbeat response. Thoughts? (A rough sketch of this idea follows.)
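Here is a rough standalone sketch of that InterestedApps idea. The names (heartbeat,
interestedApps, rmAggregatorTable) are hypothetical and not the real
NodeHeartbeatRequest/Response fields; it just shows the NM asking only for missing or stale
entries and the RM answering with just those, instead of a full snapshot:

{code:java}
// Hypothetical sketch: incremental aggregator-address discovery over the
// NM heartbeat, under the assumptions described above.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class IncrementalAggregatorDiscovery {
    // RM side: authoritative appId -> aggregator address table.
    private final Map<String, String> rmAggregatorTable = new HashMap<>();

    // RM handling one heartbeat: loop in only the requested apps.
    public Map<String, String> heartbeat(Set<String> interestedApps) {
        Map<String, String> response = new HashMap<>();
        for (String appId : interestedApps) {
            String addr = rmAggregatorTable.get(appId);
            if (addr != null) {
                response.put(appId, addr);
            }
        }
        return response;
    }

    public static void main(String[] args) {
        IncrementalAggregatorDiscovery rm = new IncrementalAggregatorDiscovery();
        rm.rmAggregatorTable.put("app_0001", "nm-host-1:port");

        // NM side: apps whose aggregator info is missing or stale.
        Set<String> interested = new HashSet<>();
        interested.add("app_0001");

        Map<String, String> resolved = rm.heartbeat(interested);
        // Remove resolved entries; leftovers are retried next heartbeat.
        interested.removeAll(resolved.keySet());
        System.out.println("resolved: " + resolved + ", still pending: " + interested);
    }
}
{code}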

bq. One additional issue related to the RM state store: calling it in the update transition may
break app recovery. The current state, instead of the final state, will be written into
the store. If the RM stops and restarts at this moment, the app can't be recovered properly.
Thanks for the reminder on this. This is something I am not 100% sure about. However, from
recoverApplication() in RMAppManager, I don't see that we cannot recover an app in RUNNING or
other states (except final states like KILLED, FINISHED, etc.). Am I missing anything here?
One piece of code that is indeed missing is that I forgot to repopulate aggregatorAddr from
the store in RMAppImpl.recover(); I will add it back in the next patch. (See the sketch below.)
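For clarity, here is a minimal sketch of that missing recovery step. The classes (RMApp,
AppState) are made up and much simpler than the real RMAppImpl/RMStateStore; the point is only
that recover() must restore the persisted aggregator address along with the app state:

{code:java}
// Hypothetical sketch: on RM restart, recover() repopulates the
// aggregator address persisted in the state store; otherwise the
// recovered app loses its aggregator binding.
import java.util.HashMap;
import java.util.Map;

public class AppRecoverySketch {
    static class AppState {
        final String state;          // e.g. "RUNNING"
        final String aggregatorAddr; // persisted aggregator address
        AppState(String state, String aggregatorAddr) {
            this.state = state;
            this.aggregatorAddr = aggregatorAddr;
        }
    }

    static class RMApp {
        String state;
        String aggregatorAddr;

        // Analogous to RMAppImpl.recover(RMState): restore both the app
        // state and the aggregator address from the store.
        void recover(AppState stored) {
            this.state = stored.state;
            this.aggregatorAddr = stored.aggregatorAddr; // the piece to add back
        }
    }

    public static void main(String[] args) {
        Map<String, AppState> stateStore = new HashMap<>();
        stateStore.put("app_0001", new AppState("RUNNING", "nm-host-1:port"));

        RMApp app = new RMApp();
        app.recover(stateStore.get("app_0001"));
        System.out.println(app.state + " via " + app.aggregatorAddr);
    }
}
{code}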


> [Aggregator wireup] Implement ATS writer service discovery
> ----------------------------------------------------------
>
>                 Key: YARN-3039
>                 URL: https://issues.apache.org/jira/browse/YARN-3039
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Sangjin Lee
>            Assignee: Junping Du
>         Attachments: Service Binding for applicationaggregator of ATS (draft).pdf, YARN-3039-no-test.patch
>
>
> Per design in YARN-2928, implement ATS writer service discovery. This is essential for
> off-node clients to send writes to the right ATS writer. This should also handle the case
> of AM failures.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
