hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3039) [Aggregator wireup] Implement ATS app-appgregator service discovery
Date Tue, 17 Mar 2015 16:27:39 GMT

    [ https://issues.apache.org/jira/browse/YARN-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365437#comment-14365437 ]

Junping Du commented on YARN-3039:
----------------------------------

Thanks [~sjlee0], [~zjshen] and [~jianhe] for review and comments!
bq.TimelineClient.java, l.59-64: I would vote for simply removing the method
Yes. I forgot to delete it after commenting and testing.

bq.l.413: nit: space
Fixed.

bq.l.416: should we throw a RuntimeException? how about wrapping it in a YarnException?
Sounds good. Updated.
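
For reference, a minimal sketch of the suggested wrapping (the surrounding class and lookup method are hypothetical; only YarnException is the real API):
{code:java}
import java.io.IOException;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Hypothetical helper, not patch code: instead of throwing a bare
// RuntimeException on failure, wrap the cause in a YarnException so
// callers can handle it as a checked YARN-level error.
public class AggregatorAddressLookupSketch {

  public String resolveAggregatorAddress(String appId) throws YarnException {
    try {
      return lookupAddress(appId);
    } catch (IOException e) {
      // was (roughly): throw new RuntimeException(e);
      throw new YarnException("Failed to resolve aggregator address for app " + appId, e);
    }
  }

  // Placeholder for whatever lookup the real code performs.
  private String lookupAddress(String appId) throws IOException {
    throw new IOException("lookup not implemented in this sketch");
  }
}
{code}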

bq.yarn-default.xml,It still has 20 as the thread count default.
Sorry. I fixed the other place but forgot to update the value here. Updated in the v6 patch.

bq. I’m not 100% certain that there is an issue, but I think we need to be careful about
always removing registered/known aggregators when the app is done. Failure to do so would
quickly ramp up the NM memory, and it would become unusable. Again, I think the current code
will properly remove a finished app from registered/known aggregator maps. It's just a reminder
to be careful about this.
Agreed, and thanks for the reminder. I think it should be fine for the usual case (the app finishes smoothly), but we also need to track this for the AM/aggregator failover case. Will add a comment in YARN-3038.
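
To make the concern concrete, here is a rough sketch of the kind of cleanup that has to happen on the NM side (map and method names are illustrative, not the actual per-node aggregator service code):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.yarn.api.records.ApplicationId;

// Illustrative only: if entries are not removed when an application
// finishes, these maps grow with every app the NM has ever seen.
public class AggregatorRegistrySketch {
  // aggregators launched/registered on this NM
  private final Map<ApplicationId, String> registeredAggregators =
      new ConcurrentHashMap<ApplicationId, String>();
  // aggregators this NM has learned about (e.g. via heartbeat)
  private final Map<ApplicationId, String> knownAggregators =
      new ConcurrentHashMap<ApplicationId, String>();

  public void registerAggregator(ApplicationId appId, String serviceAddr) {
    registeredAggregators.put(appId, serviceAddr);
    knownAggregators.put(appId, serviceAddr);
  }

  // Must run for every finished application, including the AM/aggregator
  // failover paths, or the entries leak and NM memory grows unbounded.
  public void onApplicationFinished(ApplicationId appId) {
    registeredAggregators.remove(appId);
    knownAggregators.remove(appId);
  }
}
{code}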

bq. Aggregator<->NM is the server-side protocol. The related classes and .proto should
be put in yarn-server-common, like ResourceTracker.
Right, thanks for pointing this out. I had been thinking the aggregator belongs in application/user land. Given the current layout it belongs in server land, so I will get it updated in the v6 patch (a fairly large change, though).

bq. I'm not sure if we need to change ApplicationMaster. Even currently, the client will retry
on connection problem with the server.
It is still necessary for now. Without a separate thread, the AM's main thread would block there and the AM-RM heartbeat could not start, which causes the deadlock [~Naganarasimha] mentioned earlier.
A cleaner solution would be to give TimelineClient a non-blocking call (and an event loop), which I have already put on my TODO list (and noted in comments in the patch). Let's keep things as they are for now so the end-to-end flow works.
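
Roughly what the ApplicationMaster change amounts to, sketched with placeholder names (the class and method below are illustrative, not the exact patch code):
{code:java}
// Illustrative sketch only: confine the blocking timeline work to its own
// thread so the AM main thread can register with the RM and start the
// AM-RM heartbeat without the deadlock described above.
public class AMTimelinePublisherSketch {

  public void start() {
    Thread timelineThread = new Thread(new Runnable() {
      @Override
      public void run() {
        // Waits/retries until the per-app aggregator address is known,
        // then publishes entities; all blocking stays on this thread.
        publishApplicationEntities();
      }
    }, "TimelinePublisher");
    timelineThread.setDaemon(true);
    timelineThread.start();
    // The caller (AM main thread) returns immediately and can proceed
    // to register with the RM and run the heartbeat loop.
  }

  // Placeholder for the real TimelineClient calls.
  private void publishApplicationEntities() { }
}
{code}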

bq.  IMHO, for the client, we can set TimelineClient as the listener of AMRMClient by adding
AMRMClient#registerTimelineAggregatorAddressListener(TimelineClient client). Therefore, the
user just needs to make one additional call "registerTimelineAggregatorAddressListener" to
combine AMRMClient and TimelineClient. Inside TimelineClientImpl, there's no need to wait
in loop for the address update, and to override onAggregatorAddressUpdated. AMRMClientImpl
call it when get the update from heartbeat response.
I think a loop waiting for the address update (or for the service to become available) is still necessary, because even after the address is updated it may be a broken address, and retry logic is needed for that case. Another case where the retry logic helps is when the service address is not updated but the service recovers for some other reason. Maybe the current approach of registering a callback with AMRMClient (or something equivalent for the NodeManager) and listening for address updates is enough for now, and we can improve it a bit later (together with the non-blocking call in TimelineClient) by adding a wake-up notification when a heartbeat arrives within the sleep interval?
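
A rough sketch of how the waiting loop and a wake-up notification could work together; onAggregatorAddressUpdated follows the naming in the proposal above, everything else is hypothetical:
{code:java}
// Hypothetical sketch combining the two ideas: the AMRMClient heartbeat
// handling delivers the address via a callback, while the publisher side
// waits in a bounded loop and is woken up early when the callback fires.
public class AggregatorAddressHolder {
  private final Object lock = new Object();
  private volatile String aggregatorAddress;

  // Invoked when the allocate response carries a new aggregator address.
  public void onAggregatorAddressUpdated(String newAddress) {
    synchronized (lock) {
      aggregatorAddress = newAddress;
      lock.notifyAll();   // wake up any waiter between sleep intervals
    }
  }

  // Waits for an address, but still returns periodically so the caller
  // can retry: the reported address may be stale or briefly unreachable.
  public String waitForAddress(long timeoutMillis) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    synchronized (lock) {
      while (aggregatorAddress == null) {
        long remaining = deadline - System.currentTimeMillis();
        if (remaining <= 0) {
          return null;   // caller decides whether to retry
        }
        lock.wait(remaining);
      }
      return aggregatorAddress;
    }
  }
}
{code}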

bq. In TimelineClientImpl, it seems to be not necessary to add additional retry logic, as
the client has the retry logic as ClientFilter yet.
I am not sure whether the Jersey client supports retrying against a different URI. From my quick glance, it looks like the client binds to a fixed URI. Please let me know if my understanding is not correct.
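
My understanding of the limitation, sketched against the Jersey 1.x client API (the retry wrapper itself is hypothetical, not something Jersey provides out of the box):
{code:java}
import com.sun.jersey.api.client.Client;
import com.sun.jersey.api.client.ClientHandlerException;
import com.sun.jersey.api.client.ClientResponse;
import com.sun.jersey.api.client.WebResource;

// Hypothetical retry wrapper: a ClientFilter can retry a request, but as
// far as I can tell it retries against the URI the WebResource was built
// with. Retrying against a *different* (updated) aggregator URI means
// building a new WebResource, which this sketch does explicitly.
public class RetryOnNewUriSketch {
  private final Client client = Client.create();

  public ClientResponse postWithRetry(String[] candidateUris, String json) {
    ClientHandlerException last = null;
    for (String uri : candidateUris) {
      try {
        WebResource resource = client.resource(uri);   // rebind per attempt
        return resource.type("application/json").post(ClientResponse.class, json);
      } catch (ClientHandlerException e) {   // connection-level failure
        last = e;
      }
    }
    throw last != null ? last : new ClientHandlerException("no URI to try");
  }
}
{code}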

bq. BTW, the last patch should no longer apply in TimelineClientImpl. It needs to be rebased.
Thanks for the notice. It sounds like a security change went into trunk; I'll get it rebased in the v6 version.

bq. Here, we should not save the current app state in state-store, otherwise recovery logic
will break - it's currently not expecting non-final state on recovery.
Good comment. How about we persist only the aggregator info and leave everything else as it is (e.g. the Init status)?

bq. I think RMAppAggregatorUpdateTransition should be invoked only on certain application
states that make sense to handle. better not make every app state to handle this transition.
that'll cause unnecessary writes.
It only writes once (in most cases) over an app's whole life cycle. Does that sound like too much for the RMStateStore?
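
A hypothetical sketch of the write-once behaviour I have in mind; the class and method names are placeholders, and the actual patch goes through the RM app state machine and RMStateStore:
{code:java}
// Hypothetical sketch: persist the aggregator address only when it
// actually changes, so in the common case the state store sees a single
// extra write per application rather than one per state transition.
public class AggregatorUpdateWriteOnceSketch {

  private String persistedAggregatorAddr;   // what the state store currently holds

  public void onAggregatorUpdate(String appId, String newAddr) {
    if (newAddr == null || newAddr.equals(persistedAggregatorAddr)) {
      return;   // nothing new to persist; avoid unnecessary writes
    }
    persistedAggregatorAddr = newAddr;
    storeAggregatorAddress(appId, newAddr);
  }

  // Placeholder: in the real patch this would go through the RM state store APIs.
  private void storeAggregatorAddress(String appId, String addr) {
  }
}
{code}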

bq. Given it may needs more treatment for state preserving, shall we spin off the related
code change and defer it to a later jira? I think it shouldn't block our next milestone.
Sounds like a good plan. I will try to separate it out from the v6 patch.

Will deliver the v6 patch soon to reflect the above comments.

> [Aggregator wireup] Implement ATS app-appgregator service discovery
> -------------------------------------------------------------------
>
>                 Key: YARN-3039
>                 URL: https://issues.apache.org/jira/browse/YARN-3039
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Sangjin Lee
>            Assignee: Junping Du
>         Attachments: Service Binding for applicationaggregator of ATS (draft).pdf, Service
Discovery For Application Aggregator of ATS (v2).pdf, YARN-3039-no-test.patch, YARN-3039-v2-incomplete.patch,
YARN-3039-v3-core-changes-only.patch, YARN-3039-v4.patch, YARN-3039-v5.patch
>
>
> Per design in YARN-2928, implement ATS writer service discovery. This is essential for
off-node clients to send writes to the right ATS writer. This should also handle the case
of AM failures.



