hadoop-yarn-issues mailing list archives

From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] (YARN-6136) Registry should avoid scanning whole ZK tree for every container/application finish
Date Tue, 31 Jan 2017 22:04:51 GMT

    [ https://issues.apache.org/jira/browse/YARN-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847601#comment-15847601 ]

Wangda Tan commented on YARN-6136:

Per my understanding, the reason the existing service registry design does a full scan of the
ZK tree is that it has no prior knowledge of how the znodes are organized for services. So it
has to scan the whole ZK tree to find the first matching znode whose YARN_ID equals the
finished container / application id.

To solve the problem, I think we can enforce a rule for how the ZK tree is organized, for
services and for internal daemons. For example, for internal daemons:
/yarn-daemons/{role-name, like RM/NM}/{host:port}
With this we can directly locate the znode from the finished container / app info.
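To illustrate the idea, here is a minimal sketch of how a znode path could be computed directly from a daemon's role and address under such a fixed layout, instead of scanning the tree. The class and method names are illustrative only and are not part of the actual YARN registry API:

```java
// Hypothetical sketch: with a fixed layout like
// /yarn-daemons/{role-name}/{host:port}, the znode for a daemon can be
// computed directly, so a purge needs no tree-wide scan.
public class RegistryPathSketch {

    // Builds the znode path for an internal daemon.
    // "role" is e.g. "RM" or "NM"; host and port identify the daemon.
    static String daemonPath(String role, String host, int port) {
        return String.format("/yarn-daemons/%s/%s:%d", role, host, port);
    }

    public static void main(String[] args) {
        // A purge for a finished daemon would delete exactly this path.
        System.out.println(daemonPath("RM", "rm-host.example.com", 8032));
    }
}
```

With this, a purge becomes a single targeted delete (O(1) ZK operations) rather than a recursive listing of every znode under the root.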

Not sure if this has been discussed already, or if there are other approaches to solve
the issue. [~steve_l], could you please add your thoughts?

+ [~vinodkv], [~sidharta-s], [~gsaha], [~jianhe].

> Registry should avoid scanning whole ZK tree for every container/application finish
> -----------------------------------------------------------------------------------
>                 Key: YARN-6136
>                 URL: https://issues.apache.org/jira/browse/YARN-6136
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: api, resourcemanager
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>            Priority: Critical
> In the existing registry service implementation, the purge operation triggered by a container finish is:
> {code}
>   public void onContainerFinished(ContainerId id) throws IOException {
>     LOG.info("Container {} finished, purging container-level records",
>         id);
>     purgeRecordsAsync("/",
>         id.toString(),
>         PersistencePolicies.CONTAINER);
>   }
> {code} 
> Since this happens on every container finish, it essentially scans all (or almost all)
> ZK nodes from the root.
> We have a cluster with hundreds of ZK nodes for the service registry and 20K+ ZK nodes
> for other purposes. The existing implementation can generate massive numbers of ZK
> operations and internal Java objects (RegistryPathStatus) as well. The RM becomes very
> unstable when there are batch container-finish events, because of full GC pauses and ZK
> connection failures.

This message was sent by Atlassian JIRA

