Date: Wed, 1 Feb 2017 00:44:52 +0000 (UTC)
From: "Gour Saha (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-6136) YARN registry service should avoid scanning whole ZK tree for every container/application finish

    [ https://issues.apache.org/jira/browse/YARN-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847800#comment-15847800 ]

Gour Saha commented on YARN-6136:
---------------------------------

[~wangda] FYI, Slider today uses the following path:

{code}
/registry/users/{user-id}/services/org-apache-slider/{app-name}/components/{container-id}
{code}
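(Editorial illustration, not part of the original comment: a minimal sketch of how a container-finish purge could be scoped to the Slider subtree above instead of starting at the ZK root. The class and helper names, and the example identifiers, are hypothetical; the user and application names are assumed to be known to the service at purge time.)

{code}
// Hypothetical sketch: build the container-scoped registry path so a purge
// can be limited to one subtree instead of scanning from "/".
public final class ScopedPurgeSketch {

  /** e.g. /registry/users/hbase/services/org-apache-slider/hbase1/components/container_0001 */
  static String containerRecordPath(String user, String appName, String containerId) {
    return String.format(
        "/registry/users/%s/services/org-apache-slider/%s/components/%s",
        user, appName, containerId);
  }

  public static void main(String[] args) {
    // With the scoped path known, a recursive delete of this single subtree
    // could replace the whole-tree scan done by purgeRecordsAsync("/", ...).
    System.out.println(containerRecordPath("hbase", "hbase1", "container_0001"));
  }
}
{code}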
> YARN registry service should avoid scanning whole ZK tree for every container/application finish
> ------------------------------------------------------------------------------------------------
>
>                 Key: YARN-6136
>                 URL: https://issues.apache.org/jira/browse/YARN-6136
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: api, resourcemanager
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>            Priority: Critical
>
> In the existing registry service implementation, a purge operation is triggered by every container finish event:
> {code}
> public void onContainerFinished(ContainerId id) throws IOException {
>   LOG.info("Container {} finished, purging container-level records",
>       id);
>   purgeRecordsAsync("/",
>       id.toString(),
>       PersistencePolicies.CONTAINER);
> }
> {code}
> Since this happens on every container finish, it essentially scans all (or almost all) ZK nodes from the root.
> We have a cluster which has hundreds of ZK nodes for the service registry and 20K+ ZK nodes for other purposes. The existing implementation can generate massive numbers of ZK operations and internal Java objects (RegistryPathStatus). The RM becomes very unstable when there are batches of container finish events, because of full GC pauses and ZK connection failures.
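(Editorial illustration, not from the original report: a standalone ZooKeeper walk that mirrors the shape of a root-anchored purge. Every znode under the starting path is visited, so anchoring at "/" makes the work proportional to the whole tree, 20K+ nodes in the cluster described above, rather than to the few registry records that actually match the finished container. This is not the RegistryAdminService code; the connect string and paths are placeholders.)

{code}
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Standalone sketch: count how many znodes a purge anchored at `root` would
// have to visit. Anchoring at "/" visits the entire tree.
public final class PurgeScanCost {

  static int countNodes(ZooKeeper zk, String root)
      throws KeeperException, InterruptedException {
    int visited = 1;                        // the node itself (stat/record check)
    List<String> children = zk.getChildren(root, false);
    for (String child : children) {
      String childPath = root.endsWith("/") ? root + child : root + "/" + child;
      visited += countNodes(zk, childPath);  // recurse, exactly like a full-tree scan
    }
    return visited;
  }

  public static void main(String[] args) throws Exception {
    // Placeholder connect string; point at a real ensemble to try it.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });
    try {
      System.out.println("znodes visited from /:         " + countNodes(zk, "/"));
      System.out.println("znodes visited from /registry: " + countNodes(zk, "/registry"));
    } finally {
      zk.close();
    }
  }
}
{code}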