Date: Wed, 18 May 2016 13:31:12 +0000 (UTC)
From: "Sunil G (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-4494) Recover completed apps asynchronously

[ https://issues.apache.org/jira/browse/YARN-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15288950#comment-15288950 ]

Sunil G commented on YARN-4494:
-------------------------------

Hi [~kasha]

As per the problem statement, if we start recovering completed apps asynchronously, we may not know when this
recovery will be completed. So if we get a query (getApplication/Attempt etc.) during this brief recovery period, we could immediately recover the queried app on the client's request (blocking the client RPC call while we do so) and serve its metrics/state etc. So it won't be a lazy recovery: when there is a request, we can recover the app immediately and serve it.

bq. If yes, do we recover everything when someone requests all apps? How about apps that match a specific category?

I was thinking along the same lines earlier. But we may block the client call for a long time here, until all apps are recovered. There are two options:
1) Block the client call until all apps are recovered (this may take too long, and a timeout may happen).
2) Throw an error message/exception to the client indicating that recovery is in progress.

Neither of these is a very clean solution. But we have seen some drawbacks of recovering completed apps (when there are thousands of them). To avoid this issue, the max-completed applications limit was configured to a lower value.

cc/[~rohithsharma]

[~kasha], please share your thoughts.

> Recover completed apps asynchronously
> -------------------------------------
>
>                 Key: YARN-4494
>                 URL: https://issues.apache.org/jira/browse/YARN-4494
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>
> With RM HA enabled, when recovering apps, recover completed apps asynchronously.
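
A minimal sketch of the on-demand recovery idea from the comment above, for illustration only. It is not ResourceManager code: `CompletedAppRecovery`, `loadFromStore`, `recoverAllAsync` and the `String` app state are hypothetical stand-ins, and the real state store and RPC layers are not modelled. It assumes each completed app can be recovered independently. A query for a single app recovers that app on the calling thread (so the client call blocks only for that one app), while the remaining apps are recovered in the background.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.FutureTask;
import java.util.function.Function;

// Hypothetical illustration; none of these names exist in Hadoop YARN.
final class CompletedAppRecovery {
    // One recovery task per app id, created at most once.
    private final Map<String, FutureTask<String>> tasks = new ConcurrentHashMap<>();
    // Stand-in for reading a completed app's state from the state store.
    private final Function<String, String> loadFromStore;

    CompletedAppRecovery(Function<String, String> loadFromStore) {
        this.loadFromStore = loadFromStore;
    }

    private FutureTask<String> taskFor(String appId) {
        return tasks.computeIfAbsent(appId,
            id -> new FutureTask<String>(() -> loadFromStore.apply(id)));
    }

    // Asynchronous path: queue every completed app for background recovery.
    void recoverAllAsync(List<String> appIds, ExecutorService pool) {
        for (String id : appIds) {
            pool.execute(taskFor(id));
        }
    }

    // Query path: recover the requested app right away on the caller's thread.
    // FutureTask.run() does nothing if the task has already started or finished,
    // so the app is loaded at most once; get() then blocks until it is available.
    String getApplication(String appId)
            throws InterruptedException, ExecutionException {
        FutureTask<String> task = taskFor(appId);
        task.run();
        return task.get();
    }
}
```

A query for all apps would still have to wait for every task, which is the long blocking call described in option 1; this sketch does not address that case.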