Return-Path: X-Original-To: apmail-hadoop-yarn-issues-archive@minotaur.apache.org Delivered-To: apmail-hadoop-yarn-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id C36C8188BB for ; Fri, 4 Mar 2016 10:02:41 +0000 (UTC) Received: (qmail 11397 invoked by uid 500); 4 Mar 2016 10:02:41 -0000 Delivered-To: apmail-hadoop-yarn-issues-archive@hadoop.apache.org Received: (qmail 11347 invoked by uid 500); 4 Mar 2016 10:02:41 -0000 Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: yarn-issues@hadoop.apache.org Delivered-To: mailing list yarn-issues@hadoop.apache.org Received: (qmail 11235 invoked by uid 99); 4 Mar 2016 10:02:41 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 04 Mar 2016 10:02:41 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id D6E4A2C1F68 for ; Fri, 4 Mar 2016 10:02:40 +0000 (UTC) Date: Fri, 4 Mar 2016 10:02:40 +0000 (UTC) From: "Junping Du (JIRA)" To: yarn-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Updated] (YARN-4602) Scalable and Simple Message Service for YARN application MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/YARN-4602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-4602: ----------------------------- Description: We are proposing to support MR AM restart with work preserving in MAPREDUCE-6608 (https://issues.apache.org/jira/browse/MAPREDUCE-6608) that when AM get failed for some reason, the inflight tasks will keep running/pending until new AM attempt comes back to continue. One of prerequisite is tasks should know where the new AM attempt get launched so TaskUmbilicalProtocol can get retry between clients and new server. There could be the same requirement for other applications running on YARN too. Some application decide to handle message delivery itself, e.g. Long running services can leverage Slider agent to notify messages back and forth. However, vanilla applications on YARN is hard to achieve this because Hadoop RPC mechanism essentially is a single way of communication. Although two directions mechanism like heartbeats (between NM-RM or AM-RM) can get built on top of it, it make less sense to build the same mechanism between AM and its application containers - or it need to handle massive of client connections in AM which could be the new bottleneck for scalability and very complicated in state maintaining. Instead, we need a new message mechanism that is simple and scalable. was: Currently, mostly communications among YARN daemons, services and applications are go through RPC. In almost all cases, logic running inside of containers are RPC client but not server because it get launched inflight. The only special case is AM container, because it get launched earlier than any other containers so it can be RPC server and tell new coming containers server address in application logic (like MR AM). The side effects are: 1. When AM container get failed, the new AM attempts will get launched with new address/port, so previous RPC are broken. 2. Application's requirement are variable, there could be other dependency between containers (not AM), so some container failed over will affect other containers' running logic. It is better to have some message/notification mechanism between containers for handle above cases. > Scalable and Simple Message Service for YARN application > -------------------------------------------------------- > > Key: YARN-4602 > URL: https://issues.apache.org/jira/browse/YARN-4602 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications, resourcemanager > Reporter: Junping Du > Assignee: Junping Du > > We are proposing to support MR AM restart with work preserving in MAPREDUCE-6608 (https://issues.apache.org/jira/browse/MAPREDUCE-6608) that when AM get failed for some reason, the inflight tasks will keep running/pending until new AM attempt comes back to continue. One of prerequisite is tasks should know where the new AM attempt get launched so TaskUmbilicalProtocol can get retry between clients and new server. > There could be the same requirement for other applications running on YARN too. Some application decide to handle message delivery itself, e.g. Long running services can leverage Slider agent to notify messages back and forth. However, vanilla applications on YARN is hard to achieve this because Hadoop RPC mechanism essentially is a single way of communication. Although two directions mechanism like heartbeats (between NM-RM or AM-RM) can get built on top of it, it make less sense to build the same mechanism between AM and its application containers - or it need to handle massive of client connections in AM which could be the new bottleneck for scalability and very complicated in state maintaining. Instead, we need a new message mechanism that is simple and scalable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)