Date: Mon, 1 May 2017 19:12:04 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: dev@apex.incubator.apache.org
Reply-To: dev@apex.apache.org
Subject: [jira] [Commented] (APEXCORE-714) Reusable instance operator recovery
Mailing-List: contact dev-help@apex.apache.org; run by ezmlm
archived-at: Mon, 01 May 2017 19:12:09 -0000


    [ https://issues.apache.org/jira/browse/APEXCORE-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15991331#comment-15991331 ]

ASF GitHub Bot commented on APEXCORE-714:
-----------------------------------------

GitHub user PramodSSImmaneni opened a pull request:

    https://github.com/apache/apex-core/pull/522

    APEXCORE-714 Adding a new recovery mode where the operator instance before a failure event can be reused when recovering from an upstream operator failure

    [Review Only]

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/PramodSSImmaneni/apex-core APEXCORE-714

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/apex-core/pull/522.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #522

----
commit 72f56dda9d4d244bbbf23ccde657435b94267362
Author: Pramod Immaneni
Date:   2017-03-08T03:29:02Z

    APEXCORE-714 Adding a new recovery mode where the operator instance before a failure event can be reused when recovering
    from an upstream operator failure

----


> Reusable instance operator recovery
> -----------------------------------
>
>                 Key: APEXCORE-714
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-714
>             Project: Apache Apex Core
>          Issue Type: Improvement
>            Reporter: Pramod Immaneni
>            Assignee: Pramod Immaneni
>
> In a failure scenario, when a container fails, it is redeployed along with all the operators in it. The operators downstream of these operators are also redeployed within their containers. The operators are restored from their checkpoint and connect to the appropriate point in the stream according to the processing mode. In at-least-once mode, for example, the data is replayed from the same checkpoint.
> Restoring an operator's state from checkpoint can be a costly operation, depending on the size of the state. In some use cases, depending on the operator logic, reusing the current instance after an upstream failure, without restoring the operator from checkpoint, still produces the same results with the data replayed from the last fully processed window. The operator state can remain as it was before the upstream failure: the same operator instance is reused, and only the streams and the window are reset to the window after the last fully processed window, which guarantees at-least-once processing of tuples. If the container in which the operator itself is running goes down, the operator would of course need to be restored from checkpoint. This scenario occurs in some batch use cases with operators that have a large state.
> I would like to propose adding the ability for a user to explicitly identify operators as being of this type, along with the corresponding functionality in the engine to handle their recovery as described above: not restoring their state from checkpoint, reusing the instance, and restoring the stream to the window after the last fully processed window for the operator. When operators are not identified as being of this type, the default behavior is what it is today and nothing changes.
> I have done some prototyping on the engine side to ensure that this is possible with our current code base without requiring a massive overhaul. The main concerns were the restoration of the operator instance within the Node in the streaming container and the re-establishment of the subscriber stream at a window in the buffer server that the publisher (upstream) has not yet reached, since it would be restarting from checkpoint. I have been able to get it all working successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
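
The recovery decision described in the issue can be sketched roughly as follows. This is only an illustrative model in Python, not code from the pull request or from the Apex engine: the names `OperatorStatus`, `RecoveryAction`, `plan_recovery` and the `reusable` flag are all hypothetical. It also assumes that windows are numbered consecutively and that a checkpoint captures state at the end of its window, so recovery resumes at the following window.

```python
from dataclasses import dataclass
from enum import Enum


class RecoveryAction(Enum):
    RESTORE_FROM_CHECKPOINT = "restore_from_checkpoint"  # today's default
    REUSE_INSTANCE = "reuse_instance"                    # proposed new mode


@dataclass
class OperatorStatus:
    """Hypothetical view of an operator affected by a failure."""
    name: str
    reusable: bool                    # user explicitly marked the operator
    own_container_failed: bool        # False means only upstream failed
    checkpoint_window: int            # window of the last checkpoint
    last_fully_processed_window: int  # last window the instance completed


def plan_recovery(op):
    """Return (action, window at which the input streams resume)."""
    if op.reusable and not op.own_container_failed:
        # Upstream-only failure: keep the live instance and its state,
        # and reset the streams to the window after the last fully
        # processed one.
        return RecoveryAction.REUSE_INSTANCE, op.last_fully_processed_window + 1
    # Default behaviour, and the only option when the operator's own
    # container went down: restore state and replay from the checkpoint.
    # Resuming at checkpoint_window + 1 is an assumption of this sketch.
    return RecoveryAction.RESTORE_FROM_CHECKPOINT, op.checkpoint_window + 1


if __name__ == "__main__":
    agg = OperatorStatus("aggregator", True, False, 10, 57)
    print(plan_recovery(agg))
```

In this model an operator that is not marked `reusable` always takes the checkpoint path, which matches the statement in the issue that the default behavior does not change.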