Return-Path: X-Original-To: apmail-flink-issues-archive@minotaur.apache.org Delivered-To: apmail-flink-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id C939A1751B for ; Thu, 21 May 2015 10:20:27 +0000 (UTC) Received: (qmail 12154 invoked by uid 500); 21 May 2015 10:20:27 -0000 Delivered-To: apmail-flink-issues-archive@flink.apache.org Received: (qmail 12106 invoked by uid 500); 21 May 2015 10:20:27 -0000 Mailing-List: contact issues-help@flink.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@flink.apache.org Delivered-To: mailing list issues@flink.apache.org Received: (qmail 12092 invoked by uid 99); 21 May 2015 10:20:27 -0000 Received: from Unknown (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 21 May 2015 10:20:27 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 708F7182865 for ; Thu, 21 May 2015 10:20:24 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 0.971 X-Spam-Level: X-Spam-Status: No, score=0.971 tagged_above=-999 required=6.31 tests=[KAM_LAZY_DOMAIN_SECURITY=1, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, T_RP_MATCHES_RCVD=-0.01, URIBL_BLOCKED=0.001] autolearn=disabled Received: from mx1-us-west.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id lz_O1AOU0z-D for ; Thu, 21 May 2015 10:20:18 +0000 (UTC) Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by mx1-us-west.apache.org (ASF Mail Server at mx1-us-west.apache.org) with SMTP id 94561204F2 for ; Thu, 21 May 2015 10:20:18 +0000 (UTC) Received: (qmail 12075 invoked by uid 99); 21 May 2015 10:20:18 -0000 Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 21 May 2015 10:20:18 +0000 Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33) id 499ABE35CB; Thu, 21 May 2015 10:20:18 +0000 (UTC) From: uce To: issues@flink.incubator.apache.org Reply-To: issues@flink.incubator.apache.org Message-ID: Subject: [GitHub] flink pull request: [FLINK-1954] [FLINK-1636] [runtime] Improve pa... Content-Type: text/plain Date: Thu, 21 May 2015 10:20:18 +0000 (UTC) GitHub user uce opened a pull request: https://github.com/apache/flink/pull/705 [FLINK-1954] [FLINK-1636] [runtime] Improve partition not found error handling **Problem**: cancelling of tasks sometimes leads to misleading error messages about "not found partitions". This is an artifact of task cancelling. If a task (consumer) consumes data from another remote task (producer), its sends a partition request over the network. If the producer fails concurrently with this request, the request returns with a PartitioNotFoundException to the consumer. If this error message is received *before* the consumer is cancelled (as a result of the failing producer), you see the misleading error being attributed to the consumer. This makes it hard to trace the root cause of the problem (the failing producer). **Solution**: when a consumer receives a remote PartitionNotFoundException, it asks the central job manager whether the producer is still running or has failed. If the producer is still running, the partition request is send again (using an exponential back off with default max back off of 3s). If the following requests fail again, the consumer fails with a PartitionNotFoundException. If the producer has failed, the consumer is cancelled. If the producer is not running and has not failed, there is a bug either in the consumer task setup (e.g. requesting a non-existing result) or in the network stack (e.g. unsafe publication of produced results), in which case the error is attributed to the consumer. --- The new Akka messages introduced with this change are only exchanged in error cases and don't affect normal operation. Normal operation (not affected by this change): ``` TM1 => TM2: request result TM2 => TM1: result ``` Error case: ``` TM1=>TM2: request result TM2=>TM1: PartitionNotFoundException TM1=>JM: check partition state JM=>TM1: retrigger request -OR- cancel consumer ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/uce/incubator-flink partition_not_found-1636 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/705.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #705 ---- commit a480a638056e4f5300573f543acf46cd239f2674 Author: Ufuk Celebi Date: 2015-05-11T14:34:55Z [FLINK-1954] [FLINK-1636] [runtime] Improve partition not found error handling Problem: cancelling of tasks sometimes leads to misleading error messages about "not found partitions". This is an artifact of task cancelling. If a task (consumer) consumes data from another remote task (producer), its sends a partition request over the network. If the producer fails concurrently with this request, the request returns with a PartitioNotFoundException to the consumer. If this error message is received *before* the consumer is cancelled (as a result of the failing producer), you see the misleading error being attributed to the consumer. This makes it hard to trace the root cause of the problem (the failing producer). Solution: when a consumer receives a remote PartitionNotFoundException, it asks the central job manager whether the producer is still running or has failed. If the producer is still running, the partition request is send again (using an exponential back off). If the following requests fail again, the consumer fails with a PartitionNotFoundException. If the producer has failed, the consumer is cancelled. If the producer is not running and has not failed, there is a bug either in the consumer task setup (e.g. requesting a non-existing result) or in the network stack (e.g. unsafe publication of produced results), in which case the error is attributed to the consumer. --- The new Akka messages introduced with this change are only exchanged in error cases and don't affect normal operation. Normal operation (not affected by this change): - TM1=>TM2: request result - TM2=>TM1: result Error case: - TM1=>TM2: request result - TM2=>TM1: PartitionNotFoundException - TM1=>JM: check partition state - JM=>TM1: retrigger request -OR- cancel consumer ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastructure@apache.org or file a JIRA ticket with INFRA. ---