Mailing-List: contact dev-help@apex.apache.org; run by ezmlm
Reply-To: dev@apex.apache.org
Date: Fri, 14 Apr 2017 23:08:42 +0000 (UTC)
From: "Thomas Weise (JIRA)"
To: dev@apex.incubator.apache.org
Subject: [jira] [Commented] (APEXCORE-703) Window processing timeout for finished/undeployed container
archived-at: Fri, 14 Apr 2017 23:08:47 -0000

    [ https://issues.apache.org/jira/browse/APEXCORE-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969655#comment-15969655 ]

Thomas Weise commented on APEXCORE-703:
---------------------------------------

Operators will be undeployed and removed from the plan when they raise the shutdown exception, which happens in Beam when there is no more input. The hunch is that operator "1" has gone idle and been undeployed, but is still flagged as timed out. That's what needs to be investigated further.

> Window processing timeout for finished/undeployed container
> -----------------------------------------------------------
>
>                 Key: APEXCORE-703
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-703
>             Project: Apache Apex Core
>          Issue Type: Bug
>    Affects Versions: 3.5.0
>            Reporter: Daniel Halperin
>
> Using Apex 3.5.0 with Apache Beam, I have a 10-container pipeline. The first container, id #1, finishes and gets undeployed at 12:41:10 PM.
> Then, 60s later (at 12:42:10 PM), Apex decides that container is blocked because no data has been received for 60s, declares failure, and restarts it.
>
> This would seem to be a bug -- shouldn't finished and undeployed operators be deregistered from the timeout logic that is detecting stuck operators?
>
> Log below
> {code}
> Apr 14, 2017 12:41:10 PM com.datatorrent.stram.engine.StreamingContainer processHeartbeatResponse
> INFO: Undeploy request: [1]
> Apr 14, 2017 12:41:10 PM com.datatorrent.stram.engine.StreamingContainer undeploy
> INFO: Undeploy complete.
> Apr 14, 2017 12:42:10 PM com.datatorrent.stram.StreamingContainerManager updateRecoveryCheckpoints
> WARNING: Marking operator PTOperator[id=1,name=TextIO.Read/Read] blocked committed window ffffffffffffffff, recovery window ffffffffffffffff, current time 1492198930012, last window id change time 1492198869957, window processing timeout millis 60000
> Apr 14, 2017 12:42:10 PM com.datatorrent.stram.StreamingContainerManager updateCheckpoints
> INFO: Blocked operator PTOperator[id=1,name=TextIO.Read/Read] container PTContainer[id=1(container-6),state=ACTIVE] time 60055ms
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.engine.StreamingContainer processHeartbeatResponse
> INFO: Received shutdown request
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.StramLocalCluster run
> INFO: Container container-6 restart.
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.StreamingContainerManager scheduleContainerRestart
> INFO: Initiating recovery for container-6@localhost
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.StreamingContainerManager updateRecoveryCheckpoints
> WARNING: Marking operator PTOperator[id=1,name=TextIO.Read/Read] blocked committed window ffffffffffffffff, recovery window ffffffffffffffff, current time 1492198931015, last window id change time 1492198869957, window processing timeout millis 60000
> {code}

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
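
For anyone following along, the arithmetic in the log and the behaviour the report asks for can be illustrated with a small standalone sketch. This is not Apex code: the class `BlockedOperatorCheck`, the `State` enum, `OperatorStatus`, and both methods are hypothetical names invented here, and the assumption that an undeployed operator carries a distinguishable state is mine, not something the issue confirms. It only shows that 1492198930012 - 1492198869957 = 60055 ms exceeds the 60000 ms timeout, and what skipping non-active operators in such a check might look like.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only; none of these names come from the Apex code base.
final class BlockedOperatorCheck {

    // Assumed lifecycle states; the real plan model may differ.
    enum State { ACTIVE, UNDEPLOYED }

    static final class OperatorStatus {
        final int id;
        final State state;
        final long lastWindowIdChangeMillis;

        OperatorStatus(int id, State state, long lastWindowIdChangeMillis) {
            this.id = id;
            this.state = state;
            this.lastWindowIdChangeMillis = lastWindowIdChangeMillis;
        }
    }

    // Time since the operator last advanced its window id.
    static long elapsedMillis(long nowMillis, OperatorStatus op) {
        return nowMillis - op.lastWindowIdChangeMillis;
    }

    // Returns ids of operators considered blocked, ignoring any that are
    // no longer active (the behaviour the report suggests).
    static List<Integer> findBlocked(List<OperatorStatus> ops, long nowMillis, long timeoutMillis) {
        List<Integer> blocked = new ArrayList<>();
        for (OperatorStatus op : ops) {
            if (op.state != State.ACTIVE) {
                continue; // finished/undeployed: not subject to the timeout
            }
            if (elapsedMillis(nowMillis, op) > timeoutMillis) {
                blocked.add(op.id);
            }
        }
        return blocked;
    }

    public static void main(String[] args) {
        long now = 1492198930012L;        // "current time" from the log
        long lastChange = 1492198869957L; // "last window id change time" from the log
        long timeout = 60000L;            // "window processing timeout millis" from the log

        OperatorStatus stillActive = new OperatorStatus(1, State.ACTIVE, lastChange);
        OperatorStatus undeployed = new OperatorStatus(1, State.UNDEPLOYED, lastChange);

        System.out.println(elapsedMillis(now, stillActive)); // 60055, as in "time 60055ms"
        System.out.println(findBlocked(Arrays.asList(stillActive), now, timeout));
        System.out.println(findBlocked(Arrays.asList(undeployed), now, timeout));
    }
}
```

With the values from the log, an operator still treated as active is reported as blocked, while the same operator marked undeployed is skipped. Whether the actual cause is a missing state check of this kind is exactly what still needs to be investigated.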