Date: Mon, 6 Nov 2017 00:02:00 +0000 (UTC)
From: "Markus Weimer (JIRA)"
To: dev@reef.apache.org
Subject: [jira] [Commented] (REEF-1949) Closing ThreadPoolStage before tasks are finished

    [ https://issues.apache.org/jira/browse/REEF-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239778#comment-16239778 ]

Markus Weimer commented on REEF-1949:
-------------------------------------

This could be another instance where the locking in the bridge creates a bottleneck. [~juliaw], what do you think?

> Closing ThreadPoolStage before tasks are finished
> -------------------------------------------------
>
>                 Key: REEF-1949
>                 URL: https://issues.apache.org/jira/browse/REEF-1949
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF Driver
>    Affects Versions: 0.17
>            Reporter: Pei Jiang
>
> In EvaluatorManager.onEvaluatorDone(),
> {code}
> // This relies on the dispatcher to call the CompletedEvaluator handler.
> this.messageDispatcher.onEvaluatorCompleted(new CompletedEvaluatorImpl(this.evaluatorId));
> // This will close the dispatcher, which in turns shut down the executor in ThreadPoolStage.
> this.close();
> {code}
> Since in onEvaluatorCompleted the message sending task is submitted to an executor, there is no guarantee that the CompletedEvaluator message will be sent before the termination of the executor in this.close() call. When this happens, the CompletedEvaluator handler will not be triggered so the driver will think that some evaluators are alive and hence hang.
> Relevant logs:
> {code}
> Nov 01, 2017 11:05:57 PM org.apache.reef.wake.impl.ThreadPoolStage close
> SEVERE: Closing ThreadPoolStage EvaluatorMessageDispatcher:container_1508975419755_0006_01_000004: Executor did not terminate in 1,000 ms. Dropping 2 tasks
> Nov 01, 2017 11:05:57 PM org.apache.reef.wake.impl.ThreadPoolStage close
> SEVERE: Closing ThreadPoolStage EvaluatorMessageDispatcher:container_1508975419755_0006_01_000004: Executor failed to terminate.
> End of LogType:driver.stderr
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
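
The race described in the issue (work handed to an executor, followed immediately by a close that gives the executor a bounded time to drain) can be illustrated outside REEF with plain java.util.concurrent. The sketch below is not the ThreadPoolStage or EvaluatorManager code; the class name CloseBeforeDrainSketch, the method runAndClose, and all timing values are hypothetical and chosen only to make the effect visible. It assumes a single worker thread, so the first task starts running while the rest wait in the queue.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; not taken from the REEF code base.
final class CloseBeforeDrainSketch {

  /**
   * Queues `tasks` slow tasks on a single-threaded executor, then closes it
   * with a bounded wait. Returns {completedTasks, droppedTasks}.
   */
  static int[] runAndClose(int tasks, long taskMillis, long closeTimeoutMillis) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    AtomicInteger completed = new AtomicInteger();

    for (int i = 0; i < tasks; i++) {
      executor.execute(() -> {
        try {
          Thread.sleep(taskMillis);      // stands in for sending a message
          completed.incrementAndGet();   // counted only if not interrupted
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }

    int dropped = 0;
    try {
      // Stop accepting new work, then wait a bounded time for queued work.
      executor.shutdown();
      if (!executor.awaitTermination(closeTimeoutMillis, TimeUnit.MILLISECONDS)) {
        // Timeout: interrupt the running task and discard everything queued.
        dropped = executor.shutdownNow().size();
        // Let the interrupted task unwind before reading the counter.
        executor.awaitTermination(5, TimeUnit.SECONDS);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while closing", e);
    }
    return new int[] {completed.get(), dropped};
  }

  public static void main(String[] args) {
    int[] hasty = runAndClose(3, 500, 50);    // close timeout shorter than one task
    System.out.println("hasty close:   completed=" + hasty[0] + " dropped=" + hasty[1]);

    int[] patient = runAndClose(3, 10, 2000); // close timeout covers all tasks
    System.out.println("patient close: completed=" + patient[0] + " dropped=" + patient[1]);
  }
}
```

In the first call the close timeout is shorter than a single task, so the running task is interrupted and the queued ones are never executed, which is the same shape of failure as a CompletedEvaluator message that is submitted but never delivered. One possible remedy, offered only as an assumption and not as the fix adopted for REEF-1949, is to wait for the completion message to be handled before closing the dispatcher.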