Date: Wed, 23 Aug 2017 02:48:00 +0000 (UTC)
From: "Julia (JIRA)"
To: dev@reef.apache.org
Reply-To: dev@reef.apache.org
Subject: [jira] [Created] (REEF-1870) Kill slower Evaluators in IMRU after timeout in data loading
archived-at: Wed, 23 Aug 2017 02:48:05 -0000

Julia created REEF-1870:
---------------------------

             Summary: Kill slower Evaluators in IMRU after timeout in data loading
                 Key: REEF-1870
                 URL: https://issues.apache.org/jira/browse/REEF-1870
             Project: REEF
          Issue Type: Improvement
            Reporter: Julia

The job went through 4 retries in total. In each retry, most evaluators finished data downloading/deserialization within 6-30 minutes. About 3 evaluators were very slow; the slowest one took about 2-8 hours to download/deserialize its data in each retry.

Each retry was triggered after a 30-minute timeout (configurable). The Driver cannot send a close event to those slower evaluators until they complete data loading and send an IRunningTask event to the Driver. After running for a long time, the job was killed.

A simple band-aid is to kill the evaluators from which we have not received a RunningTask after the 30-minute timeout, along with cancelling the RunningTasks that have been received. It is needless to wait 8 hours only to cancel RunningTasks that have just finished downloading/deserializing the data.
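The proposed band-aid can be sketched roughly as follows. This is a minimal, self-contained model of the decision only, not REEF or IMRU code: `EvaluatorState`, `plan_timeout_actions`, and `DATA_LOADING_TIMEOUT_S` are hypothetical names introduced for illustration, and how the Driver actually kills an evaluator or closes a running task is left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumption: the 30-minute (configurable) timeout mentioned in the issue.
DATA_LOADING_TIMEOUT_S = 30 * 60


@dataclass
class EvaluatorState:
    """Hypothetical record of what the Driver knows about one evaluator."""
    evaluator_id: str
    # Seconds after retry start at which the RunningTask event arrived;
    # None means the evaluator is still loading data.
    running_task_at: Optional[float] = None


def plan_timeout_actions(
    evaluators: List[EvaluatorState],
    elapsed_s: float,
    timeout_s: float = DATA_LOADING_TIMEOUT_S,
) -> Tuple[List[str], List[str]]:
    """Return (evaluators to kill, running tasks to cancel).

    Before the timeout nothing happens. Once it has elapsed, evaluators
    that never reported a RunningTask are killed outright, and the
    RunningTasks that were received are cancelled.
    """
    if elapsed_s < timeout_s:
        return [], []
    to_kill = [e.evaluator_id for e in evaluators if e.running_task_at is None]
    to_cancel = [e.evaluator_id for e in evaluators if e.running_task_at is not None]
    return to_kill, to_cancel


if __name__ == "__main__":
    evaluators = [
        EvaluatorState("e1", running_task_at=600.0),
        EvaluatorState("e2", running_task_at=1500.0),
        EvaluatorState("e3"),  # still downloading/deserializing
    ]
    print(plan_timeout_actions(evaluators, elapsed_s=1800.0))
```

The point of the split is that the slow evaluators are never waited on: the decision uses only what the Driver has already received when the timeout fires.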