Date: Mon, 27 Aug 2018 22:38:00 +0000 (UTC)
From: "Gopal V (JIRA)"
To: issues@tez.apache.org
Subject: [jira] [Commented] (TEZ-3984) Shuffle: Out of Band DME event sending causes errors

    [ https://issues.apache.org/jira/browse/TEZ-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594304#comment-16594304 ]

Gopal V commented on TEZ-3984:
------------------------------

The specific sequence of events is as follows.

The input throws an exception:
{code}
2018-08-27T17:25:15,579 WARN [TezTR-437616_7273_9_0_0_0 (1520459437616_7273_9_00_000000_0)] runtime.LogicalIOProcessorRuntimeTask: Ignoring exception when closing input calls(cleanup). Exception class=java.io.IOException, message ...
{code}

The output gets closed for memory recovery:
{code}
2018-08-27T17:25:15,579 INFO [TezTR-437616_7273_9_0_0_0 (1520459437616_7273_9_00_000000_0)] impl.PipelinedSorter: Reducer 2: Starting flush of map output
{code}

The sorter pushes the event to the output context directly:
{code}
2018-08-27T17:25:15,990 INFO [TezTR-437616_7273_9_0_0_0 (1520459437616_7273_9_00_000000_0)] impl.PipelinedSorter: Reducer 2: Adding spill event for spill (final update=true), spillId=0
{code}

And Reducer 2 gets the event routed to it.
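
The sequence above can be modelled with a small toy program. This is only a sketch under stated assumptions, not Tez code: `FakeAm`, `FakeOutputContext`, `FakeOutput`, `runFailedTask` and the event strings are all hypothetical stand-ins, and the real runtime is considerably more involved. It is meant to show why events returned from close are harmlessly dropped by cleanup, while events pushed directly to the context still arrive ahead of the failure report.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of the race described above. Every name here is hypothetical. */
public class OutOfBandEventDemo {

    /** Stand-in for the AM: records events in the order they arrive. */
    static final class FakeAm {
        final List<String> received = new ArrayList<>();
    }

    /** Stand-in for an output context whose sendEvents goes straight to the AM. */
    static final class FakeOutputContext {
        private final FakeAm am;
        FakeOutputContext(FakeAm am) { this.am = am; }
        void sendEvents(List<String> events) { am.received.addAll(events); }
    }

    /** An output that either returns its events from close() or sends them out of band. */
    static final class FakeOutput {
        private final FakeOutputContext context;
        private final boolean outOfBand;

        FakeOutput(FakeOutputContext context, boolean outOfBand) {
            this.context = context;
            this.outOfBand = outOfBand;
        }

        List<String> close() {
            // Placeholder for the empty spill event produced when flushing after a failure.
            List<String> events = Collections.singletonList("EMPTY_DME(spillId=0)");
            if (outOfBand) {
                context.sendEvents(events); // bypasses the caller entirely
                return Collections.emptyList();
            }
            return events; // the caller decides what to do with these
        }
    }

    /** Simulates a task whose input fails, followed by cleanup and the failure report. */
    public static List<String> runFailedTask(boolean outOfBand) {
        FakeAm am = new FakeAm();
        FakeOutput output = new FakeOutput(new FakeOutputContext(am), outOfBand);
        try {
            throw new IOException("input failed");
        } catch (IOException e) {
            output.close(); // cleanup: events returned by close are ignored
        }
        am.received.add("TASK_FAILED"); // the failure is reported only after cleanup
        return am.received;
    }

    public static void main(String[] args) {
        System.out.println("in-band:     " + runFailedTask(false));
        System.out.println("out-of-band: " + runFailedTask(true));
    }
}
```

In this model the in-band run delivers only `TASK_FAILED` to the fake AM, while the out-of-band run delivers the empty event first and the failure second, which is the ordering the log lines above show.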

> Shuffle: Out of Band DME event sending causes errors
> ----------------------------------------------------
>
>                 Key: TEZ-3984
>                 URL: https://issues.apache.org/jira/browse/TEZ-3984
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.8.4, 0.9.1, 0.10.0
>            Reporter: Gopal V
>            Priority: Critical
>              Labels: correctness
>
> If a task Input throws an exception, the outputs are also closed in LogicalIOProcessorRuntimeTask.cleanup().
> Cleanup ignores all the events returned by output close; however, if any output tries to send an event out of band by directly calling outputContext.sendEvents(events), those events can reach the AM before the task failure is reported.
> This can cause correctness issues with shuffle, since zero-sized events can be sent out due to an input failure, and downstream tasks may never reattempt a fetch from the valid attempt.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)