tez-issues mailing list archives

From "Gopal V (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TEZ-3984) Shuffle: Out of Band DME event sending causes errors
Date Mon, 27 Aug 2018 22:38:00 GMT

    [ https://issues.apache.org/jira/browse/TEZ-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594304#comment-16594304 ]

Gopal V commented on TEZ-3984:
------------------------------

The specific sequence of events is as follows. First, the input throws an exception.

{code}
2018-08-27T17:25:15,579  WARN [TezTR-437616_7273_9_0_0_0 (1520459437616_7273_9_00_000000_0)]
runtime.LogicalIOProcessorRuntimeTask: Ignoring exception when closing input calls(cleanup).
Exception class=java.io.IOException, message ...
{code}

The output is then closed for memory recovery.

{code}
2018-08-27T17:25:15,579  INFO [TezTR-437616_7273_9_0_0_0 (1520459437616_7273_9_00_000000_0)]
impl.PipelinedSorter: Reducer 2: Starting flush of map output
{code}

The sorter pushes the event to the output context directly.

{code}
2018-08-27T17:25:15,990  INFO [TezTR-437616_7273_9_0_0_0 (1520459437616_7273_9_00_000000_0)]
impl.PipelinedSorter: Reducer 2: Adding spill event for spill (final update=true), spillId=0
{code}

Reducer 2 then gets the event routed to it.
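
The sequence above can be modeled in a few lines of plain Java. This is only a sketch under stated assumptions, not Tez code: FakeOutputContext, FakeOutput, ReturningOutput, OutOfBandOutput and the event strings are all hypothetical names made up for illustration, and the task failure report is simplified to one more entry on the same list. It shows why an event returned from close() is dropped by cleanup, while an event pushed directly to the context during close() is seen before the failure.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the race described above; none of these types are Tez classes.
class OutOfBandEventSketch {

    /** Stand-in for the channel to the AM; records events in arrival order. */
    static class FakeOutputContext {
        final List<String> receivedByAm = new ArrayList<>();

        void sendEvents(List<String> events) {
            receivedByAm.addAll(events);
        }
    }

    /** Stand-in for an output whose close() may return events to the caller. */
    interface FakeOutput {
        List<String> close();
    }

    /** Well-behaved output: hands its events back to the caller of close(). */
    static class ReturningOutput implements FakeOutput {
        @Override
        public List<String> close() {
            return Collections.singletonList("DME(returned, empty)");
        }
    }

    /** Problematic output: sends its event out of band from inside close(). */
    static class OutOfBandOutput implements FakeOutput {
        private final FakeOutputContext context;

        OutOfBandOutput(FakeOutputContext context) {
            this.context = context;
        }

        @Override
        public List<String> close() {
            context.sendEvents(Collections.singletonList("DME(out-of-band, empty)"));
            return Collections.emptyList();
        }
    }

    /** Cleanup after an input failure: closes outputs and discards returned events. */
    static void cleanupAfterInputFailure(List<FakeOutput> outputs) {
        for (FakeOutput output : outputs) {
            try {
                output.close(); // returned events are intentionally ignored here
            } catch (RuntimeException e) {
                // exceptions during cleanup are ignored in this model
            }
        }
    }

    /** Runs the failing-task scenario and returns what the AM stand-in saw, in order. */
    static List<String> runFailedTask() {
        FakeOutputContext context = new FakeOutputContext();
        List<FakeOutput> outputs = new ArrayList<>();
        outputs.add(new ReturningOutput());
        outputs.add(new OutOfBandOutput(context));

        cleanupAfterInputFailure(outputs);
        // Simplification: the failure report is modeled as one more event sent after cleanup.
        context.sendEvents(Collections.singletonList("TASK_FAILED"));
        return context.receivedByAm;
    }

    public static void main(String[] args) {
        System.out.println(runFailedTask());
    }
}
```

In this model the returned event never reaches the list, but the out-of-band event arrives ahead of TASK_FAILED, which mirrors the ordering problem reported here.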

> Shuffle: Out of Band DME event sending causes errors
> ----------------------------------------------------
>
>                 Key: TEZ-3984
>                 URL: https://issues.apache.org/jira/browse/TEZ-3984
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.8.4, 0.9.1, 0.10.0
>            Reporter: Gopal V
>            Priority: Critical
>              Labels: correctness
>
> If a task Input throws an exception, the outputs are also closed in LogicalIOProcessorRuntimeTask.cleanup().
> Cleanup ignores all the events returned by output close; however, if any output tries to send an event out of band by directly calling outputContext.sendEvents(events), then those events can reach the AM before the task failure is reported.
> This can cause correctness issues with shuffle, since zero-sized events can be sent out due to an input failure and downstream tasks may never reattempt a fetch from the valid attempt.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
