hadoop-yarn-issues mailing list archives

From "Zhiyuan Yang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5436) Race in AsyncDispatcher can cause random test failures in Tez(probably YARN also )
Date Thu, 28 Jul 2016 16:12:20 GMT

    [ https://issues.apache.org/jira/browse/YARN-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15397739#comment-15397739 ]

Zhiyuan Yang commented on YARN-5436:

[~rohithsharma] Thanks for reviewing! You are right in the sense that this patch mostly stops
DrainDispatcher from reusing AsyncDispatcher's drained field, but the fix for YARN-2991 is still

bq. does small tiny race is causing TEZ test failures?
Yes. In Tez unit tests, dispatcher.await() returned before all events had been handled,
so the assertion after dispatcher.await() failed. This race only happens when the queue
is almost empty, which is exactly the case in Tez unit tests.

bq. If so would it be good to fix in AsyncDispatcher rather adding full duplicate code. 
The root cause of the race is that we cannot guarantee that enqueuing an event and updating
drained happen atomically. I didn't find a way to fix this without adding more synchronization,
which is a very expensive fix for minimal benefit. YARN-3878 discussed this race and decided
to ignore it for the same reason.
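
To make the interleaving concrete, here is a minimal sketch that replays it on a single thread.
It is a simplified model, not the Hadoop code: the class and method names are hypothetical, and
it assumes the producer clears drained and enqueues in two separate steps while the dispatcher
thread recomputes drained from whether the queue is empty.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical model of the race; these are NOT the real AsyncDispatcher/DrainDispatcher classes.
public class DrainedRaceSketch {
    static final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
    static volatile boolean drained = true;

    // Producer side, split into its two non-atomic steps (assumed ordering).
    static void producerMarkNotDrained() {
        drained = false;
    }

    static void producerEnqueue(String event) {
        eventQueue.add(event);
    }

    // Dispatcher loop bookkeeping before it blocks waiting for the next event (assumed).
    static void dispatcherRefreshDrained() {
        drained = eventQueue.isEmpty();
    }

    // In this model, await() returns as soon as drained is true.
    static boolean awaitWouldReturn() {
        return drained;
    }

    // Replays one bad interleaving deterministically on the calling thread.
    // Returns true if await() would return while an event is still unhandled.
    static boolean replayBadInterleaving() {
        eventQueue.clear();
        drained = true;

        producerMarkNotDrained();    // producer step 1: drained = false
        dispatcherRefreshDrained();  // dispatcher: queue is still empty, drained = true
        producerEnqueue("event-1");  // producer step 2: event is now queued

        return awaitWouldReturn() && !eventQueue.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println("await returns with pending event: " + replayBadInterleaving());
    }
}
```

The window only opens when the queue is empty at the moment the dispatcher refreshes the flag,
which matches the observation that the failures show up when the queue is almost empty.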

bq. How about adding additional check before adding into event queue to avoid a race?
While this may avoid enqueuing the last event, the race can still happen without invoking dispatcher.serviceStop().
In fact, the Tez unit tests never invoke dispatcher.serviceStop().

> Race in AsyncDispatcher can cause random test failures in Tez(probably YARN also )
> ----------------------------------------------------------------------------------
>                 Key: YARN-5436
>                 URL: https://issues.apache.org/jira/browse/YARN-5436
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Zhiyuan Yang
>            Assignee: Zhiyuan Yang
>         Attachments: YARN-5436.1.patch, YARN-5436.2.patch, YARN-5436.3.patch, YARN-5436.4.patch
> In YARN-2264, a race in DrainDispatcher was fixed. Unfortunately, the same race also exists in AsyncDispatcher
(this was found and ignored in YARN-3878 but never documented...). In YARN-2991, another DrainDispatcher
bug was fixed by letting DrainDispatcher reuse some AsyncDispatcher methods, because AsyncDispatcher
doesn't have that issue. However, this bypasses the YARN-2264 fix, and now a similar race reappears in
Tez unit tests (and probably in YARN unit tests as well).
