reef-dev mailing list archives

From "Julia (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1850) IMRU test fails on Yarn with 800+ nodes
Date Thu, 10 Aug 2017 06:58:00 GMT

    [ https://issues.apache.org/jira/browse/REEF-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121168#comment-16121168 ]

Julia commented on REEF-1850:
-----------------------------

More tests show that the failure is transient: sometimes the test still passes with 1000 nodes.
So a passing test doesn't mean there is no issue, but a failing test shows there must be one.
The common behavior found so far is that the driver does not receive all of the completed context
events. It usually misses 1 or 2 out of 1000.

> IMRU test fails on Yarn with 800+ nodes
> ---------------------------------------
>
>                 Key: REEF-1850
>                 URL: https://issues.apache.org/jira/browse/REEF-1850
>             Project: REEF
>          Issue Type: Bug
>          Components: IMRU
>            Reporter: Taegeon Um
>
> [~juliaw] found that the IMRU test fails on Yarn with 800+ nodes.
> With 500 nodes, the test passes.
> With 1000 nodes, the test fails. Received 1000 completed tasks but only 998 completed evaluators.
> The driver doesn't shut down until it is killed.
> With 800 nodes, the test fails. Received 800 completed tasks but only 799 completed evaluators.
> The driver doesn't shut down until it is killed.
> We need to investigate this scalability issue and find the root cause.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
