reef-dev mailing list archives

From "Taegeon Um (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (REEF-1850) IMRU test fails on Yarn with 800+ nodes
Date Sun, 06 Aug 2017 13:09:02 GMT

     [ https://issues.apache.org/jira/browse/REEF-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Taegeon Um updated REEF-1850:
-----------------------------
    Description: 
[~juliaw] found that the IMRU test fails on Yarn with 800+ nodes.

With 500 nodes, the test passes.
With 1000 nodes, the test fails. Received 1000 completed tasks but only 998 completed evaluators.
The Driver doesn’t shut down until it is killed.
With 800 nodes, the test fails. Received 800 completed tasks but only 799 completed evaluators.
The Driver doesn’t shut down until it is killed.

We need to investigate this scalability issue and find a root cause. 

  was:
From [~juliaw]'s experiments, we've found that IMRU test fails on Yarn with 800+ nodes.


With 500 nodes, the test passes.
With 1000 nodes, the test fails. Received 1000 completed tasks but only 998 completed evaluators.
The Driver doesn’t shut down until it is killed.
With 800 nodes, the test fails. Received 800 completed tasks but only 799 completed evaluators.
The Driver doesn’t shut down until it is killed.

We need to investigate this scalability issue and find a root cause. 


> IMRU test fails on Yarn with 800+ nodes
> ---------------------------------------
>
>                 Key: REEF-1850
>                 URL: https://issues.apache.org/jira/browse/REEF-1850
>             Project: REEF
>          Issue Type: Bug
>          Components: IMRU
>            Reporter: Taegeon Um
>
> [~juliaw] found that the IMRU test fails on Yarn with 800+ nodes.
> With 500 nodes, the test passes.
> With 1000 nodes, the test fails. Received 1000 completed tasks but only 998 completed evaluators.
> The Driver doesn’t shut down until it is killed.
> With 800 nodes, the test fails. Received 800 completed tasks but only 799 completed evaluators.
> The Driver doesn’t shut down until it is killed.
> We need to investigate this scalability issue and find a root cause.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
