reef-dev mailing list archives

From "Julia (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (REEF-1895) REEF Bridge performance improvement for allocated evaluators
Date Tue, 10 Oct 2017 00:20:00 GMT

     [ https://issues.apache.org/jira/browse/REEF-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julia updated REEF-1895:
------------------------
    Description: 
Recent scale tests show that a few places in the REEF code, mainly in the bridge code,
seriously impact REEF performance and scalability. Notably:

- synchronized(this) in the BridgeDriver event handlers, especially the Allocated Evaluator
handlers. This forces the events to be handled one at a time. When requesting a few thousand
evaluators, the slowdown is dramatic.
- A lock on Evaluators when receiving an allocated evaluator in the bridge, which increases
the execution time to the order of minutes. Moreover, the matching logic in this code is not
used at all.
- Some values that could be reused are recomputed for each evaluator, especially those that
require cross-bridge calls. When the number of evaluators reaches a few thousand, the time
spent becomes significant.
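
As a rough illustration of the first and third points, the sketch below shows a handler that
does not synchronize on the driver object and that computes an evaluator-independent value
once instead of once per event. It is not the actual BridgeDriver code: the class
AllocatedEvaluatorSketch, its methods, and the "shared-config" value are hypothetical names
introduced only for this example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a driver-side handler; NOT the real REEF BridgeDriver.
final class AllocatedEvaluatorSketch {
  // Counts how often the "expensive" value is computed.
  private final AtomicInteger configComputations = new AtomicInteger();

  // A concurrent map replaces a plain map guarded by synchronized(this).
  private final ConcurrentMap<String, String> launched = new ConcurrentHashMap<>();

  // Evaluator-independent value: computed once here, not once per event.
  private final String sharedConfig = computeSharedConfig();

  private String computeSharedConfig() {
    configComputations.incrementAndGet();
    return "shared-config"; // placeholder for a costly cross-bridge call (assumption)
  }

  // No synchronized(this): handlers for different evaluators may run concurrently.
  void onAllocated(final String evaluatorId) {
    launched.put(evaluatorId, sharedConfig);
  }

  int launchedCount() {
    return launched.size();
  }

  int configComputations() {
    return configComputations.get();
  }

  // Fires `evaluators` allocation events from a pool of `threads` worker threads.
  static AllocatedEvaluatorSketch runDemo(final int evaluators, final int threads) {
    final AllocatedEvaluatorSketch driver = new AllocatedEvaluatorSketch();
    final ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int i = 0; i < evaluators; i++) {
      final String id = "evaluator-" + i;
      pool.submit(() -> driver.onAllocated(id));
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("handlers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return driver;
  }

  public static void main(final String[] args) {
    final AllocatedEvaluatorSketch driver = runDemo(2000, 8);
    System.out.println(driver.launchedCount() + " events handled, shared value computed "
        + driver.configComputations() + " time(s)");
  }
}
```

Whether a concurrent map is the right replacement in the real code depends on what the
existing locks actually protect; the sketch only shows the general shape of the change.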

After an evaluator is allocated, if YARN doesn't receive a launch command within the timeout,
it reports a failed evaluator. With the current code, we cannot even launch two thousand
containers from the .NET side before the timeout.
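
A back-of-the-envelope sketch of why serialized handlers run into this timeout: if all
evaluators are allocated at about the same time, the last handler only finishes after every
earlier one has run. The per-event cost and the timeout below are made-up values chosen for
illustration, not measurements from the scale tests; only the figure of two thousand
evaluators comes from the description above.

```java
// Illustrative arithmetic only; LaunchBudgetSketch is a hypothetical helper.
final class LaunchBudgetSketch {
  // Time (ms) until the last handler finishes when `workers` handlers run at a
  // time and each one takes `perEventMillis`.
  static long lastHandlerFinishMillis(final int evaluators, final long perEventMillis,
                                      final int workers) {
    final long rounds = (evaluators + workers - 1) / workers; // ceiling division
    return rounds * perEventMillis;
  }

  public static void main(final String[] args) {
    final int evaluators = 2000;         // from the issue description
    final long perEventMillis = 500L;    // assumed handler cost, not measured
    final long timeoutMillis = 600_000L; // assumed launch timeout, not a YARN default

    final long sequential = lastHandlerFinishMillis(evaluators, perEventMillis, 1);
    final long concurrent = lastHandlerFinishMillis(evaluators, perEventMillis, 8);

    System.out.println("sequential: " + sequential + " ms, within timeout: "
        + (sequential <= timeoutMillis));
    System.out.println("8 workers:  " + concurrent + " ms, within timeout: "
        + (concurrent <= timeoutMillis));
  }
}
```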

This JIRA is to improve the handling of allocated evaluators in order to increase scalability.





> REEF Bridge performance improvement for allocated evaluators
> ------------------------------------------------------------
>
>                 Key: REEF-1895
>                 URL: https://issues.apache.org/jira/browse/REEF-1895
>             Project: REEF
>          Issue Type: Improvement
>          Components: REEF, REEF Bridge
>            Reporter: Julia
>            Assignee: Julia
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
