ignite-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-10133) ML: Switch to per-node TensorFlow worker strategy
Date Fri, 02 Nov 2018 15:25:00 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673261#comment-16673261 ]

ASF GitHub Bot commented on IGNITE-10133:

GitHub user dmitrievanthony opened a pull request:


    IGNITE-10133: Switch to per-node TensorFlow worker strategy.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gridgain/apache-ignite ignite-10133

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5249

commit 13962c2c13d1cf945cac90ce003831d0a4a4fd33
Author: Anton Dmitriev <dmitrievanthony@...>
Date:   2018-11-02T15:20:52Z

    IGNITE-10133: Switch to per-node TensorFlow worker strategy.


> ML: Switch to per-node TensorFlow worker strategy
> -------------------------------------------------
>                 Key: IGNITE-10133
>                 URL: https://issues.apache.org/jira/browse/IGNITE-10133
>             Project: Ignite
>          Issue Type: Improvement
>          Components: ml
>    Affects Versions: 2.8
>            Reporter: Anton Dmitriev
>            Assignee: Anton Dmitriev
>            Priority: Major
>             Fix For: 2.8
> Currently we start a TensorFlow worker process for every cache partition. If a node
> is equipped with a GPU and TensorFlow uses it, the worker process acquires all GPU memory.
> If two worker processes try to acquire all GPU memory, they will fail.
> To eliminate this problem and allow users to utilize the GPU during training, we need
> to switch to a per-node strategy. This means starting one TensorFlow worker process per
> node, not per partition.
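
As a rough illustration of the change described above (not the code from the pull request), the two strategies can be sketched as a planning step that maps cache partitions to worker processes. The function names, the partition-to-node map, and the node IDs below are all hypothetical and are not part of the Ignite API.

```python
from collections import defaultdict
from typing import Dict, List


def plan_workers_per_partition(partition_to_node: Dict[int, str]) -> List[dict]:
    """Old strategy: one worker process for every cache partition."""
    return [
        {"node": node, "partitions": [part]}
        for part, node in sorted(partition_to_node.items())
    ]


def plan_workers_per_node(partition_to_node: Dict[int, str]) -> List[dict]:
    """New strategy: one worker process per node, covering all of its partitions."""
    by_node = defaultdict(list)
    for part, node in sorted(partition_to_node.items()):
        by_node[node].append(part)
    return [
        {"node": node, "partitions": parts}
        for node, parts in sorted(by_node.items())
    ]


def max_workers_on_one_node(plan: List[dict]) -> int:
    """Largest number of worker processes that would share a single node (and its GPU)."""
    counts = defaultdict(int)
    for worker in plan:
        counts[worker["node"]] += 1
    return max(counts.values(), default=0)


# Hypothetical layout: 4 partitions spread over 2 nodes.
layout = {0: "node-a", 1: "node-a", 2: "node-b", 3: "node-a"}
old_plan = plan_workers_per_partition(layout)
new_plan = plan_workers_per_node(layout)
```

With this layout the per-partition plan puts three worker processes on `node-a`, which is the situation where several processes compete for the same GPU memory. The per-node plan never puts more than one worker on a node.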

This message was sent by Atlassian JIRA
