singa-dev mailing list archives

From "Ngin Yun Chuan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SINGA-435) Rafiki--Can't create a train job with 'ENABLE_GPU'
Date Mon, 25 Mar 2019 10:52:00 GMT

    [ https://issues.apache.org/jira/browse/SINGA-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16800572#comment-16800572 ]

Ngin Yun Chuan commented on SINGA-435:
--------------------------------------

Hi Liu Hui,

Have you followed the instructions at https://nginyc.github.io/rafiki/docs/latest/docs/src/dev/setup.html#scaling-rafiki?
Specifically, remember to do step 7, which adds a "GPU" tag to a node.
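
To illustrate why a missing tag makes a GPU check come back False, here is a minimal sketch of that kind of check. It is not the actual code in docker_swarm.py: the function name any_node_has_gpu, the label key "gpu", and the rule that a non-numeric value counts as a tag are all assumptions for illustration. The input mimics the "Spec" / "Labels" layout of Docker's node inspect output.

```python
from typing import Any, Dict, Iterable

# Assumed label key, for illustration only; use whatever the setup guide specifies.
GPU_LABEL = "gpu"


def any_node_has_gpu(nodes: Iterable[Dict[str, Any]], label: str = GPU_LABEL) -> bool:
    """Return True if at least one node's labels mark it as having a GPU.

    Each item in `nodes` is a dict shaped like Docker's node inspect output,
    where user-defined labels live under Spec -> Labels.
    """
    for attrs in nodes:
        labels = (attrs.get("Spec") or {}).get("Labels") or {}
        value = labels.get(label)
        if value is None:
            # Node was never tagged, so it cannot count as a GPU node.
            continue
        try:
            if int(value) > 0:
                return True
        except ValueError:
            # Non-numeric value: treat the mere presence of the label as a tag.
            return True
    return False


# No node carries the label, so the check fails, as in the reported issue.
untagged = [{"Spec": {"Labels": {}}}, {"Spec": {}}]
# After tagging one node, the same check succeeds.
tagged = untagged + [{"Spec": {"Labels": {"gpu": "1"}}}]

print(any_node_has_gpu(untagged))
print(any_node_has_gpu(tagged))
```

With the Docker SDK for Python, the node dicts could probably be obtained with something like [n.attrs for n in docker.from_env().nodes.list()] on a swarm manager. If no node has been tagged yet, a check of this shape has nothing to match and reports that no GPU is available.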

Let me know if there are any other issues!

Yun Chuan


> Rafiki--Can't create a train job with 'ENABLE_GPU'
> --------------------------------------------------
>
>                 Key: SINGA-435
>                 URL: https://issues.apache.org/jira/browse/SINGA-435
>             Project: Singa
>          Issue Type: Bug
>            Reporter: Liu Hui
>            Priority: Major
>         Attachments: rafiki_admin001.png
>
>
> >>https://nginyc.github.io/rafiki/docs/latest/docs/src/user/quickstart.html
> I followed the quickstart and tried to create a train job that uses a GPU.
> So I changed the parameters to "budget={'ENABLE_GPU': 1, 'MODEL_TRIAL_COUNT': 2}" when creating the train job.
> But the rafiki_worker container didn't start.
> I entered the rafiki_admin container and found an error in the log file.
> Finally, I found that in rafiki/rafiki/container/docker_swarm.py, the function _if_any_node_has_gpu always returns False.
> What should I do to train with a GPU in a container? Which steps have I missed: setting up Docker's environment, or something else?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
