hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-8423) GPU does not get released even though the application gets killed.
Date Thu, 14 Jun 2018 19:04:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512889#comment-16512889
] 

Wangda Tan commented on YARN-8423:
----------------------------------

Thanks [~shanekumpf@gmail.com], I saw this error happens outside of Docker in Docker, it may
cause by the same issue. We need the proposed workaround fix in any case to make code to be
more robust.

> GPU does not get released even though the application gets killed.
> ------------------------------------------------------------------
>
>                 Key: YARN-8423
>                 URL: https://issues.apache.org/jira/browse/YARN-8423
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Sumana Sathish
>            Assignee: Sunil Govindan
>            Priority: Critical
>         Attachments: YARN-8423.001.patch, kill-container-nm.log
>
>
> Run an Tensor flow app requesting one GPU.
> Kill the application once the GPU is allocated
> Query the nodemanger once the application is killed.We see that GPU is not being released.
> {code}
>  curl -i <NM>/ws/v1/node/resources/yarn.io%2Fgpu
> {"gpuDeviceInformation":{"gpus":[{"productName":"<productName>","uuid":"GPU-<UID>","minorNumber":0,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}},{"productName":"<productName>","uuid":"GPU-<UID>","minorNumber":1,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}}],"driverVersion":"<version>"},"totalGpuDevices":[{"index":0,"minorNumber":0},{"index":1,"minorNumber":1}],"assignedGpuDevices":[{"index":0,"minorNumber":0,"containerId":"container_<containerID>"}]}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org


Mime
View raw message