hadoop-yarn-issues mailing list archives

From "Keqiu Hu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-9294) Potential race condition in setting GPU cgroups & execute command in the selected cgroup
Date Wed, 13 Feb 2019 07:47:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-9294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766891#comment-16766891 ]

Keqiu Hu commented on YARN-9294:

Confirmed: this is a race condition between creating the cgroups and executing the command
in those cgroups. We plan to add a safety check between these two privileged operations.
Note that the same issue should apply to 3.1+ as well. cc [~wangda] [~tangzhankun]
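
For illustration only, one possible shape for such a safety check is to wait until the
container's cgroup is visible before issuing the second privileged operation. This is a
minimal sketch, not the actual patch: the class and method names are hypothetical, the
cgroup directory is passed in by the caller, and it assumes a cgroup v1 layout in which
every cgroup directory contains a "tasks" file.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

// Hypothetical helper, not part of YARN: blocks until a cgroup directory is ready.
final class CgroupReadyCheck {

    /**
     * Polls until cgroupDir exists and contains a "tasks" file (cgroup v1 assumption),
     * or until the timeout elapses. Returns true if the cgroup became visible in time.
     */
    static boolean awaitCgroupReady(Path cgroupDir, Duration timeout, Duration pollInterval)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (Files.isDirectory(cgroupDir) && Files.exists(cgroupDir.resolve("tasks"))) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // caller should fail the launch instead of proceeding unisolated
            }
            Thread.sleep(pollInterval.toMillis());
        }
    }
}
```

A caller would run this between the GPU cgroup setup and the container launch, and abort
the launch when it returns false rather than start a process that is not yet isolated.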

> Potential race condition in setting GPU cgroups & execute command in the selected cgroup
> ----------------------------------------------------------------------------------------
>                 Key: YARN-9294
>                 URL: https://issues.apache.org/jira/browse/YARN-9294
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.10.0
>            Reporter: Keqiu Hu
>            Assignee: Keqiu Hu
>            Priority: Critical
> Environment is latest branch-2 head
> OS: RHEL 7.4
> *Observation*
> Out of ~10 container allocations with a GPU requirement, at least 1 of the allocated
> containers would lose GPU isolation. Even if I asked for 1 GPU, I could still see all
> GPUs on the same machine when running nvidia-smi.
> The funny thing is that although all GPUs (say ordinals 0,1,2,3) are visible at the
> moment container-executor runs, cgroups restrict the process's access to only the
> single allocated GPU some time later.
> The underlying process trying to access the GPU takes the initial information as the
> source of truth and tries to access physical GPU 0, which is not actually available to
> the process. This results in a [CUDA_ERROR_INVALID_DEVICE: invalid device ordinal] error.
> Validated that the container-executor commands are correct:
> {code:java}
> PrivilegedOperationExecutor command: [/export/apps/hadoop/nodemanager/latest/bin/container-executor,
> --module-gpu, --container_id, container_e22_1549663278916_0249_01_000001, --excluded_gpus,
> PrivilegedOperationExecutor command:
> [/export/apps/hadoop/nodemanager/latest/bin/container-executor, khu, khu, 0, application_1549663278916_0249,
> /grid/a/tmp/yarn/nmPrivate/container_e22_1549663278916_0249_01_000001.tokens, /grid/a/tmp/yarn,
> /grid/a/tmp/userlogs, /export/apps/jdk/JDK-1_8_0_172/jre/bin/java, -classpath, ..., -Xmx256m,
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer, khu,
> application_1549663278916_0249, container_e22_1549663278916_0249_01_000001, ltx1-hcl7552.grid.linkedin.com,
> 8040, /grid/a/tmp/yarn]
> {code}
> So most likely a race condition between these two operations? 
> cc [~jhung]
> Another potential theory is that the cgroups creation for the container actually failed
> but the error was swallowed silently.
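
To rule out the "swallowed error" theory from the description above, the exit status of
each privileged command can be checked and surfaced explicitly. The following is a minimal
sketch using only java.lang.ProcessBuilder; it is not the NodeManager's
PrivilegedOperationExecutor code, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of YARN: runs a command and never hides a failure.
final class CheckedCommand {

    /** Runs the command, returns its combined output, and throws on a non-zero exit code. */
    static String runOrThrow(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true) // merge stderr into stdout so nothing is lost
                .start();

        // Drain the output fully before waiting, so the child cannot block on a full pipe.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        }
        String output = new String(buffer.toByteArray(), StandardCharsets.UTF_8);

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            // Fail loudly instead of continuing with a possibly missing cgroup.
            throw new IOException("Command " + command + " exited with " + exitCode + ": " + output);
        }
        return output;
    }
}
```

If the cgroup setup command were wrapped this way, a failed creation would surface as an
exception in the log before the container launch is attempted.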

This message was sent by Atlassian JIRA

