mxnet-dev mailing list archives

From Marco de Abreu <marco.g.ab...@googlemail.com>
Subject Re: Problem with Jenkins GPU instances?
Date Thu, 03 May 2018 21:42:39 GMT
Alright, we're back up.

On Thu, May 3, 2018 at 10:47 PM, Marco de Abreu <marco.g.abreu@googlemail.com> wrote:

> Seems like the CI will be down until some other people turn off their
> instances...
>
> Error
> We currently do not have sufficient g3.8xlarge capacity in zones with
> support for 'gp2' volumes. Our system will be working on provisioning
> additional capacity.
>
> -Marco
>
>
> On Thu, May 3, 2018 at 9:40 PM, Jin, Hao <hjjn@amazon.com> wrote:
>
>> Thanks a lot Marco!
>> Hao
>>
>> On 5/3/18, 12:02 PM, "Marco de Abreu" <marco.g.abreu@googlemail.com> wrote:
>>
>>     Hello,
>>
>>     I'm already investigating the issue, and it seems to be related to the
>>     recently introduced KVStore tests. They tend to hang, leading to the job
>>     being forcefully terminated by Jenkins. The problem here is that this
>>     does not terminate the underlying Docker containers, leading to
>>     unreleased resources.
>>
>>     As an immediate solution, I will restart all slaves to ensure the CI is
>>     running again. After that, I will try to find a solution to detect and
>>     release these containers.
>>
>>     Best regards,
>>     Marco
>>
>>     On Thu, May 3, 2018 at 8:55 PM, Jin, Hao <hjjn@amazon.com> wrote:
>>
>>     > I’ve encountered 2 failed GPU builds due to “initialization error:
>>     > driver error: failed to process request”. The links to the failed
>>     > builds are:
>>     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/17/pipeline/674
>>     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10533/18/pipeline
>>     >
>>     >
>>
>>
>>
>
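
For reference, one way the "detect and release these containers" step mentioned in the thread could be approached is a periodic cleanup script on each build host. The following is only a minimal sketch under stated assumptions, not the fix that was actually deployed: the three-hour threshold and all function names (`parse_created`, `select_stale`, `list_running`, `release_stale`) are hypothetical, and it assumes that any container older than the threshold belongs to an aborted job and that `docker inspect` reports creation times in UTC.

```python
import subprocess
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: treat any container older than this as orphaned.
MAX_AGE = timedelta(hours=3)


def parse_created(ts):
    """Parse a `docker inspect` creation timestamp such as
    2018-05-03T19:02:11.123456789Z (assumed to be UTC)."""
    # Drop fractional seconds and the trailing 'Z' before parsing.
    base = ts.split(".")[0].rstrip("Z")
    return datetime.strptime(base, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)


def select_stale(containers, now, max_age=MAX_AGE):
    """Return IDs of containers whose age exceeds max_age.

    `containers` is a list of (container_id, created_timestamp) pairs.
    """
    return [cid for cid, created in containers
            if now - parse_created(created) > max_age]


def list_running():
    """Collect (id, created) pairs for all running containers."""
    ids = subprocess.run(["docker", "ps", "-q"], capture_output=True,
                         text=True, check=True).stdout.split()
    pairs = []
    for cid in ids:
        created = subprocess.run(
            ["docker", "inspect", "--format", "{{.Created}}", cid],
            capture_output=True, text=True, check=True).stdout.strip()
        pairs.append((cid, created))
    return pairs


def release_stale():
    """Force-remove every running container older than MAX_AGE."""
    stale = select_stale(list_running(), datetime.now(timezone.utc))
    if stale:
        subprocess.run(["docker", "rm", "-f", *stale], check=True)
    return stale
```

Calling `release_stale()` from a scheduled job on each host would free resources held by containers that outlive their Jenkins job. An age-based cutoff is a blunt heuristic; it would also kill legitimately long-running builds, so the threshold would need to exceed the longest expected job duration.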
