mxnet-dev mailing list archives

From "Jin, Hao" <h...@amazon.com>
Subject Re: Problem with Jenkins GPU instances?
Date Fri, 04 May 2018 00:59:25 GMT
Thanks for fixing the servers! However, I found that some of the builds are taking an
extremely long time (not even starting after ~2 hrs):
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/18/pipeline/59
Seems like they are stuck during the setup phase?
Hao

On 5/3/18, 2:44 PM, "Marco de Abreu" <marco.g.abreu@googlemail.com> wrote:

    Alright, we're back up.
    
    On Thu, May 3, 2018 at 10:47 PM, Marco de Abreu <marco.g.abreu@googlemail.com> wrote:
    
    > Seems like the CI will be down until some other people turn off their
    > instances...
    >
    > Error
    > We currently do not have sufficient g3.8xlarge capacity in zones with
    > support for 'gp2' volumes. Our system will be working on provisioning
    > additional capacity.
    >
    > -Marco
    >
    >
    > On Thu, May 3, 2018 at 9:40 PM, Jin, Hao <hjjn@amazon.com> wrote:
    >
    >> Thanks a lot Marco!
    >> Hao
    >>
    >> On 5/3/18, 12:02 PM, "Marco de Abreu" <marco.g.abreu@googlemail.com>
    >> wrote:
    >>
    >>     Hello,
    >>
    >>     I'm already investigating the issue and it seems to be related to
    >>     the recently introduced KVStore tests. They tend to hang, leading
    >>     to jobs being forcefully terminated by Jenkins. The problem here
    >>     is that this does not terminate the underlying Docker containers,
    >>     leading to unreleased resources.
    >>
    >>     As an immediate solution, I will restart all slaves to ensure the
    >>     CI is running again. After that, I will try to find a solution to
    >>     detect and release these containers.
    >>
    >>     Best regards,
    >>     Marco
    >>
    >>     On Thu, May 3, 2018 at 8:55 PM, Jin, Hao <hjjn@amazon.com> wrote:
    >>
    >>     > I’ve encountered 2 failed GPU builds due to “initialization error:
    >>     > driver error: failed to process request”. The links to the failed
    >>     > builds are:
    >>     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/17/pipeline/674
    >>     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10533/18/pipeline
    >>     >
    >>     >
    >>
    >>
    >>
    >
    

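The thread mentions wanting a way to detect containers that are left running after Jenkins aborts a hung job. One way this could be approached is sketched below. It is only a sketch under stated assumptions, not the fix that was deployed on the CI: the three-hour threshold, the function names, and the premise that any container running longer than the threshold is a leftover are all hypothetical. It relies only on the standard `docker ps -q` and `docker inspect --format` commands and assumes Docker prints start times in UTC with a trailing `Z`.

```python
import subprocess
from datetime import datetime, timedelta, timezone

# Assumption: no legitimate CI job runs longer than this.
MAX_AGE = timedelta(hours=3)


def parse_docker_time(stamp):
    """Parse a timestamp as printed by `docker inspect`, for example
    2018-05-03T19:02:11.123456789Z, dropping the fractional seconds."""
    base = stamp.strip().rstrip("Z").split(".")[0]
    return datetime.strptime(base, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)


def stale_containers(lines, now, max_age=MAX_AGE):
    """Return the IDs of containers started more than max_age before now.

    Each input line has the form <id><TAB><started-at>."""
    stale = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines in the command output
        container_id, started = line.split("\t", 1)
        if now - parse_docker_time(started) > max_age:
            stale.append(container_id)
    return stale


def list_running():
    """Ask the local Docker daemon for running containers and their start times."""
    ids = subprocess.run(
        ["docker", "ps", "-q"], capture_output=True, text=True, check=True
    ).stdout.split()
    if not ids:
        return []
    out = subprocess.run(
        ["docker", "inspect", "--format", "{{.Id}}\t{{.State.StartedAt}}", *ids],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()


def report():
    """Print the stale container IDs; a cleanup job could pass them to `docker kill`."""
    for container_id in stale_containers(list_running(), datetime.now(timezone.utc)):
        print("stale container:", container_id)
```

Calling `report()` periodically on each slave (for example from cron) would list the candidates. Whether to kill them automatically depends on how confident one is in the threshold.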