mxnet-dev mailing list archives

From Marco de Abreu <marco.g.ab...@googlemail.com>
Subject Re: Problem with Jenkins GPU instances?
Date Fri, 04 May 2018 03:15:25 GMT
You're right, it seems like the Docker builds are hanging. I'm testing the
new auto scaling feature on the test environment [1] and I noticed that all
jobs hung at the exact same spot until 2:40AM German time. It seems like
some APT servers were having problems and since apt does not have a timeout
included, it hung the build instead of failing gracefully. It's 05:13AM now
and it seems like my test builds recovered. I'll check the production
environment and see if it's working fine over there as well. I'll give you
an update in here as soon as I know more details.

-Marco

[1]:
http://jenkins.mxnet-ci-dev.amazon-ml.com/job/incubator-mxnet/job/ci-master/
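
The failure mode described above (a build step that waits forever on an unresponsive server instead of failing) can be illustrated with a minimal sketch. It assumes the step is launched from Python; the helper name `run_with_timeout` and the stand-in "stalled" command are hypothetical and are not part of the MXNet CI scripts.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it exceeded timeout_s seconds."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising,
        # so the caller can fail the build instead of hanging.
        return None
    return completed.returncode


if __name__ == "__main__":
    # Hypothetical stand-in for a stalled package download: sleeps for 30 s.
    stalled = [sys.executable, "-c", "import time; time.sleep(30)"]
    code = run_with_timeout(stalled, timeout_s=1.0)
    print("timed out" if code is None else f"exit code {code}")
```

Note that the timeout only covers the direct child process; a step that spawns further processes would need its own cleanup.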

On Fri, May 4, 2018 at 2:59 AM, Jin, Hao <hjjn@amazon.com> wrote:

> Thanks for fixing the servers! However, I found that some of the builds are
> taking an extremely long time (not even starting after ~2 hrs):
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/18/pipeline/59
> Seems like they are stuck during the setup phase?
> Hao
>
> On 5/3/18, 2:44 PM, "Marco de Abreu" <marco.g.abreu@googlemail.com>
> wrote:
>
>     Alright, we're back up.
>
>     On Thu, May 3, 2018 at 10:47 PM, Marco de Abreu <
>     marco.g.abreu@googlemail.com> wrote:
>
>     > Seems like the CI will be down until some other people turn off their
>     > instances...
>     >
>     > Error
>     > We currently do not have sufficient g3.8xlarge capacity in zones with
>     > support for 'gp2' volumes. Our system will be working on provisioning
>     > additional capacity.
>     >
>     > -Marco
>     >
>     >
>     > On Thu, May 3, 2018 at 9:40 PM, Jin, Hao <hjjn@amazon.com> wrote:
>     >
>     >> Thanks a lot Marco!
>     >> Hao
>     >>
>     >> On 5/3/18, 12:02 PM, "Marco de Abreu" <marco.g.abreu@googlemail.com>
>     >> wrote:
>     >>
>     >>     Hello,
>     >>
>     >>     I'm already investigating the issue and it seems to be related to the
>     >>     recently introduced KVStore tests. They tend to hang, leading to the job
>     >>     being forcefully terminated by Jenkins. The problem here is that this does
>     >>     not terminate the underlying Docker containers, leading to unreleased
>     >>     resources.
>     >>
>     >>     As an immediate solution, I will restart all slaves to ensure the CI is
>     >>     running again. After that, I will try to find a solution to detect and
>     >>     release these containers.
>     >>
>     >>     Best regards,
>     >>     Marco
>     >>
>     >>     On Thu, May 3, 2018 at 8:55 PM, Jin, Hao <hjjn@amazon.com> wrote:
>     >>
>     >>     > I’ve encountered 2 failed GPU builds due to “initialization error: driver
>     >>     > error: failed to process request”. The links to the failed builds are:
>     >>     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/17/pipeline/674
>     >>     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10533/18/pipeline
>     >>     >
>     >>     >
>     >>
>     >>
>     >>
>     >
>
>
>
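
The earliest message in the thread mentions wanting a way to detect and release Docker containers left behind when Jenkins forcefully terminates a job. One possible approach, sketched below under stated assumptions, is to treat any container running longer than a cutoff as leaked. The three-hour cutoff, the function names, and the age-based rule itself are illustrative assumptions, not the solution that was adopted for the MXNet CI. The sketch relies only on the `docker ps -q`, `docker inspect --format`, and `docker rm -f` commands.

```python
import subprocess
from datetime import datetime, timedelta, timezone


def parse_started_at(value):
    """Parse a Docker start timestamp such as 2018-05-03T19:02:11.123456789Z."""
    # Keep only the whole-second part; the value is assumed to be in UTC.
    parsed = datetime.strptime(value[:19], "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def stale_ids(started_by_id, now, max_age):
    """Return the IDs of containers that have been running longer than max_age."""
    return sorted(cid for cid, started in started_by_id.items()
                  if now - started > max_age)


def running_containers():
    """Map each running container ID to its start time, using the docker CLI."""
    listing = subprocess.run(["docker", "ps", "-q"], check=True,
                             capture_output=True, text=True)
    started_by_id = {}
    for cid in listing.stdout.split():
        inspected = subprocess.run(
            ["docker", "inspect", "--format", "{{.State.StartedAt}}", cid],
            check=True, capture_output=True, text=True)
        started_by_id[cid] = parse_started_at(inspected.stdout.strip())
    return started_by_id


def release_stale(max_age=timedelta(hours=3)):
    """Force-remove every container older than max_age (hypothetical cutoff)."""
    now = datetime.now(timezone.utc)
    for cid in stale_ids(running_containers(), now, max_age):
        subprocess.run(["docker", "rm", "-f", cid], check=True)
```

An age cutoff alone would also remove a legitimately long-running job, so a real cleanup would likely need an additional signal, such as whether the Jenkins job that started the container is still alive.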
