mxnet-dev mailing list archives

From Marco de Abreu <marco.g.ab...@googlemail.com>
Subject Re: Problem with Jenkins GPU instances?
Date Fri, 04 May 2018 04:21:05 GMT
Great,
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10533/22/
seems to be passing without problems.

On Fri, May 4, 2018 at 6:07 AM, Jin, Hao <hjjn@amazon.com> wrote:

> The builds are running now, thanks!
>
> On 5/3/18, 8:16 PM, "Marco de Abreu" <marco.g.abreu@googlemail.com>
> wrote:
>
>     You're right, it seems like the Docker builds are hanging. I'm testing the
>     new auto scaling feature on the test environment [1] and I noticed that all
>     jobs hung at the exact same spot until 2:40AM German time. It seems like
>     some APT servers were having problems and since apt does not have a timeout
>     included, it hung the build instead of failing gracefully. It's 05:13AM now
>     and it seems like my test builds recovered. I'll check the production
>     environment and see if it's working fine over there as well. I'll give you
>     an update in here as soon as I know more details.
>
>     -Marco
>
>     [1]: http://jenkins.mxnet-ci-dev.amazon-ml.com/job/incubator-mxnet/job/ci-master/
>
>     On Fri, May 4, 2018 at 2:59 AM, Jin, Hao <hjjn@amazon.com> wrote:
>
>     > Thanks for fixing the servers! However, I found that some of the builds are
>     > taking an extremely long time (not even starting after ~2 hrs):
>     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/18/pipeline/59
>     > Seems like they are stuck during the setup phase?
>     > Hao
>     >
>     > On 5/3/18, 2:44 PM, "Marco de Abreu" <marco.g.abreu@googlemail.com>
>     > wrote:
>     >
>     >     Alright, we're back up.
>     >
>     >     On Thu, May 3, 2018 at 10:47 PM, Marco de Abreu <
>     >     marco.g.abreu@googlemail.com> wrote:
>     >
>     >     > Seems like the CI will be down until some other people turn off their
>     >     > instances...
>     >     >
>     >     > Error
>     >     > We currently do not have sufficient g3.8xlarge capacity in zones with
>     >     > support for 'gp2' volumes. Our system will be working on provisioning
>     >     > additional capacity.
>     >     >
>     >     > -Marco
>     >     >
>     >     >
>     >     > On Thu, May 3, 2018 at 9:40 PM, Jin, Hao <hjjn@amazon.com> wrote:
>     >     >
>     >     >> Thanks a lot Marco!
>     >     >> Hao
>     >     >>
>     >     >> On 5/3/18, 12:02 PM, "Marco de Abreu" <marco.g.abreu@googlemail.com> wrote:
>     >     >>
>     >     >>     Hello,
>     >     >>
>     >     >>     I'm already investigating the issue and it seems to be related to the
>     >     >>     recently introduced KVStore tests. They tend to hang, leading to jobs
>     >     >>     being forcefully terminated by Jenkins. The problem here is that this
>     >     >>     does not terminate the underlying Docker containers, leading to
>     >     >>     unreleased resources.
>     >     >>
>     >     >>     As an immediate solution, I will restart all slaves to ensure the CI
>     >     >>     is running again. After that, I will try to find a solution to detect
>     >     >>     and release these containers.
>     >     >>
>     >     >>     Best regards,
>     >     >>     Marco
>     >     >>
>     >     >>     On Thu, May 3, 2018 at 8:55 PM, Jin, Hao <hjjn@amazon.com> wrote:
>     >     >>
>     >     >>     > I’ve encountered 2 failed GPU builds due to “initialization error:
>     >     >>     > driver error: failed to process request”, the links to the failed
>     >     >>     > builds are:
>     >     >>     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/17/pipeline/674
>     >     >>     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10533/18/pipeline
>     >     >>     >
>     >     >>     >
>     >     >>
>     >     >>
>     >     >>
>     >     >
>     >
>     >
>     >
>
>
>
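The thread attributes the hanging Docker builds to an apt call that never returned. One general way to make such a step fail instead of hang is to run it under a hard time limit. The following is only a minimal Python sketch of that idea, not what the MXNet CI actually does: the helper name `run_with_timeout` and the 300-second limit in the usage comment are assumptions made for illustration.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it ran longer than timeout_s.

    Hypothetical helper for illustration; not part of the MXNet CI scripts.
    """
    try:
        # On timeout, subprocess.run kills the child process it started,
        # waits for it, and then raises TimeoutExpired.
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


# Possible use in a build step (the command and limit are assumed values):
#
#     code = run_with_timeout(["apt-get", "update"], timeout_s=300)
#     if code is None:
#         raise SystemExit("apt-get timed out; failing the build")
```

Note that only the direct child is killed on timeout; a command that spawns its own long-lived children would need additional handling.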

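The earliest messages in the thread describe Docker containers that keep running after Jenkins forcefully terminates a job, and mention the goal of detecting and releasing them. One possible approach, sketched below in Python under stated assumptions, is to treat any container running longer than some threshold as stale and force-remove it. The function names and the three-hour threshold are hypothetical, and the sketch assumes `docker inspect` reports `StartedAt` as a UTC timestamp such as `2018-05-03T19:02:11.123456789Z`. It is not the solution the CI maintainers adopted.

```python
import subprocess
from datetime import datetime, timedelta, timezone


def parse_started_at(value):
    """Parse a StartedAt timestamp, ignoring fractional seconds (assumes UTC)."""
    parsed = datetime.strptime(value[:19], "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def select_stale(started, now, max_age):
    """Return the sorted IDs of containers running for longer than max_age.

    started maps container ID to a timezone-aware start time.
    """
    return sorted(cid for cid, ts in started.items() if now - ts > max_age)


def running_containers():
    """Map each running container ID to its start time using the docker CLI."""
    ids = subprocess.run(
        ["docker", "ps", "-q"], check=True, capture_output=True, text=True
    ).stdout.split()
    started = {}
    for cid in ids:
        out = subprocess.run(
            ["docker", "inspect", "--format", "{{.State.StartedAt}}", cid],
            check=True, capture_output=True, text=True,
        ).stdout.strip()
        started[cid] = parse_started_at(out)
    return started


def remove_stale(max_age=timedelta(hours=3)):
    """Force-remove containers older than max_age (an assumed threshold)."""
    now = datetime.now(timezone.utc)
    for cid in select_stale(running_containers(), now, max_age):
        subprocess.run(["docker", "rm", "-f", cid], check=True)
```

Keeping the age check in `select_stale` separate from the Docker calls means the selection logic can be exercised without a Docker daemon. An age threshold is a blunt criterion: it would also remove legitimately long-running jobs, so the limit would need to exceed the longest expected build.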