mxnet-dev mailing list archives

From Marco de Abreu <marco.g.ab...@googlemail.com>
Subject Re: Problem with Jenkins GPU instances?
Date Fri, 04 May 2018 04:21:46 GMT
Sorry for the inconvenience. If there are any further issues, please let me
know.

Best regards,
Marco

On Fri, May 4, 2018 at 6:21 AM, Marco de Abreu <marco.g.abreu@googlemail.com> wrote:

> Great, http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10533/22/
> seems to be passing without problems.
>
> On Fri, May 4, 2018 at 6:07 AM, Jin, Hao <hjjn@amazon.com> wrote:
>
>> The builds are running now, thanks!
>>
>> On 5/3/18, 8:16 PM, "Marco de Abreu" <marco.g.abreu@googlemail.com>
>> wrote:
>>
>>     You're right, it seems like the Docker builds are hanging. I'm testing the
>>     new auto scaling feature on the test environment [1] and I noticed that all
>>     jobs hung at the exact same spot until 2:40AM German time. It seems like
>>     some APT servers were having problems and since apt does not have a timeout
>>     included, it hung the build instead of failing gracefully. It's 05:13AM now
>>     and it seems like my test builds recovered. I'll check the production
>>     environment and see if it's working fine over there as well. I'll give you
>>     an update in here as soon as I know more details.
>>
>>     -Marco
>>
>>     [1]: http://jenkins.mxnet-ci-dev.amazon-ml.com/job/incubator-mxnet/job/ci-master/
>>
>>     On Fri, May 4, 2018 at 2:59 AM, Jin, Hao <hjjn@amazon.com> wrote:
>>
>>     > Thanks for fixing the servers! However, I found that some of the builds are
>>     > taking an extremely long time (not even starting after ~2 hrs):
>>     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/18/pipeline/59
>>     > Seems like they are stuck during the setup phase?
>>     > Hao
>>     >
>>     > On 5/3/18, 2:44 PM, "Marco de Abreu" <marco.g.abreu@googlemail.com>
>>     > wrote:
>>     >
>>     >     Alright, we're back up.
>>     >
>>     >     On Thu, May 3, 2018 at 10:47 PM, Marco de Abreu <
>>     >     marco.g.abreu@googlemail.com> wrote:
>>     >
>>     >     > Seems like the CI will be down until some other people turn off their
>>     >     > instances...
>>     >     >
>>     >     > Error
>>     >     > We currently do not have sufficient g3.8xlarge capacity in zones with
>>     >     > support for 'gp2' volumes. Our system will be working on provisioning
>>     >     > additional capacity.
>>     >     >
>>     >     > -Marco
>>     >     >
>>     >     >
>>     >     > On Thu, May 3, 2018 at 9:40 PM, Jin, Hao <hjjn@amazon.com> wrote:
>>     >     >
>>     >     >> Thanks a lot Marco!
>>     >     >> Hao
>>     >     >>
>>     >     >> On 5/3/18, 12:02 PM, "Marco de Abreu" <marco.g.abreu@googlemail.com> wrote:
>>     >     >>
>>     >     >>     Hello,
>>     >     >>
>>     >     >>     I'm already investigating the issue and it seems to be related to the
>>     >     >>     recently introduced KVStore tests. They tend to hang, leading to the job
>>     >     >>     being forcefully terminated by Jenkins. The problem here is that this does
>>     >     >>     not terminate the underlying Docker containers, leading to unreleased
>>     >     >>     resources.
>>     >     >>
>>     >     >>     As an immediate solution, I will restart all slaves to ensure the CI is
>>     >     >>     running again. After that, I will try to find a solution to detect and
>>     >     >>     release these containers.
>>     >     >>
>>     >     >>     Best regards,
>>     >     >>     Marco
>>     >     >>
>>     >     >>     On Thu, May 3, 2018 at 8:55 PM, Jin, Hao <hjjn@amazon.com> wrote:
>>     >     >>
>>     >     >>     > I’ve encountered 2 failed GPU builds due to “initialization error: driver
>>     >     >>     > error: failed to process request”, the links to the failed builds are:
>>     >     >>     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/17/pipeline/674
>>     >     >>     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10533/18/pipeline
>>     >     >>     >
>>     >     >>     >
>>     >     >>
>>     >     >>
>>     >     >>
>>     >     >
>>     >
>>     >
>>     >
>>
>>
>>
>
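
The thread attributes the hung Docker builds to apt waiting indefinitely on unresponsive servers. One possible mitigation is to run long-lived setup commands under a hard wall-clock limit so the build fails instead of hanging. This is only a minimal sketch, not the fix actually applied to the MXNet CI: the helper name `run_with_timeout` and the exit code 124 (borrowed from the GNU `timeout` convention) are assumptions made for illustration.

```python
import subprocess
import sys

# Exit code reported on timeout; mirrors GNU `timeout`, chosen for illustration only.
TIMEOUT_EXIT_CODE = 124


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd and return its exit code, or TIMEOUT_EXIT_CODE if it overruns."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_seconds)
        return completed.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        print(f"command timed out after {timeout_seconds}s: {cmd}", file=sys.stderr)
        return TIMEOUT_EXIT_CODE
```

A build step could then call, for example, `run_with_timeout(["apt-get", "update"], 600)` and treat any non-zero result as a failed build.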

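The earliest message in the thread notes that Jenkins aborting a hung job does not stop the underlying Docker containers, and that a way to detect and release them was still to be found. A rough sketch of one such approach, under stated assumptions, is to kill any running container older than a fixed age. The three-hour threshold and all function names below are hypothetical; only the `docker ps -q`, `docker inspect --format` and `docker kill` commands are standard Docker CLI calls.

```python
import subprocess
from datetime import datetime, timedelta, timezone


def parse_created(created):
    """Parse the UTC timestamp printed by `docker inspect`, dropping fractional seconds."""
    return datetime.strptime(created[:19], "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)


def stale_container_ids(containers, now, max_age):
    """Return IDs of containers created more than max_age before now."""
    return [cid for cid, created in containers if now - parse_created(created) > max_age]


def list_running_containers():
    """Collect (id, created) pairs for all running containers via the docker CLI."""
    ids = subprocess.run(
        ["docker", "ps", "-q"], capture_output=True, text=True, check=True
    ).stdout.split()
    containers = []
    for cid in ids:
        created = subprocess.run(
            ["docker", "inspect", "--format", "{{.Created}}", cid],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        containers.append((cid, created))
    return containers


def kill_stale_containers(max_age=timedelta(hours=3)):
    """Kill every running container older than max_age (assumed threshold)."""
    now = datetime.now(timezone.utc)
    for cid in stale_container_ids(list_running_containers(), now, max_age):
        subprocess.run(["docker", "kill", cid], check=True)
```

Keeping the age check in `stale_container_ids` separate from the CLI calls means the selection logic can be verified without a Docker daemon. An age-based rule would also kill legitimately long-running jobs, so the threshold would need to exceed the longest expected build.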