From: Marco de Abreu
Date: Fri, 4 May 2018 06:21:05 +0200
Subject: Re: Problem with Jenkins GPU instances?
To: dev@mxnet.incubator.apache.org

Great,
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10533/22/
seems to be passing without problems.

On Fri, May 4, 2018 at 6:07 AM, Jin, Hao wrote:

> The builds are running now, thanks!
>
> On 5/3/18, 8:16 PM, "Marco de Abreu" wrote:
>
> > You're right, it seems like the Docker builds are hanging. I'm testing
> > the new auto scaling feature on the test environment [1] and I noticed
> > that all jobs hung at the exact same spot until 2:40AM German time. It
> > seems like some APT servers were having problems and since apt does not
> > have a timeout included, it hung the build instead of failing
> > gracefully. It's 05:13AM now and it seems like my test builds
> > recovered. I'll check the production environment and see if it's
> > working fine over there as well. I'll give you an update in here as
> > soon as I know more details.
> >
> > -Marco
> >
> > [1]:
> > http://jenkins.mxnet-ci-dev.amazon-ml.com/job/incubator-mxnet/job/ci-master/
> >
> > On Fri, May 4, 2018 at 2:59 AM, Jin, Hao wrote:
> >
> > > Thanks for fixing the servers! However I found that some of the
> > > builds are taking an extremely long time (not even starting after
> > > ~2 hrs):
> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/18/pipeline/59
> > > Seems like they are stuck during the setup phase?
> > > Hao
> > >
> > > On 5/3/18, 2:44 PM, "Marco de Abreu" wrote:
> > >
> > > > Alright, we're back up.
> > > >
> > > > On Thu, May 3, 2018 at 10:47 PM, Marco de Abreu
> > > > <marco.g.abreu@googlemail.com> wrote:
> > > >
> > > > > Seems like the CI will be down until some other people turn off
> > > > > their instances...
> > > > >
> > > > > Error
> > > > > We currently do not have sufficient g3.8xlarge capacity in zones
> > > > > with support for 'gp2' volumes. Our system will be working on
> > > > > provisioning additional capacity.
> > > > >
> > > > > -Marco
> > > > >
> > > > > On Thu, May 3, 2018 at 9:40 PM, Jin, Hao wrote:
> > > > >
> > > > > > Thanks a lot Marco!
> > > > > > Hao
> > > > > >
> > > > > > On 5/3/18, 12:02 PM, "Marco de Abreu"
> > > > > > <marco.g.abreu@googlemail.com> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I'm already investigating the issue and it seems to be
> > > > > > > related to the recently introduced KVStore tests. They tend
> > > > > > > to hang, leading to the job being forcefully terminated by
> > > > > > > Jenkins. The problem here is that this does not terminate
> > > > > > > the underlying Docker containers, leading to unreleased
> > > > > > > resources.
> > > > > > >
> > > > > > > As an immediate solution, I will restart all slaves to
> > > > > > > ensure the CI is running again. After that, I will try to
> > > > > > > find a solution to detect and release these containers.
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Marco
> > > > > > >
> > > > > > > On Thu, May 3, 2018 at 8:55 PM, Jin, Hao wrote:
> > > > > > >
> > > > > > > > I’ve encountered 2 failed GPU builds due to
> > > > > > > > “initialization error: driver error: failed to process
> > > > > > > > request”, the links to the failed builds are:
> > > > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/17/pipeline/674
> > > > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10533/18/pipeline