From: kellen sunderland
Date: Thu, 4 Oct 2018 23:11:47 +0200
Subject: Re: CUDNN algorithm selection failure
To: dev@mxnet.incubator.apache.org

"I ran a similar test(test_slice_batchnorm) for 5K times and I couldn't
reproduce the issue."

One thing to keep in mind is that the SelectAlgo call caches its results
in a registry that has static scope. To repro you'd likely have to create
a new process each time you run the test. (Apologies if this is already
how you're reproducing.)

SelectAlgo call:
https://github.com/apache/incubator-mxnet/blob/403831ace46eab4447794df9411351e439e8983e/src/operator/nn/cudnn/cudnn_convolution-inl.h#L609

Static local / singleton registry pattern here:
https://github.com/apache/incubator-mxnet/blob/024b5a916dd3a39a39031ce5e6565cd7d9d60fe2/src/operator/nn/cudnn/cudnn_algoreg.cc#L37

On Thu, Oct 4, 2018 at 8:58 PM Marco de Abreu wrote:

> For GPU, we don't run any tests in parallel.
>
> -Marco
>
> Naveen Swamy wrote on Thu, Oct 4, 2018, 19:54:
>
> > Looking at the error raised, you can see that the workspace size(GPU mem
> > size) of 1GB isn't sufficient. I am wondering if it is due to tests
> > running in parallel on CI, if this is true(tests running in parallel) is
> > it possible to reduce the parallelism ?
> > Error:
> > "mxnet.base.MXNetError: [05:40:12]
> > src/operator/nn/./cudnn/cudnn_convolution-inl.h:870: Failed to find any
> > forward convolution algorithm. with workspace size of 1073741824 bytes,
> > please consider reducing batch/model size or increasing the workspace
> > size"
> >
> > I ran a similar test(test_slice_batchnorm) for 5K times and I couldn't
> > reproduce the issue. I will look into it further to see if there are
> > other alternatives.
> >
> > On Thu, Oct 4, 2018 at 10:48 AM Piyush Ghai wrote:
> >
> > > Another build where test_slice_batchnorm_reshape_batchnorm fails :
> > >
> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12721/7/pipeline
> > >
> > > --
> > > Piyush
> > >
> > > > On Oct 3, 2018, at 9:32 AM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > > >
> > > > Seems is not the only test:
> > > >
> > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12726/5/pipeline
> > > >
> > > > test_slice_batchnorm_reshape_batchnorm is also failing and hasn't been
> > > > touched for a while. It doesn't look like a problem with the test to
> > > > me, (not a flaky test). Looks to me that should find and address the
> > > > root cause instead of disabling the test in this case.
> > > >
> > > > Pedro.
> > > >
> > > > On Tue, Oct 2, 2018 at 2:39 AM Marco de Abreu wrote:
> > > >
> > > >> I have created an issue at
> > > >> https://github.com/apache/incubator-mxnet/issues/12715 and a PR to
> > > >> disable the test at
> > > >> https://github.com/apache/incubator-mxnet/pull/12716.
> > > >>
> > > >> This test is pretty new and was submitted with a number of other
> > > >> problematic (and disabled) tests:
> > > >> https://github.com/apache/incubator-mxnet/issues/11164 It could be
> > > >> possible that the test is simply not stable enough. The PR that
> > > >> introduced that test is
> > > >> https://github.com/apache/incubator-mxnet/pull/10921 - it was merged
> > > >> two days ago.
> > > >>
> > > >> Best regards,
> > > >> Marco
> > > >>
> > > >> On Tue, Oct 2, 2018 at 8:43 AM Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > > >>
> > > >>> Thanks for checking Lin. If it happens again we will have to dig
> > > >>> deeper. We have just one executor in GPU so I wonder what could be
> > > >>> the root cause of this.
> > > >>>
> > > >>> On Mon, Oct 1, 2018 at 10:57 PM Lin Yuan wrote:
> > > >>>
> > > >>>> I could not reproduce the error on an EC2 g3x8 instance making it
> > > >>>> hard to debug. I also suspect it was due to resource usage limit on
> > > >>>> ci Instance.
> > > >>>>
> > > >>>> On Mon, Oct 1, 2018 at 10:40 PM Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > > >>>>
> > > >>>>> It doesn't look like flakiness to me at first sight. I think it
> > > >>>>> might be related to resource usage / allocation / leak in the
> > > >>>>> worst case.
> > > >>>>>
> > > >>>>> Could be that there was not enough memory GPU memory at the time
> > > >>>>> of test execution. But I'm just speculating, hence my original
> > > >>>>> question.
> > > >>>>>
> > > >>>>> Pedro.
> > > >>>>>
> > > >>>>> On Mon, Oct 1, 2018 at 8:16 PM Lin Yuan wrote:
> > > >>>>>
> > > >>>>>> Hi Pedro,
> > > >>>>>>
> > > >>>>>> I also got this failure in my PR
> > > >>>>>>
> > > >>>>>> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11742/27/pipeline
> > > >>>>>>
> > > >>>>>> I was not able to identify the root cause of it from changelist.
> > > >>>>>> Are you suggesting there is some flakiness in the master branch
> > > >>>>>> too?
> > > >>>>>>
> > > >>>>>> Thanks,
> > > >>>>>>
> > > >>>>>> Lin
> > > >>>>>>
> > > >>>>>> On Mon, Oct 1, 2018 at 4:55 PM Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > > >>>>>>
> > > >>>>>>> Hi
> > > >>>>>>>
> > > >>>>>>> I saw this failure on CI:
> > > >>>>>>>
> > > >>>>>>> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1697/pipeline
> > > >>>>>>>
> > > >>>>>>> Have you seen other cases where we fail to select the best CUDNN
> > > >>>>>>> algorithm? In which circumstances this could happen, and do you
> > > >>>>>>> think is a good idea to have one selected by default as a last
> > > >>>>>>> resort?
> > > >>>>>>>
> > > >>>>>>> Pedro.
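
For reference, the "new process each time" suggestion at the top of this message could be sketched roughly as below. This is only an illustration, not something taken from the MXNet CI scripts: the helper name run_in_fresh_processes, the use of nose as the test runner, and the test-file path are all assumptions. The one idea it relies on is that any static-scope cache (such as the algorithm registry linked above) starts empty in a newly spawned interpreter.

```python
import subprocess
import sys

# Hypothetical command: the test-file path is a placeholder, and running the
# test through nose is an assumption about the local setup.
TEST_CMD = [sys.executable, "-m", "nose",
            "path/to/test_file.py:test_slice_batchnorm_reshape_batchnorm"]


def run_in_fresh_processes(cmd, runs):
    """Run `cmd` `runs` times, each in a newly spawned process.

    Every run starts a new interpreter, so process-wide static state
    (for example a cached algorithm selection) cannot carry over from
    one run to the next.

    Returns a list of (run_index, combined_output) for runs that exited
    with a non-zero status.
    """
    failures = []
    for i in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold stderr into stdout
            text=True,
        )
        if result.returncode != 0:
            failures.append((i, result.stdout))
    return failures
```

Calling run_in_fresh_processes(TEST_CMD, 5000) would then mirror the 5K-iteration experiment quoted above, except that each iteration begins with an empty cache, and the returned list holds the output of any run that failed.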