From: Bhavin Thaker
Date: Thu, 03 May 2018 14:02:12 +0000
Subject: Re: segmentation fault in master using mkdlnn
To: dev@mxnet.incubator.apache.org

Hi Pedro, All,

1) I
would suggest that we run “ulimit -c unlimited” on every CI slave machine
at startup to enable core dumps, so that we can get a stack trace.

2) Valgrind on Python generates so much noise that extracting signal from
it is painful, but it is still worth trying it out and looking at the
messages towards the end, when the crash happens. Even a one-line Python
program generates noise under Valgrind, which demonstrates that Python
itself is not Valgrind-clean.

3) If there are C++ APIs that trigger the same functionality as the current
problematic use case, then one could write a small program to reproduce the
crash and then use Valgrind to get to the culprit portion of the code
quickly.

Bhavin Thaker.

On Thu, May 3, 2018 at 6:49 AM Pedro Larroy wrote:

> It's very difficult to reproduce, non-deterministic. We were also running
> without signal handlers in CI so there are no stack traces unfortunately.
>
> Care to elaborate why valgrind doesn't work with Python?
>
> On Thu, May 3, 2018 at 3:32 PM, Da Zheng wrote:
>
> > can we build it in CI? segfault doesn't happen infrequently.
> >
> > On May 2, 2018, at 11:34 PM, "Chris Olivier" wrote:
> >
> > > you can try Intel Inspector, which is like an enhanced version of
> > > valgrind with a GUI and whatnot.
> > >
> > > On Wed, May 2, 2018 at 9:42 PM Da Zheng wrote:
> > >
> > > > valgrind doesn't work with Python. also, valgrind doesn't support
> > > > some CPU instructions used by MXNet (I think some instructions
> > > > related to random generator).
> > > >
> > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker
> > > > <bhavinthaker@gmail.com> wrote:
> > > > > Have you tried running with valgrind to get some clues on the
> > > > > root-cause?
> > > > >
> > > > > Bhavin Thaker.
> > > > >
> > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng wrote:
> > > > >
> > > > >> It might also be possible that this isn't an MKLDNN bug.
> > > > >> I just saw a similar memory error without MKLDNN build.
> > > > >>
> > > > >> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
> > > > >>
> > > > >> Best,
> > > > >> Da
> > > > >>
> > > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da wrote:
> > > > >> > There might be a race condition that causes the memory error.
> > > > >> > It might be caused by this PR:
> > > > >> > https://github.com/apache/incubator-mxnet/pull/10706/files
> > > > >> > This PR removes MKLDNN memory from NDArray.
> > > > >> > However, I don't know why this causes memory error. If someone
> > > > >> > is using the memory, it should still hold the memory with
> > > > >> > shared pointer.
> > > > >> > But I do see the memory error increase after this PR is merged.
> > > > >> >
> > > > >> > Best,
> > > > >> > Da
> > > > >> >
> > > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy"
> > > > >> > <pedro.larroy.lists@gmail.com> wrote:
> > > > >> >
> > > > >> > I couldn't reproduce locally with:
> > > > >> >
> > > > >> > ci/build.py -p ubuntu_cpu /work/runtime_functions.sh
> > > > >> > build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu
> > > > >> > /work/runtime_functions.sh unittest_ubuntu_python2_cpu
> > > > >> >
> > > > >> > On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy
> > > > >> > <pedro.larroy.lists@gmail.com> wrote:
> > > > >> >
> > > > >> > > Hi
> > > > >> > >
> > > > >> > > Seems master is not running anymore, there's a segmentation
> > > > >> > > fault using MKLDNN-CPU
> > > > >> > >
> > > > >> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662
> > > > >> > >
> > > > >> > > I see my PRs failing with a similar error.
> > > > >> > >
> > > > >> > > Pedro
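
A rough sketch of point 1) above, combined with Pedro's remark that CI was running without signal handlers: a Python test process can raise its own core-dump limit and install the standard library's fault handler, so that a segfault leaves a Python-level traceback and, if the host allows it, a core file. The function name `enable_crash_diagnostics` and its `trace_file` parameter are made up for illustration. This assumes a Unix host (the `resource` module is Unix-only) and is not a description of what the MXNet CI actually does.

```python
import faulthandler
import resource
import sys


def enable_crash_diagnostics(trace_file=None):
    """Hypothetical helper: approximate `ulimit -c unlimited` for this
    process and dump Python tracebacks on fatal signals such as SIGSEGV.

    Returns the (soft, hard) core-size limits in effect afterwards.
    """
    # An unprivileged process may only raise its soft limit up to the hard
    # limit, so use the hard limit instead of forcing RLIM_INFINITY.
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

    # faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS
    # and SIGILL and writes the traceback of every thread to the given
    # file. The file must stay open for as long as the handler is enabled.
    target = trace_file if trace_file is not None else sys.stderr
    faulthandler.enable(file=target, all_threads=True)

    return resource.getrlimit(resource.RLIMIT_CORE)
```

Calling `enable_crash_diagnostics()` once, at the start of the test runner, would be enough. Whether a core file is actually written still depends on the host's hard limit and on the kernel's core-dump settings, so the shell-level `ulimit -c unlimited` on the CI machines remains the more direct fix.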