Mailing-List: contact dev-help@drill.apache.org; run by ezmlm
Reply-To: dev@drill.apache.org
MIME-Version: 1.0
References: <55408487.7050106@maprtech.com> <269638FA-9B80-4F81-A0AA-A8C48B32BFDA@maprtech.com>
Date: Wed, 29 Apr 2015 09:28:24 -0700
Subject: Re: TestDrillbitResilience broken? assertion errors; now slow/hung, with 278 threads!
From: Abdel Hakim Deneche
To: "dev@drill.apache.org"
Content-Type: multipart/alternative; boundary=001a113b50fedde2ea0514df7709

--001a113b50fedde2ea0514df7709
Content-Type: text/plain; charset=UTF-8

On Wed, Apr 29, 2015 at 9:15 AM, Jacques Nadeau wrote:

> Quick question re 10 runs: are these runs that are in parallel with all the
> unit tests or just this test?
>
> The other question is: how do we construct these tests so that it is
> extremely unlikely to get a failure even if processing is slow or threads
> are suspended?
>

The first problems we hit when processing is slow are JUnit timeouts. Once a
unit test times out, its corresponding query isn't cancelled and may continue
running in parallel with other unit tests from the same test class. Once the
@AfterClass method shuts down the drillbits, they may complain about
allocators not being closed because some queries are actually still running.

> On Wed, Apr 29, 2015 at 7:53 AM, Sudheesh Katkam wrote:
>
> > I am responsible for those tests. I ran the tests at least 10 times on my
> > Linux VM with 1 second pauses, all of which passed.
> >
> > On your second run, what different errors did you see?
> >
> > On your third run, are you able to reproduce the test case that hangs?
> >
> > Sorry that the message is not informative. I already have a patch which is
> > a slight improvement to Jacques' change that improves the message in those
> > tests.
> >
> > What tool did you use to get the thread count?
> >
> > - Sudheesh
> >
> > Sent from my iPhone. Pardon any typos.
> >
> > > On Apr 29, 2015, at 6:28 AM, Abdel Hakim Deneche <adeneche@maprtech.com> wrote:
> > >
> > > The message displayed in the first run actually contains two different
> > > issues:
> > >
> > > 1. The error message "Error shutting down Drillbit 'beta'" is most likely
> > > caused by this issue: DRILL-2878
> > >
> > > 2. The test that failed with a "java.lang.AssertionError: null" is most
> > > likely a bug because that unit test should not fail. I've seen this error
> > > before, but it only happens intermittently.
> > >
> > > The system error reported in the 3rd run is actually an "expected" injected
> > > exception, but 278 threads looks suspicious!!!
> > >
> > > On Wed, Apr 29, 2015 at 12:13 AM, Daniel Barclay <dbarclay@maprtech.com> wrote:
> > >
> > >> Does anyone know what's going on with TestDrillbitResilience (rebased
> > >> from master today)? (Is it working right?)
> > >>
> > >>
> > >> One run, via "mvn install", yielded assertion errors:
> > >>
> > >> ...
> > >> Error shutting down Drillbit "beta".
> > >> Tests run: 11, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 33.811 sec <<< FAILURE! - in org.apache.drill.exec.server.TestDrillbitResilience
> > >> cancelAfterEverythingIsCompleted(org.apache.drill.exec.server.TestDrillbitResilience)
> > >> Time elapsed: 1.468 sec <<< FAILURE!
> > >> java.lang.AssertionError: null
> > >> at org.apache.drill.exec.server.TestDrillbitResilience.assertCancelled(TestDrillbitResilience.java:459)
> > >> at org.apache.drill.exec.server.TestDrillbitResilience.cancelAfterEverythingIsCompleted(TestDrillbitResilience.java:565)
> > >>
> > >> cancelInMiddleOfFetchingResults(org.apache.drill.exec.server.TestDrillbitResilience)
> > >> Time elapsed: 1.496 sec <<< FAILURE!
> > >> java.lang.AssertionError: null
> > >> at org.apache.drill.exec.server.TestDrillbitResilience.assertCancelled(TestDrillbitResilience.java:459)
> > >> at org.apache.drill.exec.server.TestDrillbitResilience.cancelInMiddleOfFetchingResults(TestDrillbitResilience.java:510)
> > >>
> > >> Running
> > >> ...
> > >>
> > >>
> > >> A second run, run individually (but still via Maven) died with different
> > >> errors.
> > >>
> > >>
> > >> A third run, via "mvn install" again, seems hung after reporting this
> > >> (maybe expected) exception:
> > >>
> > >> Exception (no rows returned):
> > >> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> > >> run-try-end
> > >>
> > >>
> > >> [fb9cfe61-af6e-4c9c-b6ab-8a1b8725c6e9 on dev-linux2:31010]
> > >>
> > >>
> > >> The process is using only about 5% CPU--but has 278 threads!
> > >> (That includes about 35 threads all with the same name of "BitClient-1".)
> > >>
> > >>
> > >> Daniel
> > >>
> > >> --
> > >> Daniel Barclay
> > >> MapR Technologies
> > >
> > > --
> > >
> > > Abdelhakim Deneche
> > >
> > > Software Engineer
> >
>

--

Abdelhakim Deneche

Software Engineer

--001a113b50fedde2ea0514df7709--