Subject: Re: Getting unit tests to pass
From: Stack
To: HBase Dev List <dev@hbase.apache.org>
Date: Mon, 22 Jul 2013 22:01:55 -0700

nvm. I read the ResourceChecker code. It is just printing out before-and-after counts, so my speculation that we are up against fd limits is just off. Back to figuring out why tests fail at random....

St.Ack


On Mon, Jul 22, 2013 at 9:50 PM, Stack wrote:

> Here is another from the tail of https://issues.apache.org/jira/browse/HBASE-5995
>
> 2013-07-23 01:23:29,574 INFO [pool-1-thread-1] hbase.ResourceChecker(171): after: regionserver.wal.TestLogRolling#testLogRollOnPipelineRestart Thread=39 (was 31) - Thread LEAK? -, OpenFileDescriptor=312 (was 272) - OpenFileDescriptor LEAK? -, MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=351 (was 368), ProcessCount=144 (was 142) - ProcessCount LEAK? -, AvailableMemoryMB=906 (was 1995), ConnectionCount=0 (was 0)
>
> This one showed up as a zombie too; stuck.
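That "after: ... (was ...)" line is the whole of it: snapshot a few counts before the test, snapshot again after, print the deltas. A minimal sketch of the shape of that check (illustrative only, not the actual hbase.ResourceChecker code; the class and method names are made up):

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

// Illustrative before/after snapshot in the spirit of the resource checker report.
public class ResourceSnapshot {
  final int threads;
  final long openFds;
  final double load;

  private ResourceSnapshot(int threads, long openFds, double load) {
    this.threads = threads;
    this.openFds = openFds;
    this.load = load;
  }

  // Capture current counts. The fd count needs the com.sun Unix bean, hence the guard.
  static ResourceSnapshot take() {
    OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
    long fds = (os instanceof UnixOperatingSystemMXBean)
        ? ((UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount() : -1;
    return new ResourceSnapshot(
        ManagementFactory.getThreadMXBean().getThreadCount(), fds, os.getSystemLoadAverage());
  }

  // Print an "after: ... (was ...)" style line. It only reports; nothing fails the test.
  void reportAgainst(ResourceSnapshot before, String testName) {
    System.out.printf(
        "after: %s Thread=%d (was %d), OpenFileDescriptor=%d (was %d), SystemLoadAverage=%.0f (was %.0f)%n",
        testName, threads, before.threads, openFds, before.openFds, load, before.load);
  }
}

// Usage: take a snapshot before the test runs, run the test, then
// ResourceSnapshot.take().reportAgainst(before, "some.TestClass#testMethod");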
> Or here, https://builds.apache.org/view/H-L/view/HBase/job/HBase-TRUNK/, where we'd had a nice run of passing tests and then, all of a sudden, a test that I've not seen fail before fails:
>
> https://builds.apache.org/view/H-L/view/HBase/job/HBase-TRUNK/4282/
>
> org.apache.hadoop.hbase.master.TestActiveMasterManager.testActiveMasterManagerFromZK
>
> Near the end of the test, the resource checker reports:
>
> - Thread LEAK? -, OpenFileDescriptor=100 (was 92) - OpenFileDescriptor LEAK? -, MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=328 (was 331), ProcessCount=138 (was 138), AvailableMemoryMB=1223 (was 1246), ConnectionCount=0 (was 0)
>
> Getting tests to pass on these build boxes (other than hadoopqa, which is a different set of machines) seems unattainable.
>
> I will write infra about the 40k to see if they can do something about that.
>
> St.Ack
>
>
> On Mon, Jul 22, 2013 at 9:13 PM, Stack wrote:
>
>> By way of illustration of how loaded the Apache build boxes can be:
>>
>> Thread LEAK? -, OpenFileDescriptor=174 (was 162) - OpenFileDescriptor LEAK? -, MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=351 (was 383), ProcessCount=142 (was 144), AvailableMemoryMB=819 (was 892), ConnectionCount=0 (was 0)
>>
>> This seems to have caused a test that usually passes to fail: https://issues.apache.org/jira/browse/HBASE-9023
>>
>> St.Ack
>>
>>
>> On Mon, Jul 22, 2013 at 11:49 AM, Stack wrote:
>>
>>> Below is the state of the hbase 0.95/trunk unit tests (includes a little taxonomy of test failure type definitions).
>>>
>>> On Andrew's ec2 build box, 0.95 is passing most of the time:
>>>
>>> http://54.241.6.143/job/HBase-0.95/
>>> http://54.241.6.143/job/HBase-0.95-Hadoop-2/
>>>
>>> It is not as good on the Apache build boxes, but it is getting better:
>>>
>>> https://builds.apache.org/view/H-L/view/HBase/job/hbase-0.95/
>>> https://builds.apache.org/view/H-L/view/HBase/job/hbase-0.95-on-hadoop2/
>>>
>>> On Apache, I have seen loads up in the 500s and all file descriptors in use, according to the little resources report printed at the end of each test. If these numbers are to be believed (TBD), we may never achieve a 100% pass rate on Apache builds.
>>>
>>> Andrew's ec2 builds run the integration tests too, where the Apache builds do not -- sometimes we'll fail an integration test run, which makes the Andrew ec2 red/green ratio look worse than it actually is.
>>>
>>> Trunk builds lag. They are being worked on.
>>>
>>> We seem to be over the worst of the flakey unit tests. We have a few stragglers still, but they are being hunted down by the likes of the merciless Jimmy Xiang and Jeffrey Zhong.
>>>
>>> The "zombies" have mostly been nailed too (where "zombies" are tests that refuse to die, continuing after the suite has completed and causing the build to fail). The zombie trap from test-patch.sh was ported over to the apache and ec2 builds, and it caught the last of the undying.
>>>
>>> We are now into a new phase where "all" tests pass but the build still fails. Here is an example:
>>>
>>> http://54.241.6.143/job/HBase-TRUNK/429/org.apache.hbase$hbase-server/
>>>
>>> The only clue I have to go on is the fact that when we fail, the number of tests run is less than the total that shows for a successful run.
>>>
>>> Unless anyone has a better idea, to figure out why the hang, I compare the list of tests that show in a good run vs. those of a bad run. Tests that are in the good run but missing from the bad run are deemed suspect. In the absence of other evidence or other ideas, I am blaming these "invisibles" for the build fail.
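Concretely, that comparison is just a set difference over the "Running <test class>" lines surefire prints in the two console logs -- a rough sketch, with made-up log file names:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;

// Print the tests that show up in a good run's console log but not in a bad run's.
public class MissingTests {
  // Collect test class names from surefire's "Running <class>" lines.
  static Set<String> testsRun(String consoleLog) throws IOException {
    Set<String> tests = new LinkedHashSet<String>();
    BufferedReader reader = new BufferedReader(new FileReader(consoleLog));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.startsWith("Running org.apache.hadoop.hbase")) {
          tests.add(line.substring("Running ".length()).trim());
        }
      }
    } finally {
      reader.close();
    }
    return tests;
  }

  public static void main(String[] args) throws IOException {
    // Made-up file names: point these at saved console output from a good and a bad build.
    Set<String> suspects = testsRun("good-run-console.txt");
    suspects.removeAll(testsRun("bad-run-console.txt"));
    // Whatever is left ran in the good build but never showed in the bad one.
    for (String test : suspects) {
      System.out.println(test);
    }
  }
}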
>>> Here is an example:
>>>
>>> This is a good 0.95 hadoop2 run (notice how we are running integration tests tooooo and they succeed!! On hadoop2!!!!):
>>>
>>> http://54.241.6.143/job/HBase-0.95-Hadoop-2/669/
>>>
>>> In the hbase-server module:
>>>
>>> Tests run: 1491, Failures: 0, Errors: 0, Skipped: 19
>>>
>>> This is a bad run:
>>>
>>> http://54.241.6.143/job/HBase-0.95-Hadoop-2/668/
>>>
>>> Tests run: 1458, Failures: 0, Errors: 0, Skipped: 18
>>>
>>> If I compare tests, the successful run has:
>>>
>>> Running org.apache.hadoop.hbase.regionserver.wal.TestHLogSplitCompressed
>>>
>>> ... where the bad run does not show the above test. TestHLogSplitCompressed has 34 tests, one of which is disabled, so that would seem to account for the discrepancy (1491 - 1458 = 33).
>>>
>>> I've started to disable tests that fail like this, putting them aside for the original authors or the interested to take a look at why they fail occasionally. I put them aside so we can enjoy passing builds in the meantime. I've already moved aside or disabled a few tests and test classes:
>>>
>>> TestMultiTableInputFormat
>>> TestReplicationKillSlaveRS
>>> TestHCM.testDeleteForZKConnLeak was disabled
>>>
>>> ... and a few others.
>>>
>>> Finally (if you are still reading), I would suggest that test failures in hadoopqa are now more worthy of investigation. Illustrative is what happened recently around "HBASE-8983 HBaseConnection#deleteAllConnections", where the patch had +1s and, on its first run, a unit test failed (though it passed locally). The second run obscured the first run's failure. After digging by another contributor, it turned out the patch had actually broken that first failing test (though the failure looked unrelated). I would suggest that, now that tests are healthier, test failures are worth paying more attention to.
>>>
>>> Yours,
>>> St.Ack
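(One note on mechanics, since "moved aside or disabled" comes up above: a common way to put a flaky test aside is JUnit's @Ignore on the offending method or class, so the suite stays green while someone digs in. A sketch -- the class and method names below are placeholders, not real HBase tests, and not necessarily how each of the tests listed above was handled:)

import org.junit.Ignore;
import org.junit.Test;

public class TestSomethingFlaky {  // placeholder name

  // The reason string tells the next person why the test was put aside and where to look.
  @Ignore("Fails intermittently on loaded build boxes; see the associated JIRA before re-enabling")
  @Test
  public void testOccasionallyFailingThing() throws Exception {
    // body omitted; the @Ignore above is what keeps it out of the build
  }
}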