hbase-dev mailing list archives

From Lars Francke <lars.fran...@gmail.com>
Subject Re: Getting unit tests to pass
Date Tue, 23 Jul 2013 06:54:21 GMT
Slightly related, sorry for hijacking: I can't get HBase trunk to
build. In particular TestHCM.testClusterStatus always fails for me. I
tried on my own Jenkins as well as my IDE (IntelliJ) with the same
result (two different machines, CentOS & Mac OS).

mvn -U -PrunAllTests -Dmaven.test.redirectTestOutputToFile=true
-Dit.test=noItTest clean install
<http://pastebin.com/upFjq09A>

From my MacBook's command line I got the test to pass using the same
command, but not in Jenkins or from IntelliJ.

I'm happy to post in a new thread if this is distracting and no one
else has seen this before.

Any ideas?

Thanks,
Lars

On Tue, Jul 23, 2013 at 7:01 AM, Stack <stack@duboce.net> wrote:
> nvm.  I read the resourcechecker code.  It is just printing out before and
> afters so my speculation that we are up against fd limits is just off.
>
> Back to figuring out why tests fail at random....
>
> St.Ack
>
>
> On Mon, Jul 22, 2013 at 9:50 PM, Stack <stack@duboce.net> wrote:
>
>> Here is another from tail of
>> https://issues.apache.org/jira/browse/HBASE-5995
>>
>> 2013-07-23 01:23:29,574 INFO  [pool-1-thread-1]
>> hbase.ResourceChecker(171): after:
>> regionserver.wal.TestLogRolling#testLogRollOnPipelineRestart Thread=39 (was
>> 31) - Thread LEAK? -, OpenFileDescriptor=312 (was 272) - OpenFileDescriptor
>> LEAK? -, MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=351 (was
>> 368), ProcessCount=144 (was 142) - ProcessCount LEAK? -,
>> AvailableMemoryMB=906 (was 1995), ConnectionCount=0 (was 0)
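
A report line like the one above can be reduced to just the counters that grew. This is only a rough sketch, not part of HBase: it assumes the line has already been un-wrapped into a single string with the quote prefixes removed, and the names RESOURCE, parse_resources and grown are made up for illustration.

```python
import re

# Matches fragments such as "Thread=39 (was 31)" in a ResourceChecker report.
RESOURCE = re.compile(r"(\w+)=(\d+)\s+\(was\s+(\d+)\)")

def parse_resources(line):
    """Map each resource name to a (before, after) pair of integers."""
    return {name: (int(was), int(now))
            for name, now, was in RESOURCE.findall(line)}

def grown(line):
    """Return only the resources whose value increased, with the increase."""
    return {name: now - was
            for name, (was, now) in parse_resources(line).items()
            if now > was}

if __name__ == "__main__":
    report = (
        "Thread=39 (was 31) - Thread LEAK? -, "
        "OpenFileDescriptor=312 (was 272) - OpenFileDescriptor LEAK? -, "
        "MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=351 (was 368), "
        "ProcessCount=144 (was 142) - ProcessCount LEAK? -, "
        "AvailableMemoryMB=906 (was 1995), ConnectionCount=0 (was 0)"
    )
    for name, delta in grown(report).items():
        print(f"{name}: +{delta}")
```

Growth in a counter is only a hint, the same way the report itself puts a question mark after LEAK.
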
>>
>> This one showed up as a zombie too; stuck.
>>
>> Or here, https://builds.apache.org/view/H-L/view/HBase/job/HBase-TRUNK/,
>> where we'd had a nice run of passing tests, all of a sudden a test that
>> I've not seen fail before fails:
>>
>> https://builds.apache.org/view/H-L/view/HBase/job/HBase-TRUNK/4282/
>>
>>
>> org.apache.hadoop.hbase.master.TestActiveMasterManager.testActiveMasterManagerFromZK
>>
>> Near the end of the test, the resource checker reports:
>>
>>  - Thread LEAK? -, OpenFileDescriptor=100 (was 92) - OpenFileDescriptor LEAK? -,
>> MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=328 (was 331), ProcessCount=138 (was
>> 138), AvailableMemoryMB=1223 (was 1246), ConnectionCount=0 (was 0)
>>
>>
>>
>> Getting tests to pass on these build boxes (other than hadoopqa which is a
>> different set of machines) seems unattainable.
>>
>> I will write infra about the 40k to see if they can do something about
>> that.
>>
>> St.Ack
>>
>>
>>
>>
>> On Mon, Jul 22, 2013 at 9:13 PM, Stack <stack@duboce.net> wrote:
>>
>>> By way of illustration of how loaded Apache build boxes can be:
>>>
>>> Thread LEAK? -, OpenFileDescriptor=174 (was 162) - OpenFileDescriptor LEAK? -,
>>> MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=351 (was 383), ProcessCount=142 (was
>>> 144), AvailableMemoryMB=819 (was 892), ConnectionCount=0 (was 0)
>>>
>>> This seems to have caused a test that usually passes to fail:
>>> https://issues.apache.org/jira/browse/HBASE-9023
>>>
>>> St.Ack
>>>
>>>
>>> On Mon, Jul 22, 2013 at 11:49 AM, Stack <stack@duboce.net> wrote:
>>>
>>>> Below is a state of hbase 0.95/trunk unit tests (Includes a little
>>>> taxonomy of test failure type definitions).
>>>>
>>>> On Andrew's ec2 build box, 0.95 is passing most of the time:
>>>>
>>>> http://54.241.6.143/job/HBase-0.95/
>>>> http://54.241.6.143/job/HBase-0.95-Hadoop-2/
>>>>
>>>> It is not as good on Apache build box but it is getting better:
>>>>
>>>> https://builds.apache.org/view/H-L/view/HBase/job/hbase-0.95/
>>>> https://builds.apache.org/view/H-L/view/HBase/job/hbase-0.95-on-hadoop2/
>>>>
>>>> On Apache, I have seen loads up in the 500s and all file descriptors
>>>> used according to the little resources report printed at the end of each
>>>> test.  If these numbers are to be believed (TBD), we may never achieve 100%
>>>> pass rate on Apache builds.
>>>>
>>>> Andrew's ec2 builds run the integration tests too where the apache
>>>> builds do not -- sometimes we'll fail an integration test run which makes
>>>> the Andrew ec2 red/green ratio look worse than it actually is.
>>>>
>>>> Trunk builds lag.  They are being worked on.
>>>>
>>>> We seem to be over the worst of the flakey unit tests.  We have a few
>>>> stragglers still but they are being hunted down by the likes of the
>>>> merciless Jimmy Xiang and Jeffrey Zhong.
>>>>
>>>> The "zombies" have been mostly nailed too (where "zombies" are tests
>>>> that refuse to die, continuing after the suite has completed and causing
>>>> the build to fail).  The zombie trap from test-patch.sh was ported over to
>>>> the apache and ec2 builds and it caught the last of the undying.
>>>>
>>>> We are now into a new phase where "all" tests pass but the build still
>>>> fails.  Here is an example:
>>>> http://54.241.6.143/job/HBase-TRUNK/429/org.apache.hbase$hbase-server/
>>>> The only clue I have to go on is the fact that when we fail, the number of
>>>> tests run is less than the total that shows for a successful run.
>>>>
>>>> Unless anyone has a better idea, to figure out why the build hangs, I
>>>> compare the list of tests that show in a good run vs. those of a bad run.  Tests that
>>>> are in the good run but missing from the bad run are deemed suspect.  In
>>>> the absence of  other evidence or other ideas, I am blaming these
>>>> "invisibles" for the build fail.
>>>>
>>>> Here is an example:
>>>>
>>>> This is a good 0.95 hadoop2 run (notice how we are running integration
>>>> tests tooooo and they succeed!!  On hadoop2!!!!):
>>>>
>>>> http://54.241.6.143/job/HBase-0.95-Hadoop-2/669/
>>>>
>>>> In hbase-server module:
>>>>
>>>> Tests run: 1491, Failures: 0, Errors: 0, Skipped: 19
>>>>
>>>>
>>>> This is a bad run:
>>>>
>>>> http://54.241.6.143/job/HBase-0.95-Hadoop-2/668/
>>>>
>>>> Tests run: 1458, Failures: 0, Errors: 0, Skipped: 18
>>>>
>>>>
>>>> If I compare tests, the successful run has:
>>>>
>>>> > Running
>>>> org.apache.hadoop.hbase.regionserver.wal.TestHLogSplitCompressed
>>>>
>>>>
>>>> ... where the bad run does not show the above test.
>>>>  TestHLogSplitCompressed has 34 tests one of which is disabled so that
>>>> would seem to account for the discrepancy.
>>>>
>>>> I've started to disable tests that fail like this, putting them aside
>>>> for original authors or the interested to take a look to see why they fail
>>>> occasionally.  I put them aside so we can enjoy passing builds in the
>>>> meantime.  I've already moved aside or disabled a few tests and test
>>>> classes:
>>>>
>>>> TestMultiTableInputFormat
>>>> TestReplicationKillSlaveRS
>>>> TestHCM.testDeleteForZKConnLeak was disabled
>>>>
>>>> ... and a few others.
>>>>
>>>> Finally (if you are still reading), I would suggest that test failures
>>>> in hadoopqa are now more worthy of investigation.   Illustrative is what
>>>> happened recently around "HBASE-8983 HBaseConnection#deleteAllConnections"
>>>> where the patch had +1s and on its first run, a unit test failed (though
>>>> it passed locally).  The second run obscured the first run's failure.  After
>>>> digging by another, the patch had actually broken the first test (though
>>>> it looked unrelated).  I would suggest that now that tests are healthier,
>>>> test failures are worth paying more attention to.
>>>>
>>>> Yours,
>>>> St.Ack
>>>>
>>>>
>>>>
>>>>
>>>
>>
