hbase-dev mailing list archives

From lars hofhansl <lhofha...@yahoo.com>
Subject Re: Build failed in Jenkins: HBase-0.94 #330
Date Tue, 17 Jul 2012 21:15:43 GMT
Thanks for doing this Andy.
I looked through the 0.94 Jenkins runs still available and found the following tests failing:

6 org.apache.hadoop.hbase.TestZooKeeper.testClientSessionExpired
5 org.apache.hadoop.hbase.replication.TestReplicationPeer.testResetZooKeeperSession
2 org.apache.hadoop.hbase.replication.TestReplication.queueFailover
2 org.apache.hadoop.hbase.client.TestAdmin.testEnableTableRoundRobinAssignment
2 org.apache.hadoop.hbase.TestDrainingServer.org.apache.hadoop.hbase.TestDrainingServer
1 org.apache.hadoop.hbase.client.TestAdmin.testOnlineChangeTableSchema
1 org.apache.hadoop.hbase.client.TestShell.testRunShellTests
1 org.apache.hadoop.hbase.coprocessor.TestMasterObserver.testRegionTransitionOperations
1 org.apache.hadoop.hbase.coprocessor.TestMasterObserver.testTableOperations
1 org.apache.hadoop.hbase.io.hfile.TestLruBlockCache.testBackgroundEvictionThread[1]
1 org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat.testExcludeMinorCompaction
1 org.apache.hadoop.hbase.master.TestAssignmentManager.testBalanceOnMasterFailoverScenarioWithClosedNode
1 org.apache.hadoop.hbase.master.TestAssignmentManager.testBalanceOnMasterFailoverScenarioWithOfflineNode
1 org.apache.hadoop.hbase.master.TestSplitLogManager.testVanishingTaskZNode
1 org.apache.hadoop.hbase.regionserver.TestServerCustomProtocol.testRowRange
1 org.apache.hadoop.hbase.regionserver.TestSplitLogWorker.testAcquireTaskAtStartup
1 org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster.testShutdownFixupWhenDaughterHasSplit
1 org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster.testSplitBeforeSettingSplittingInZK
1 org.apache.hadoop.hbase.util.TestFSUtils.testcomputeHDFSBlocksDistribution
1 org.apache.hadoop.hbase.util.TestHBaseFsck.testNotInMetaHole
1 org.apache.hadoop.hbase.util.TestHBaseFsck.testNotInMetaOrDeployedHole
1 org.apache.hadoop.hbase.util.TestHBaseFsck.testOverlapAndOrphan


I guess the takeaway is that there is no silver bullet here, but that we'd get some mileage
by fixing or disabling these two:
org.apache.hadoop.hbase.TestZooKeeper.testClientSessionExpired
org.apache.hadoop.hbase.replication.TestReplicationPeer.testResetZooKeeperSession

They account for 11 of 34 failed runs.

I'll file a jira for me to look at these.

-- Lars


----- Original Message -----
From: Andrew Purtell <apurtell@apache.org>
To: dev@hbase.apache.org
Cc: Lars Hofhansl <lhofhansl@yahoo.com>; michael stack <stack@duboce.net>
Sent: Tuesday, July 17, 2012 10:08 AM
Subject: Re: Build failed in Jenkins: HBase-0.94 #330

I dispatched each unit test individually to 20 EC2 c1.mediums (64-bit
system, 2 VCPUs, kind of slow on purpose but still allowing some
thread concurrency). On each instance, a test was run 100 times or
until failure. After Maven exited on each iteration, the process table
was checked for lingering surefire processes; if any were found, the
test was also reported as failed.
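
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the harness that was actually used: the class name RepeatUntilFailure and the helpers mavenPasses and surefireLingers are hypothetical, the process check assumes Java 9+ (ProcessHandle), and matching on the string "surefire" in the command line is an assumption about how a lingering fork would show up.

```java
import java.io.IOException;
import java.util.function.BooleanSupplier;
import java.util.function.IntPredicate;

final class RepeatUntilFailure {
    /** Outcome of a repeated run: the last iteration reached and why the loop stopped. */
    static final class Result {
        final int iterations;
        final String reason; // "passed", "test failed", or "lingering process"
        Result(int iterations, String reason) {
            this.iterations = iterations;
            this.reason = reason;
        }
    }

    /**
     * Runs a test up to maxIterations times, stopping at the first failure.
     * After each passing iteration the lingering check is consulted, and a
     * positive answer is treated as a failure as well.
     */
    static Result run(int maxIterations, IntPredicate runOnce, BooleanSupplier lingering) {
        for (int i = 1; i <= maxIterations; i++) {
            if (!runOnce.test(i)) {
                return new Result(i, "test failed");
            }
            if (lingering.getAsBoolean()) {
                return new Result(i, "lingering process");
            }
        }
        return new Result(maxIterations, "passed");
    }

    /** Hypothetical runner: one Maven invocation for a single test class. */
    static boolean mavenPasses(String testClass) {
        try {
            Process p = new ProcessBuilder("mvn", "test", "-Dtest=" + testClass)
                    .inheritIO()
                    .start();
            return p.waitFor() == 0;
        } catch (IOException e) {
            return false; // Maven could not be launched: count it as a failure
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Hypothetical check: is any live process's command line mentioning surefire? */
    static boolean surefireLingers() {
        return ProcessHandle.allProcesses()
                .anyMatch(p -> p.info().commandLine().orElse("").contains("surefire"));
    }
}
```

A call such as RepeatUntilFailure.run(100, i -> RepeatUntilFailure.mavenPasses("TestZooKeeper"), RepeatUntilFailure::surefireLingers) would then correspond to one test's 100 iterations on one instance.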

OS: Amazon Linux AMI release 2012.03
uname: Linux 3.2.21-1.32.6.amzn1.x86_64 #1 SMP Sat Jun 23 02:32:15 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
JVM: java version "1.6.0_24"
    OpenJDK Runtime Environment (IcedTea6 1.11.3) (amazon-52.1.11.3.45.amzn1-x86_64)
    OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Here are the tests that did not complete 100 successful runs in the above setup:

TestCatalogTracker
    hangs on a join in testServerNotRunningIOException waiting on a CT that is
    stuck on CatalogTracker.waitForMeta and will linger in the background

TestColumnSeeking
    testDuplicateVersions(org.apache.hadoop.hbase.regionserver.TestColumnSeeking):
    expected:<0> but was:<200>

TestAtomicOperation
    testMultiRowMutationMultiThreads(org.apache.hadoop.hbase.regionserver.TestAtomicOperation):
    expected:<0> but was:<1>

TestSplitLogManager
    testOrphanTaskAcquisition(org.apache.hadoop.hbase.master.TestSplitLogManager):
    java.lang.AssertionError

TestRegionRebalancing
    testRebalanceOnRegionServerNumberChange(org.apache.hadoop.hbase.TestRegionRebalancing):
    After 5 attempts, region assignments were not balanced.

TestDrainingServer
    junit.framework.AssertionFailedError from
    org.apache.hadoop.hbase.TestDrainingServer.setUpBeforeClass

TestMasterObserver
    testTableOperations(org.apache.hadoop.hbase.coprocessor.TestMasterObserver):
    org.apache.hadoop.hbase.InvalidFamilyOperationException: Column family 'fam2' does not exist
    testRegionTransitionOperations(org.apache.hadoop.hbase.coprocessor.TestMasterObserver):
    org.apache.hadoop.hbase.TableExistsException: observed_table

TestServerCustomProtocol
    testSingleMethod(org.apache.hadoop.hbase.regionserver.TestServerCustomProtocol):
    Results should contain region
    test,bbb,1342509423473.2c0326188f899f3e91ec5eb623959c13. for row 'bbb'

TestFromClientSide
    testPoolBehavior(org.apache.hadoop.hbase.client.TestFromClientSide):
    expected:<3> but was:<4>

TestZooKeeper
    testClientSessionExpired(org.apache.hadoop.hbase.TestZooKeeper)

TestReplication
    testDisableInactivePeer(org.apache.hadoop.hbase.replication.TestReplication):
    Shutting down

TestMasterReplication
    testSimplePutDelete(org.apache.hadoop.hbase.replication.TestMasterReplication):
    Waited too much time for put replication

TestMultiSlaveReplication
    testMultiSlaveReplication(org.apache.hadoop.hbase.replication.TestMultiSlaveReplication):
    Unable to add peer

TestReplicationPeer
    testResetZooKeeperSession(org.apache.hadoop.hbase.replication.TestReplicationPeer):
    ReplicationPeer ZooKeeper session was not properly expired.

I didn't get to them all before AWS yanked back my spot instances, but
I ordered the list from most likely to fail to least likely. The
remaining tests were in io.hfile.*, thrift.*, and util.*

I'll circle back, confirm each individually, and open JIRAs with more detail.

The cluster of replication test failures is a concern, but I've seen
in other environments such as this one that the tests are timing
dependent. On a slow or busy test system they can fail with "waited
too much time ...". One solution would be to not use the system
clock but instead an EnvironmentEdge, or something similar, that is
incremented only when the test process has CPU time. I haven't looked
into this in detail yet.
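
To make the idea concrete, here is a minimal sketch of a timeout that is measured against an injected clock instead of the system clock. It deliberately does not use HBase's EnvironmentEdge; TimeSource, ManualClock, and PatientWait are hypothetical names invented for this illustration, and advancing the clock once per poll is only one assumed way of tying elapsed "time" to the work the test process actually gets done.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

/** Hypothetical pluggable time source; a stand-in, not HBase's own interface. */
interface TimeSource {
    long currentTimeMillis();
}

/** A clock that moves only when the test advances it explicitly. */
final class ManualClock implements TimeSource {
    private final AtomicLong now = new AtomicLong();

    @Override
    public long currentTimeMillis() {
        return now.get();
    }

    void advance(long millis) {
        now.addAndGet(millis);
    }
}

final class PatientWait {
    /**
     * Polls the condition until it holds or until timeoutMillis of *clock* time
     * has elapsed. The step callback runs between polls and is expected to do
     * one unit of work and advance the clock, so a starved test process does
     * not burn through its timeout while it is not running.
     */
    static boolean waitFor(TimeSource clock, long timeoutMillis,
                           BooleanSupplier condition, Runnable step) {
        long deadline = clock.currentTimeMillis() + timeoutMillis;
        while (clock.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            step.run();
        }
        // One last check after the deadline has been reached.
        return condition.getAsBoolean();
    }
}
```

With a wall clock, the same loop on a slow or busy host could exhaust its timeout after only a few polls; with the manual clock, the number of polls before timing out is fixed by the test itself.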

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)

