hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-3065) Retry all 'retryable' zk operations; e.g. connection loss
Date Sat, 30 Apr 2011 20:57:03 GMT

     [ https://issues.apache.org/jira/browse/HBASE-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

stack updated HBASE-3065:
-------------------------

    Attachment: 3065-v3.txt

Here is a non-reversed patch with a fix for compile error.

Would you mind taking a looksee Liyin to see why tests are failing?  Here is failure from
first test in the test suite (mvn clean test):

{code}
 t/s/org.apache.hadoop.hbase.master.TestHMasterRPCException.txt                          
                                                                                         
                                                          
-------------------------------------------------------------------------------
Test set: org.apache.hadoop.hbase.master.TestHMasterRPCException
-------------------------------------------------------------------------------
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.348 sec <<< FAILURE!
testRPCException(org.apache.hadoop.hbase.master.TestHMasterRPCException)  Time elapsed: 0.312
sec  <<< ERROR!
org.apache.hadoop.hbase.ZooKeeperConnectionException: master:57938-0x12fa82ed2230000 Unexpected
KeeperException creating base node
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:160)
    at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:236)
    at org.apache.hadoop.hbase.master.TestHMasterRPCException.testRPCException(TestHMasterRPCException.java:46)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
    at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:62)
    at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:140)
    at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:165)
    at org.apache.maven.surefire.Surefire.run(Surefire.java:107)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:289)
    at org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1005)
Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode =
Session expired for /hbase/unassigned
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:809)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:837)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:197)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:807)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:155)
    ... 28 more
{code}

> Retry all 'retryable' zk operations; e.g. connection loss
> ---------------------------------------------------------
>
>                 Key: HBASE-3065
>                 URL: https://issues.apache.org/jira/browse/HBASE-3065
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: Liyin Tang
>             Fix For: 0.92.0
>
>         Attachments: 3065-v3.txt, HBase-3065[r1088475]_1.patch, hbase3065_2.patch
>
>
> The 'new' master refactored our zk code tidying up all zk accesses and coralling them
behind nice zk utility classes.  One improvement was letting out all KeeperExceptions letting
the client deal.  Thats good generally because in old days, we'd suppress important state
zk changes in state.  But there is at least one case the new zk utility could handle for the
application and thats the class of retryable KeeperExceptions.  The one that comes to mind
is conection loss.  On connection loss we should retry the just-failed operation.  Usually
the retry will just work.  At worse, on reconnect, we'll pick up the expired session event.

> Adding in this change shouldn't be too bad given the refactor of zk corralled all zk
access into one or two classes only.
> One thing to consider though is how much we should retry.  We could retry on a timer
or we could retry for ever as long as the Stoppable interface is passed so if another thread
has stopped or aborted the hosting service, we'll notice and give up trying.  Doing the latter
is probably better than some kinda timeout.
> HBASE-3062 adds a timed retry on the first zk operation.  This issue is about generalizing
what is over there across all zk access.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message