hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-6240) Race in HCM.getMaster stalls clients
Date Mon, 25 Jun 2012 18:54:44 GMT

    [ https://issues.apache.org/jira/browse/HBASE-6240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13400772#comment-13400772
] 

Jean-Daniel Cryans commented on HBASE-6240:
-------------------------------------------

bq. Did you mean something else? 

You answered right. It seems to me that if we get an exception, then either the master is
down or it is unreachable. Unfortunately we don't have a concept for the latter.

Looking at the code we do have a isMasterRunning() that encapsulates some of logic about looking
up a master but strangely enough you'll never get *false* getting out of there, it's true
or MasterNotRunningException!

I think we should refactor this in a followup jira. In the meantime +1 on your patch Ram.
                
> Race in HCM.getMaster stalls clients
> ------------------------------------
>
>                 Key: HBASE-6240
>                 URL: https://issues.apache.org/jira/browse/HBASE-6240
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.94.1
>
>         Attachments: HBASE-6240.patch, HBASE-6240_1_0.94.patch
>
>
> I found this issue trying to run YCSB on 0.94, I don't think it exists on any other branch.
I believe that this was introduced in HBASE-5058 "Allow HBaseAdmin to use an existing connection".
> The issue is that in HCM.getMaster it does this recipe:
>  # Check if the master is null and runs (if so, return)
>  # Grab a lock on masterLock
>  # nullify this.master
>  # try to get a new master
> The issue happens at 3, it should re-run 1 since while you're waiting on the lock someone
else could have already fixed it for you. What happens right now is that the threads are all
able to set the master to null before others are able to get out of getMaster and it's a complete
mess.
> Figuring it out took me some time because it doesn't manifest itself right away, silent
retries are done in the background. Basically the first clue was this:
> {noformat}
> Error doing get: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
attempts=10, exceptions:
> Tue Jun 19 23:40:46 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException:
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
> Tue Jun 19 23:40:47 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException:
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
> Tue Jun 19 23:40:48 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException:
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
> Tue Jun 19 23:40:49 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException:
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
> Tue Jun 19 23:40:51 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException:
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
> Tue Jun 19 23:40:53 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException:
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
> Tue Jun 19 23:40:57 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException:
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
> Tue Jun 19 23:41:01 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException:
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
> Tue Jun 19 23:41:09 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException:
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
> Tue Jun 19 23:41:25 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException:
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
> {noformat}
> This was caused by the little dance up in HBaseAdmin where it deletes "stale" connections...
which are not stale at all.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message