Return-Path: X-Original-To: apmail-hbase-issues-archive@www.apache.org Delivered-To: apmail-hbase-issues-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 4B1949EE7 for ; Tue, 19 Jun 2012 23:48:43 +0000 (UTC) Received: (qmail 1481 invoked by uid 500); 19 Jun 2012 23:48:42 -0000 Delivered-To: apmail-hbase-issues-archive@hbase.apache.org Received: (qmail 1411 invoked by uid 500); 19 Jun 2012 23:48:42 -0000 Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list issues@hbase.apache.org Received: (qmail 1328 invoked by uid 99); 19 Jun 2012 23:48:42 -0000 Received: from issues-vm.apache.org (HELO issues-vm) (140.211.11.160) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 19 Jun 2012 23:48:42 +0000 Received: from isssues-vm.apache.org (localhost [127.0.0.1]) by issues-vm (Postfix) with ESMTP id A6E6714002F for ; Tue, 19 Jun 2012 23:48:42 +0000 (UTC) Date: Tue, 19 Jun 2012 23:48:42 +0000 (UTC) From: "Jean-Daniel Cryans (JIRA)" To: issues@hbase.apache.org Message-ID: <552466584.32187.1340149722685.JavaMail.jiratomcat@issues-vm> In-Reply-To: <740349443.32168.1340149482547.JavaMail.jiratomcat@issues-vm> Subject: [jira] [Updated] (HBASE-6240) Race in HCM.getMaster stalls clients MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HBASE-6240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-6240: -------------------------------------- Attachment: HBASE-6240.patch Patch that adds the recheck inside the synchronized block, it fixes the issue in my test but I don't know if it breaks anything else at the moment. Only for 0.94 > Race in HCM.getMaster stalls clients > ------------------------------------ > > Key: HBASE-6240 > URL: https://issues.apache.org/jira/browse/HBASE-6240 > Project: HBase > Issue Type: Bug > Affects Versions: 0.94.0 > Reporter: Jean-Daniel Cryans > Priority: Critical > Fix For: 0.94.1 > > Attachments: HBASE-6240.patch > > > I found this issue trying to run YCSB on 0.94, I don't think it exists on any other branch. I believe that this was introduced in HBASE-5058 "Allow HBaseAdmin to use an existing connection". > The issue is that in HCM.getMaster it does this recipe: > # Check if the master is null and runs (if so, return) > # Grab a lock on masterLock > # nullify this.master > # try to get a new master > The issue happens at 3, it should re-run 1 since while you're waiting on the lock someone else could have already fixed it for you. What happens right now is that the threads are all able to set the master to null before others are able to get out of getMaster and it's a complete mess. > Figuring it out took me some time because it doesn't manifest itself right away, silent retries are done in the background. Basically the first clue was this: > {noformat} > Error doing get: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=10, exceptions: > Tue Jun 19 23:40:46 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed > Tue Jun 19 23:40:47 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed > Tue Jun 19 23:40:48 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed > Tue Jun 19 23:40:49 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed > Tue Jun 19 23:40:51 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed > Tue Jun 19 23:40:53 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed > Tue Jun 19 23:40:57 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed > Tue Jun 19 23:41:01 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed > Tue Jun 19 23:41:09 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed > Tue Jun 19 23:41:25 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed > {noformat} > This was caused by the little dance up in HBaseAdmin where it deletes "stale" connections... which are not stale at all. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira