Date: Fri, 17 Feb 2017 22:46:44 +0000 (UTC)
From: "stack (JIRA)"
To: dev@hbase.apache.org
Subject: [jira] [Resolved] (HBASE-17653) HBASE-17624 rsgroup synchronizations will (distributed) deadlock

[ https://issues.apache.org/jira/browse/HBASE-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-17653.
---------------------------
       Resolution: Fixed
      Hadoop Flags: Reviewed
    Fix Version/s: 2.0.0

Pushed to master. Thanks for the review [~toffer]

> HBASE-17624 rsgroup synchronizations will (distributed) deadlock
> ----------------------------------------------------------------
>
>                 Key: HBASE-17653
>                 URL: https://issues.apache.org/jira/browse/HBASE-17653
>             Project: HBase
>          Issue Type: Bug
>          Components: rsgroup
>            Reporter: stack
>            Assignee: stack
>             Fix For: 2.0.0
>
>         Attachments: HBASE-17653.master.001.patch, HBASE-17653.master.002.patch, HBASE-17653.master.003.patch
>
>
> Follow-on from HBASE-17624, which made it so only one thread at a time has access to the rsgroup administrator. In the tail of HBASE-17624, [~toffer] describes a scenario under which we may end up in a (distributed) deadlock. Let me repeat [~toffer]'s comment:
> {code}
> Both read and write access can't be single-threaded. Consider the situation:
> 1. move_rsgroup_servers is called.
> 2. While #1 is happening, the rsgroup region is in transition (the rpc thread in #1 holds the monitor lock).
> 3. While #2 is happening, meta is in transition.
> The balancer tries to figure out a plan for the meta region and tries to get the monitor lock, but can't. The rpc thread task won't release the monitor lock since the rsgroup region never gets assigned. The rsgroup region never gets assigned because it can't update meta with its new state.
> There's a good chance this can be reproduced just by moving both the rsgroup and meta regions onto the same RS and calling move_rsgroup_servers on that same RS.
> A bunch of different actors will query group affiliation, so we can't have writes block reads.
> ....
> In the code prior to this patch, the getter methods that retrieve group information (getRSGroup, ofTable, ofServer, etc.) don't require the monitor lock, so the deadlock cycle is broken.
> ....
> The methods that do mutations and updates to zk and hbase:rsgroup are synchronized appropriately. Point me to where the incoherence is?
> {code}
> This issue is about testing/fixing/restoring rsgroup access. Will be back.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
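[Editor's note] The pattern [~toffer] describes, lock-free reads so that queries of group affiliation never block behind a mutation holding the monitor lock, can be sketched as follows. This is a minimal illustration, not the actual RSGroupAdminServer code: the class name `GroupInfoManagerSketch` and its methods are hypothetical, and the real implementation persists state to zk and hbase:rsgroup rather than an in-memory map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the locking discipline described above: reads go
// through a volatile, copy-on-write snapshot and never take the monitor
// lock; only mutations synchronize. A balancer thread asking for a server's
// group can therefore make progress even while a long-running
// move_rsgroup_servers-style call holds the lock, breaking the deadlock cycle.
public class GroupInfoManagerSketch {
    // Immutable snapshot published via a volatile write; readers never block.
    private volatile Map<String, String> serverToGroup = new ConcurrentHashMap<>();

    // Lock-free read path: safe to call from the balancer or any rpc handler,
    // even while a mutation below holds the monitor lock.
    public String getGroupOfServer(String server) {
        return serverToGroup.getOrDefault(server, "default");
    }

    // Synchronized write path: only mutations contend for the monitor lock.
    public synchronized void moveServer(String server, String targetGroup) {
        Map<String, String> copy = new ConcurrentHashMap<>(serverToGroup);
        copy.put(server, targetGroup);
        serverToGroup = copy; // publish the new snapshot
    }

    public static void main(String[] args) {
        GroupInfoManagerSketch mgr = new GroupInfoManagerSketch();
        mgr.moveServer("rs1.example.com,16020", "batch");
        System.out.println(mgr.getGroupOfServer("rs1.example.com,16020")); // batch
        System.out.println(mgr.getGroupOfServer("rs2.example.com,16020")); // default
    }
}
```

The copy-on-write map here stands in for whatever snapshot mechanism the real code uses; the point is only that the read path acquires no lock, which is what breaks the cycle between the rpc thread, the region assignment, and the meta update.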