Date: Fri, 17 Feb 2017 00:11:41 +0000 (UTC)
From: "stack (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Updated] (HBASE-17653) HBASE-17624 rsgroup synchronizations will (distributed) deadlock

     [ https://issues.apache.org/jira/browse/HBASE-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-17653:
--------------------------
    Attachment: HBASE-17653.master.001.patch

> HBASE-17624 rsgroup synchronizations will (distributed) deadlock
> ----------------------------------------------------------------
>
>                 Key: HBASE-17653
>                 URL: https://issues.apache.org/jira/browse/HBASE-17653
>             Project: HBase
>          Issue Type: Bug
>          Components: rsgroup
>            Reporter: stack
>            Assignee: stack
>         Attachments: HBASE-17653.master.001.patch
>
>
> Follow-on from HBASE-17624. HBASE-17624 made it so only one thread at a time has access to the rsgroup administrator. In the tail of HBASE-17624, [~toffer] describes a scenario under which we may end up in a (distributed) deadlock. Let me repeat [~toffer]'s comment...
> {code}
> Both read/write access can't be single threaded. Consider the situation:
> 1. move_rsgroup_servers is called
> 2. while #1 is happening rsgroup region is in transition (rpc thread in #1 holds monitor lock)
> 3. while #2 is happening meta is in transition.
> Balancer tries to figure out a plan for the meta region and tries to get the monitor lock but can't. The rpc thread task won't release the monitor lock since the rsgroup region never gets assigned. The rsgroup region never gets assigned because it can't update meta with its new state.
> There's a good chance this can be reproduced just by moving both the rsgroup and meta regions onto the same RS and calling move_rsgroup_servers on that same RS.
> A bunch of different actors will query for group affiliation so we can't have writes block reads.
> ....
> In the code prior to this patch the getter methods that retrieve group information (getRSGroup, ofTable, OfServer, etc) don't require the monitor lock so the deadlock cycle is broken.
> ....
> The methods that do mutations and updates to zk and hbase:rsgroup are synchronized appropriately. Point me to where the incoherence is?
> {code}
> This issue is about testing/fixing/restoring rsgroup access. Will be back.
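
The point in the quoted comment, that getters must not need the monitor lock held by mutators, can be illustrated with a minimal sketch. This is not the HBase rsgroup code and not the attached patch. GroupRegistry, GroupRegistryDemo, their methods, and the "default" fallback group name are hypothetical names chosen for illustration. The sketch assumes writes are serialized with synchronized and publish an immutable snapshot through a volatile field, so a reader (the balancer in the scenario above) never waits on a writer that is stuck holding the monitor.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

/** Hypothetical stand-in for a group info holder; not HBase's actual class. */
final class GroupRegistry {
    // Immutable snapshot, replaced wholesale on every write.
    private volatile Map<String, String> serverToGroup = Collections.emptyMap();

    /** Read path: takes no monitor lock, only a volatile read of the snapshot. */
    String groupOfServer(String server) {
        return serverToGroup.getOrDefault(server, "default");
    }

    /** Write path: serialized on the monitor; publishes a fresh snapshot. */
    synchronized void moveServer(String server, String group) {
        Map<String, String> copy = new HashMap<>(serverToGroup);
        copy.put(server, group);
        serverToGroup = Collections.unmodifiableMap(copy);
    }

    /** Simulates a mutation that holds the monitor until some external event occurs. */
    synchronized void moveServerBlocking(String server, String group,
            CountDownLatch entered, CountDownLatch release) throws InterruptedException {
        entered.countDown();       // signal: the monitor is now held
        release.await();           // stand-in for "waiting on region assignment"
        moveServer(server, group); // reentrant: this thread already owns the monitor
    }
}

final class GroupRegistryDemo {
    /** Returns true if a read completes while a writer still holds the monitor. */
    static boolean readCompletesWhileWriterHoldsLock() {
        GroupRegistry registry = new GroupRegistry();
        registry.moveServer("rs1", "groupA");
        CountDownLatch entered = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread writer = new Thread(() -> {
            try {
                registry.moveServerBlocking("rs1", "groupB", entered, release);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();
        try {
            entered.await();                             // writer owns the monitor
            String seen = registry.groupOfServer("rs1"); // does not block
            release.countDown();                         // only now let the writer finish
            writer.join();
            return "groupA".equals(seen)
                && "groupB".equals(registry.groupOfServer("rs1"));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(readCompletesWhileWriterHoldsLock());
    }
}
```

If groupOfServer were also declared synchronized, the read in the demo would block until the writer was released, and the writer is released only after the read returns. That is the same cycle as in steps 1 to 3 of the quoted scenario, reduced to two threads.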