hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Francis Liu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-17653) HBASE-17624 rsgroup synchronizations will (distributed) deadlock
Date Thu, 16 Feb 2017 18:40:41 GMT

    [ https://issues.apache.org/jira/browse/HBASE-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870481#comment-15870481

Francis Liu commented on HBASE-17653:

Thanks [~stack]. Here are my responses:

Throughout this class we're inconsistent with locking on the object returned from this call.
From reading both this class' use, the interface RSGroupInfoManager, and the default implementation
it's not clear to me what correct synchronized handling is supposed to look like.
The strategy is the same as RSgroupInfoManager. Reads don't lock while mutations do. 

However, we synchronize for removing groups but not adding them and sometimes when retrieving
them, so my intuition is that something is incorrect.
This is just cosmetic. The method being called acquires the same lock. You can wrap it in
the same synchronized(manager) block in the class the effect is the same.  Or did I miss something?

if it's rsGroupMap, we still have methods that make use of the map without a proper synchronization
Same reason reads should not be blocked by mutations.

reading through things, I think it's trying to keep updates to the reference for rsGroupMap
atomic (since I *think* all the things written there are unmodifiable, which would make the
unsynchronized access safe). It would be much clearer if we did that via an AtomicReference
or atleast documented that this is what our intention is for all these synchronized methods.
Yep correct making all updates to the map atomic so invariants aren't broken (ie tables or
servers don't popup in more than one group). Not sure why it'd be much clearer with AtomicReference,
volatile should suffice IMHO. Yep document should be enough for this? 

Then, in rb Duo Zhang questioned synchronization here https://reviews.apache.org/r/56570/diff/1?file=1630642#file1630642line221
RSGroupInfoManagerImpl is used by RSGroupAdminServer. It holds the monitor for RSGroupInfoManagerImpl
across a set of operations making the reviewer think this method does not need synchronization.
Seems to me he is making a comment on your addition of the synchronized method not the code
prior to this patch?

All in all the changes seem to be minor? Add explicit synchronized blocks in the places that
does mutations to make it more readable and document the intent of why not synchronizing for
reads? BTW you guys could've just asked me instead of trying to read my mind ;-). 

> HBASE-17624 rsgroup synchronizations will (distributed) deadlock
> ----------------------------------------------------------------
>                 Key: HBASE-17653
>                 URL: https://issues.apache.org/jira/browse/HBASE-17653
>             Project: HBase
>          Issue Type: Bug
>          Components: rsgroup
>            Reporter: stack
>            Assignee: stack
> Follow-on from HBASE-17624. HBASE-17624 made it so one thread only has access to the
rsgroup administrator. In tail of HBASE-17624 [~toffer] describes scenario under which we
 may end up in a deadlock (distributed). Let me repeat [~toffer] comment...
> {code}
> Both read/write access can't be single threaded. Consider the situation:
> 1. move_rsgroup_servers is called
> 2. while #1 is happening rsgroup region is in transition (rpc thread in #1 holds monitor
> 3. while #2 is happening meta is in transition.
> Balancer tries to figure out plan for meta region tries to get monitor lock but can't.
rpc thread task won't release monitor lock since rsgroup region never gets assigned. rsgroup
region never gets assigned because it can't update meta with new state.
> There's a good chance this can be reproduce just by moving both rsgroup and meta region
onto the same RS and call move_rsgoup_servers on the same RS.
> A bunch different actors will query from group affiliation so we can't have writes block
> ....
> In the code prior to this patch the getter methods that retrieve group information (getRSGroup,
ofTable, OfServer, etc) don't require the monitor lock so the deadlock cycle is broken.
> ....
> The methods that does mutations and updates to zk and hbase:rsgroup are synchronized
appropriately. Point me to where the incoherence is?
> {code}
> This issue is about testing/fixing/restoring rsgroup access. Will be back.

This message was sent by Atlassian JIRA

View raw message