zookeeper-issues mailing list archives

From "Fangmin Lv (Jira)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-3531) Synchronization on ACLCache cause cluster to hang when network/disk issues happen during datatree serialization
Date Wed, 04 Sep 2019 20:25:00 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-3531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922835#comment-16922835 ]

Fangmin Lv commented on ZOOKEEPER-3531:
---------------------------------------

Thanks [~chlou] for reporting the issue; I've commented on the PR.

> Synchronization on ACLCache cause cluster to hang when network/disk issues happen during datatree serialization
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3531
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3531
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.5.2, 3.5.3, 3.5.4, 3.5.5
>            Reporter: Chang Lou
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: fix.patch, generator.py
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> During our ZooKeeper fault injection testing, we observed that sometimes the ZK cluster could hang (requests time out, but node status shows OK). After inspecting the issue, we believe this is caused by performing I/O (serializing the ACLCache) inside a critical section. The bug is essentially similar to what is described in ZOOKEEPER-2201.
> org.apache.zookeeper.server.DataTree#serialize calls aclCache.serialize when doing DataTree serialization; however, org.apache.zookeeper.server.ReferenceCountedACLCache#serialize can get stuck at OutputArchive.writeInt due to network/disk issues. This can cause the system to hang in the same way as ZOOKEEPER-2201 (any attempt to create/delete/modify a DataNode will cause the leader to hang at the beginning of the request processor chain). The root cause is the lock contention between:
>  * org.apache.zookeeper.server.DataTree#serialize -> org.apache.zookeeper.server.ReferenceCountedACLCache#serialize 
>  * PrepRequestProcessor#getRecordForPath -> org.apache.zookeeper.server.DataTree#getACL(org.apache.zookeeper.server.DataNode) -> org.apache.zookeeper.server.ReferenceCountedACLCache#convertLong
> When the snapshot gets stuck in ACL serialization, it prevents all other operations on the ReferenceCountedACLCache. Since getRecordForPath calls ReferenceCountedACLCache#convertLong, any op that triggers getRecordForPath will cause the leader to hang at the beginning of the request processor chain:
> {code:java}
> org.apache.zookeeper.server.ReferenceCountedACLCache.convertLong(ReferenceCountedACLCache.java:87)
> org.apache.zookeeper.server.DataTree.getACL(DataTree.java:734)
>    - locked org.apache.zookeeper.server.DataNode@4a062b7d
> org.apache.zookeeper.server.ZKDatabase.aclForNode(ZKDatabase.java:371)
> org.apache.zookeeper.server.PrepRequestProcessor.getRecordForPath(PrepRequestProcessor.java:170)
>    - locked java.util.ArrayDeque@3f7394f7
> org.apache.zookeeper.server.PrepRequestProcessor.pRequest2Txn(PrepRequestProcessor.java:417)
> org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:757)
> org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:145)
> {code}
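> For context, the contention boils down to two synchronized methods on the same ReferenceCountedACLCache instance: the snapshot path writes to the OutputArchive while holding the monitor, and every ACL lookup from the request path needs that same monitor. The sketch below is a simplified paraphrase (field names and bodies are condensed, so it may not match the actual source exactly):
> {code:java}
> import java.io.IOException;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
>
> import org.apache.jute.OutputArchive;
> import org.apache.zookeeper.data.ACL;
>
> // Simplified paraphrase of ReferenceCountedACLCache, only to show the monitor contention.
> public class ReferenceCountedACLCacheSketch {
>     private final Map<Long, List<ACL>> longKeyMap = new HashMap<>();
>
>     // Called from DataTree#serialize while taking a snapshot. Every write to the
>     // OutputArchive (possibly backed by a slow disk or network stream) happens
>     // while the object's monitor is held.
>     public synchronized void serialize(OutputArchive oa) throws IOException {
>         oa.writeInt(longKeyMap.size(), "map");
>         for (Map.Entry<Long, List<ACL>> entry : longKeyMap.entrySet()) {
>             oa.writeLong(entry.getKey(), "long");
>             oa.startVector(entry.getValue(), "acls"); // can block indefinitely
>             for (ACL acl : entry.getValue()) {
>                 oa.writeRecord(acl, "acl");
>             }
>             oa.endVector(entry.getValue(), "acls");
>         }
>     }
>
>     // Called from PrepRequestProcessor#getRecordForPath for every ACL lookup.
>     // If serialize() is stuck in a write above, this call blocks behind it and
>     // the request processor chain stalls.
>     public synchronized List<ACL> convertLong(Long longVal) {
>         return longKeyMap.get(longVal);
>     }
> }
> {code}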
> Similar to ZOOKEEPER-2201, the leader can still send out heartbeats, so the cluster will not recover until the network/disk issue resolves.
> Steps to reproduce this bug:
>  # start a cluster with 1 leader and n followers
>  # manually create some ACLs to enlarge the window for dumping ACLs, so the snapshot is more likely to hang at serializing the ACLCache when a delay happens (we wrote a script to generate such workloads, see attachments; a sketch of the idea follows this list)
>  # inject long network/disk write delays and run some benchmarks to trigger snapshots
>  # once stuck, observe that new requests to the cluster fail
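> A minimal sketch of such a workload using the plain ZooKeeper Java client (connect string, paths, and counts are illustrative; the attached generator.py is what we actually used):
> {code:java}
> import java.util.Collections;
>
> import org.apache.zookeeper.CreateMode;
> import org.apache.zookeeper.ZooDefs;
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.ACL;
> import org.apache.zookeeper.data.Id;
>
> public class AclCacheFiller {
>     public static void main(String[] args) throws Exception {
>         ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, event -> { });
>
>         // Each znode gets a distinct ACL, so every create adds a new entry to the
>         // ReferenceCountedACLCache and enlarges the ACL serialization window.
>         for (int i = 0; i < 10000; i++) {
>             ACL acl = new ACL(ZooDefs.Perms.ALL,
>                     new Id("ip", "10.0." + (i / 256) + "." + (i % 256)));
>             zk.create("/acl-node-" + i, new byte[0],
>                     Collections.singletonList(acl), CreateMode.PERSISTENT);
>         }
>         zk.close();
>     }
> }
> {code}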
> Essentially, the core problem is that the OutputArchive write should not happen inside the synchronized block. A straightforward solution is to move the writes out of the synchronized block: make a copy of the cache contents while holding the lock, then perform the vulnerable network/disk writes afterwards (see the sketch below). The patch for this solution is attached and verified. Another, more systematic fix would be to replace the synchronized methods in ReferenceCountedACLCache with a ConcurrentHashMap.
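> Building on the simplified class sketched earlier, the copy-then-write approach would look roughly like this (again a sketch with illustrative names; the attached fix.patch is the authoritative change):
> {code:java}
>     public void serialize(OutputArchive oa) throws IOException {
>         Map<Long, List<ACL>> clonedMap;
>         // Hold the lock only long enough to take a consistent copy of the map;
>         // the potentially slow OutputArchive writes happen outside the critical
>         // section, so convertLong() and the request processors stay unblocked.
>         synchronized (this) {
>             clonedMap = new HashMap<>(longKeyMap);
>         }
>         oa.writeInt(clonedMap.size(), "map");
>         for (Map.Entry<Long, List<ACL>> entry : clonedMap.entrySet()) {
>             oa.writeLong(entry.getKey(), "long");
>             oa.startVector(entry.getValue(), "acls");
>             for (ACL acl : entry.getValue()) {
>                 oa.writeRecord(acl, "acl");
>             }
>             oa.endVector(entry.getValue(), "acls");
>         }
>     }
> {code}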
> We double-checked that the issue remains on the latest master branch (68c21988d55c57e483370d3ee223c22da2d1bbcf).
> Attachments are: 1) a patch with the fix and a regression test; 2) a script to generate workloads that fill the ACL cache.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
