zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-3306) Node may not accessible due the the inconsistent ACL reference map after SNAP sync
Date Mon, 29 Apr 2019 17:51:00 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829568#comment-16829568

Hudson commented on ZOOKEEPER-3306:

SUCCESS: Integrated in Jenkins build ZooKeeper-trunk #500 (See [https://builds.apache.org/job/ZooKeeper-trunk/500/])
ZOOKEEPER-3306: Fixing node not accessible issue due the inconsistent (andor: rev 46b2018dbe008e010462e0dc06ddb73981d19d2a)
* (edit) zookeeper-server/src/test/java/org/apache/zookeeper/server/persistence/FileTxnSnapLogTest.java
* (edit) zookeeper-server/src/main/java/org/apache/zookeeper/server/DataTree.java

> Node may not accessible due the the inconsistent ACL reference map after SNAP sync 
> -----------------------------------------------------------------------------------
>                 Key: ZOOKEEPER-3306
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3306
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.4, 3.6.0, 3.4.13
>            Reporter: Fangmin Lv
>            Assignee: Fangmin Lv
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 3.6.0
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
> This is a new bug we found on production.
> ZooKeeper uses ACL reference id and count to save the space in snapshot. During fuzzy
snapshot sync, the reference count may not be updated correctly in case like the znode is
already exist.
> When ACL reference count reaches 0, it will be deleted from the system, but actually
there might be other nodes still using it. And when visiting a node with the deleted ACL id,
it will be rejected because it doesn't exist anymore.
> Here is the detailed flow for one of the scenario here:
>  # Server A starts to have snap sync with leader
>  # After serializing the ACL map to Server A, there is a txn T1 to create a node N1
with new ACL_1 which was not exist in ACL map
>  # On leader, after this txn, the ACL map will be ID1 -> (ACL_1, COUNT: 1), and data
tree N1 -> ID1
>  # On server A, it will be empty ACL map, and N1 -> ID1 in fuzzy snapshot
>  # When replaying the txn T1, it will skip at the beginning since the node is already
exist, which leaves an empty ACL map, and N1 is referencing to a non-exist ACL ID1
>  # Node N1 will be not accessible because the ACL not exist, and if it became leader
later then all the write requests will be rejected as well with marshalling error.
> We're still working on the fix, suggestions are welcome.

This message was sent by Atlassian JIRA

View raw message