zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-2684) Fix a crashing bug in the mixed workloads commit processor
Date Tue, 31 Oct 2017 20:21:00 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16227448#comment-16227448
] 

ASF GitHub Bot commented on ZOOKEEPER-2684:
-------------------------------------------

Github user afine commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/411#discussion_r148115810
  
    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/CommitProcessor.java ---
    @@ -246,33 +246,51 @@ public void run() {
                         }
     
                         /*
    -                     * Check if request is pending, if so, update it with the
    -                     * committed info
    +                     * Check if request is pending, if so, update it with the committed
info
                          */
                         LinkedList<Request> sessionQueue = pendingRequests
                                 .get(request.sessionId);
                         if (sessionQueue != null) {
                             // If session queue != null, then it is also not empty.
                             Request topPending = sessionQueue.poll();
                             if (request.cxid != topPending.cxid) {
    -                            LOG.error(
    -                                    "Got cxid 0x"
    -                                            + Long.toHexString(request.cxid)
    -                                            + " expected 0x" + Long.toHexString(
    -                                                    topPending.cxid)
    -                                    + " for client session id "
    -                                    + Long.toHexString(request.sessionId));
    -                            throw new IOException("Error: unexpected cxid for"
    -                                    + "client session");
    +                            // TL;DR - we should not encounter this scenario often under
normal load.
    +                            // We pass the commit to the next processor and put the pending
back with a warning.
    +
    +                            // Generally, we can get commit requests that are not at
the queue head after
    +                            // a session moved (see ZOOKEEPER-2684). Let's denote the
previous server of the session
    +                            // with A, and the server that the session moved to with
B (keep in mind that it is
    +                            // possible that the session already moved from B to a new
server C, and maybe C=A).
    +                            // 1. If request.cxid < topPending.cxid : this means that
the session requested this update
    +                            // from A, then moved to B (i.e., which is us), and now B
receives the commit
    +                            // for the update after the session already performed several
operations in B
    +                            // (and therefore its cxid is higher than that old request).
    +                            // 2. If request.cxid > topPending.cxid : this means that
the session requested an updated
    +                            // from B with cxid that is bigger than the one we know therefore
in this case we
    +                            // are A, and we lost the connection to the session. Given
that we are waiting for a commit
    +                            // for that update, it means that we already sent the request
to the leader and it will
    +                            // be committed at some point (in this case the order of
cxid won't follow zxid, since zxid
    +                            // is an increasing order). It is not safe for us to delete
the session's queue at this
    +                            // point, since it is possible that the session has newer
requests in it after it moved
    +                            // back to us. We just leave the queue as it is, and once
the commit arrives (for the old
    +                            // request), the finalRequestProcessor will see a closed
cnxn handle, and just won't send a
    +                            // response.
    +                            // Also note that we don't have a local session, therefore
we treat the request
    +                            // like any other commit for a remote request, i.e., we perform
the update without sending
    +                            // a response.
    +
    +                            LOG.warn("Got request " + request +
    +                                    " but we are expecting request " + topPending);
    +                            sessionQueue.addFirst(topPending);
    +                        } else {
    +                            // We want to send to the next processor our version of the
request,
    +                            // since it contains the session information that is needed
    +                            // for post update processing (e.g., using request.cnxn we
send a response to the client).
    +                            topPending.setHdr(request.getHdr());
    --- End diff --
    
    The explanation is really great. I was also hoping you could shed some light on this part
of the code. Why do we "want to send to the next processor our version of the request" and
why can we proceed in the other case if this is required here?


> Fix a crashing bug in the mixed workloads commit processor
> ----------------------------------------------------------
>
>                 Key: ZOOKEEPER-2684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2684
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.6.0
>         Environment: with pretty heavy load on a real cluster
>            Reporter: Ryan Zhang
>            Assignee: Ryan Zhang
>            Priority: Blocker
>         Attachments: ZOOKEEPER-2684.patch
>
>
> We deployed our build with ZOOKEEPER-2024 and it quickly started to crash with the following
error
> atla-buh-05-sr1.prod.twttr.net: 2017-01-18 22:24:42,305 - ERROR [CommitProcessor:2] -org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:268)
– Got cxid 0x119fa expected 0x11fc5 for client session id 1009079ba470055
> atla-buh-05-sr1.prod.twttr.net: 2017-01-18 22:32:04,746 - ERROR [CommitProcessor:2] -org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:268)
– Got cxid 0x698 expected 0x928 for client session id 4002eeb3fd0009d
> atla-buh-05-sr1.prod.twttr.net: 2017-01-18 22:34:46,648 - ERROR [CommitProcessor:2] -org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:268)
– Got cxid 0x8904 expected 0x8f34 for client session id 51b8905c90251
> atla-buh-05-sr1.prod.twttr.net: 2017-01-18 22:43:46,834 - ERROR [CommitProcessor:2] -org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:268)
– Got cxid 0x3a8d expected 0x3ebc for client session id 2051af11af900cc
> clearly something is not right in the new commit processor per session queue implementation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message