zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Raul Gutierrez Segales (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
Date Mon, 01 Aug 2016 01:25:20 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401425#comment-15401425
] 

Raul Gutierrez Segales commented on ZOOKEEPER-2080:
---------------------------------------------------

[~hanm]: thanks for tracking this down and for the patch! A few questions/asks, looking at
the code:

{code}
Election election = null;
synchronized(self) {
    try {
        rqv = self.configFromString(new String(b));
        QuorumVerifier curQV = self.getQuorumVerifier();
        if (rqv.getVersion() > curQV.getVersion()) {
            LOG.info("{} Received version: {} my version: {}", self.getId(),
                    Long.toHexString(rqv.getVersion()),
                    Long.toHexString(self.getQuorumVerifier().getVersion()));
            if (self.getPeerState() == ServerState.LOOKING) {
                LOG.debug("Invoking processReconfig(), state: {}", self.getServerState());
                self.processReconfig(rqv, null, null, false);
                if (!rqv.equals(curQV)) {
                    LOG.info("restarting leader election");
                    // Signaling quorum peer to restart leader election.
                    self.shuttingDownLE = true;
                     // Get a hold of current leader election object of quorum peer,
                    // so we can clean it up later without holding the lock of quorum
                    // peer. If we shutdown current leader election we will run into
                    // potential deadlock. See ZOOKEEPER-2080 for more details.
                    election = self.getElectionAlg();
                }
            } else {
                LOG.debug("Skip processReconfig(), state: {}", self.getServerState());
            }
        }
    } catch (IOException e) {
        LOG.error("Something went wrong while processing config received from {}", response.sid);
   } catch (ConfigException e) {
       LOG.error("Something went wrong while processing config received from {}", response.sid);
   }
}
{code}

Do we really need to synchronize around self for the first part:

{code}
rqv = self.configFromString(new String(b));
QuorumVerifier curQV = self.getQuorumVerifier();
if (rqv.getVersion() > curQV.getVersion()) {

{code}

? Sounds like that can be done without synchronizing... no? 

Also, given you've spent a good amount of cycles untangling the dependencies around locking
QuorumPeer, could you maybe add a comment before the synchronize(self) block noting why it
is needed and who else might be contending for this lock. Thanks so much!

I think unit testing these things is a bit tricky, we might get a better return by just keeping
better comments around synchronized regions and generally keeping them well maintained (imho).
So I am happy to +1 without tests. 

> ReconfigRecoveryTest fails intermittently
> -----------------------------------------
>
>                 Key: ZOOKEEPER-2080
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
>             Project: ZooKeeper
>          Issue Type: Sub-task
>            Reporter: Ted Yu
>            Assignee: Michael Han
>             Fix For: 3.5.3, 3.6.0
>
>         Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z,
repro-20150816.log, threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message