hadoop-zookeeper-dev mailing list archives

From "Flavio Paiva Junqueira (JIRA)" <j...@apache.org>
Subject [jira] Updated: (ZOOKEEPER-362) Issues with FLENewEpochTest
Date Fri, 03 Apr 2009 13:36:12 GMT

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flavio Paiva Junqueira updated ZOOKEEPER-362:
---------------------------------------------

    Attachment: ZOOKEEPER-362.patch

This patch fixes the problem in the description. More concretely, it does the following:

1- It synchronizes QuorumCnxManager::connectOne so that there are no competing connections
to the same server;
2- It doesn't remove an existing connection in QuorumCnxManager::receiveConnection when winning
the challenge;
3- It eliminates the second definition of "ss" in QuorumCnxManager::Listener. This was a pretty
silly bug (my fault of course);
4- It adds a deadline to semaphores in FLENewEpochTest so that the test doesn't wait indefinitely;
5- If thread 0 finishes before thread 1, then thread 1 initiates a new round after waiting
for 1s. This is what happens in a real deployment as a follower gives up on its elected leader
if the elected leader takes too long to acknowledge its leadership. As we don't run the follower/leader
part of the code in this test, moving to the next round doesn't happen automatically.
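
To illustrate item 4, here is a minimal sketch of a semaphore wait with a deadline. It is not
the code from ZOOKEEPER-362.patch; the class BoundedWait and the method awaitPermit are
hypothetical names used only for this example. It relies on the standard
java.util.concurrent.Semaphore.tryAcquire(long, TimeUnit) call, which returns false when the
deadline passes instead of blocking forever.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not taken from the patch.
final class BoundedWait {
    private BoundedWait() {}

    /**
     * Waits for one permit, giving up after timeoutMs milliseconds.
     * Returns true if the permit was acquired, false if the deadline
     * passed or the waiting thread was interrupted.
     */
    static boolean awaitPermit(Semaphore sem, long timeoutMs) {
        try {
            return sem.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A test can then fail with a clear message when awaitPermit returns false, rather than hanging
until the build times out.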

> Issues with FLENewEpochTest
> ---------------------------
>
>                 Key: ZOOKEEPER-362
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-362
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.1.1
>            Reporter: Flavio Paiva Junqueira
>             Fix For: 3.2.0
>
>         Attachments: ZOOKEEPER-362.patch
>
>
> I have been able to identify two reasons that cause FLENewEpochTest to fail:
> 1- There is a race condition that is triggered when two peers try to establish a connection
> to each other for leader election. Basically, if they start at roughly the same time, the
> server with the highest id will try to open two connections. The two competing connections
> cause one notification message to be lost. This message happens to be critical for this
> two-process scenario;
> 2- The code to shut down a peer is not working well with the unit tests. For this particular
> unit test, we need to be able to shut down a peer completely to check the situation the test
> tries to reproduce. However, it seems that in some runs timing causes the other peers to believe
> it is still alive, and they end up electing it. This peer, however, eventually shuts down and leader
> election fails.
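
The first failure cause, two competing connections to the same server, can be sketched as
follows. This is a simplified model under stated assumptions, not ZooKeeper's QuorumCnxManager:
the class ConnectionTable, its fields, and the string used in place of a socket are all
hypothetical. It only shows how making the connect method synchronized, and checking for an
existing connection inside it, keeps two threads from both opening a connection to the same
server id.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a connection manager; not ZooKeeper's actual class.
final class ConnectionTable {
    private final Map<Long, String> connections = new HashMap<>();
    private int opened = 0;

    /**
     * Opens a connection to server sid unless one already exists.
     * Returns true only if this call opened a new connection.
     * The synchronized keyword makes the check and the insert atomic.
     */
    synchronized boolean connectOne(long sid) {
        if (connections.containsKey(sid)) {
            return false; // a connection to this server already exists
        }
        connections.put(sid, "conn-" + sid); // placeholder for a real socket
        opened++;
        return true;
    }

    /** Number of connections opened so far. */
    synchronized int openedCount() {
        return opened;
    }
}
```

Without the synchronized keyword, two threads could both pass the containsKey check before
either inserts, which is the kind of duplicate connection described in point 1.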

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

