curator-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "J D (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CURATOR-233) Bug in double barrier
Date Mon, 24 Aug 2015 16:29:46 GMT

    [ https://issues.apache.org/jira/browse/CURATOR-233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709562#comment-14709562
] 

J D edited comment on CURATOR-233 at 8/24/15 4:29 PM:
------------------------------------------------------

Hi Mike,

Thanks for looking at this issue.

1. Test case above actually does show the bug
2. Pointer to error in source
3. Due to race conditions, the test case may not always work


1. The output from your test is actually showing the problem (the output messages from the
test are maybe a bit unclear as there are no clear fail/success messages)

09:25:05: 1 entered
09:25:05: 1 sleeping
09:25:05: 0 entered
09:25:05: 0 trying to leave
09:25:15: 1 woke up
09:25:15: 1 trying to leave
09:25:15: 0 left
09:25:15: 1 left

Client 0 is supposed to leave the 2nd barrier within 1 second (at 9:25:06), no matter if there
are enough members or not.
But in reality client 0 only leaves after 10 seconds. Together with client 1 after client
1 has woken up and joined the 2nd barrier.

Method used:

//From method description: Leave the barrier and block until all members have left or the
timeout has elapsed
public synchronized boolean leave(long maxWait, TimeUnit unit) 
{
}



2. Pointer to error in source code
The bug is in the following method

private boolean internalLeave(long startMs, boolean hasMaxWait, long maxWaitMs) throws Exception
{
//...
if (hasMaxWait)
{
  long elapsed = System.currentTimeMillis() - startMs;
  long thisWaitMs = maxWaitMs - elapsed;
  if (thisWaitMs <= 0)
    result = false; //Setting the result to false is not sufficient! Without a break statement
we are stuck in the loop forever (or until enough members have joined the second barrier)
  else
    wait(thisWaitMs);
}
//...
}

(Note: Unfortunately, after a quick look at the code when I discovered the issue, at that
time I had the impression it is not sufficient/correct to simply add a break statement to
fix the bug. I do not remember the problem with that solution in detail but either there were
some edge cases not covered or some cleanup that needed to be done.)


3. The unit test may not work every time due to race conditions. After your post above I made
a few more tests to confirm the issue. Surprisingly, there was one case where the test failed
(i.e. the issue did not occur and the program behaved correctly). I suspect the actual outcome
depends on the ordering of client 0/1's actions. This discovery may make it more difficult
to create a proper unit test for this issue.


17:37:01: 1 trying to enter
17:37:01: 0 trying to enter
17:37:02: 1 entered
17:37:02: 0 entered
17:37:02: 0 trying to leave
17:37:02: 1 sleeping
17:37:12: 1 woke up
17:37:12: 1 trying to leave
17:37:12: 0 left
17:37:12: 1 left
0 was supposed to leave at 17:37:03 (1 second after entry) but only left 9 seconds later

17:38:29: 1 trying to enter
17:38:29: 0 trying to enter
17:38:30: 0 entered
17:38:30: 0 trying to leave
17:38:30: 1 entered
17:38:30: 1 sleeping
17:38:31: 0 left
17:38:40: 1 woke up
17:38:40: 1 trying to leave
17:38:40: 1 left
0 left at 17:38:31 (1 second after entry), that is correct!!!

17:41:26: 0 trying to enter
17:41:26: 1 trying to enter
17:41:27: 1 entered
17:41:27: 0 entered
17:41:27: 0 trying to leave
17:41:27: 1 sleeping
17:41:37: 1 woke up
17:41:37: 1 trying to leave
17:41:38: 0 left
17:41:38: 1 left
0 was supposed to leave at 17:41:28 (1 second after entry) but only left 9 seconds later


17:42:07: 0 trying to enter
17:42:07: 1 trying to enter
17:42:07: 1 entered
17:42:07: 1 sleeping
17:42:07: 0 entered
17:42:07: 0 trying to leave
17:42:17: 1 woke up
17:42:17: 1 trying to leave
17:42:17: 0 left
17:42:17: 1 left
0 was supposed to leave at 17:42:08 (1 second after entry) but only left 9 seconds later


Again, thanks for looking at this issue and please let me know if you need any additional
information or support.

Best Regards,

J D


was (Author: lianro):
Hi Mike,

Thanks for looking at this issue.

1. Test case above actually does show the bug
2. Pointer to error in source
3. Due to race conditions, the test case may not always work


1. The output from your test is actually showing the problem (the output messages from the
test are maybe a bit unclear as there are no clear fail/success messages)

09:25:05: 1 entered
09:25:05: 1 sleeping
09:25:05: 0 entered
09:25:05: 0 trying to leave
09:25:15: 1 woke up
09:25:15: 1 trying to leave
09:25:15: 0 left
09:25:15: 1 left

Client 0 is supposed to leave the 2nd barrier within 1 second (at 9:25:06), no matter if there
are enough members or not.
But in reality client 0 only leaves after 10 seconds. Together with client 1 after client
1 has woken up and joined the 2nd barrier.

Method used:

//From method description: Leave the barrier and block until all members have left or the
timeout has elapsed
public synchronized boolean leave(long maxWait, TimeUnit unit) 
{
}



2. Pointer to error in source code
The bug is in the following method

private boolean internalLeave(long startMs, boolean hasMaxWait, long maxWaitMs) throws Exception
{
...
                if (hasMaxWait)
                {
                    long elapsed = System.currentTimeMillis() - startMs;
                    long thisWaitMs = maxWaitMs - elapsed;
                    if (thisWaitMs <= 0)
                        result = false; //Setting the result to false is not sufficient! Without
a break statement we are stuck in the loop forever (or until enough members have joined the
second barrier)
                    else
                        wait(thisWaitMs);
                }
...
}

(Note: Unfortunately, after a quick look at the code when I discovered the issue, at that
time I had the impression it is not sufficient/correct to simply add a break statement to
fix the bug. I do not remember the problem with that solution in detail but either there were
some edge cases not covered or some cleanup that needed to be done.)


3. The unit test may not work every time due to race conditions. After your post above I made
a few more tests to confirm the issue. Surprisingly, there was one case where the test failed
(i.e. the issue did not occur and the program behaved correctly). I suspect the actual outcome
depends on the ordering of client 0/1's actions. This discovery may make it more difficult
to create a proper unit test for this issue.


17:37:01: 1 trying to enter
17:37:01: 0 trying to enter
17:37:02: 1 entered
17:37:02: 0 entered
17:37:02: 0 trying to leave
17:37:02: 1 sleeping
17:37:12: 1 woke up
17:37:12: 1 trying to leave
17:37:12: 0 left
17:37:12: 1 left
0 was supposed to leave at 17:37:03 (1 second after entry) but only left 9 seconds later

17:38:29: 1 trying to enter
17:38:29: 0 trying to enter
17:38:30: 0 entered
17:38:30: 0 trying to leave
17:38:30: 1 entered
17:38:30: 1 sleeping
17:38:31: 0 left
17:38:40: 1 woke up
17:38:40: 1 trying to leave
17:38:40: 1 left
0 left at 17:38:31 (1 second after entry), that is correct!!!

17:41:26: 0 trying to enter
17:41:26: 1 trying to enter
17:41:27: 1 entered
17:41:27: 0 entered
17:41:27: 0 trying to leave
17:41:27: 1 sleeping
17:41:37: 1 woke up
17:41:37: 1 trying to leave
17:41:38: 0 left
17:41:38: 1 left
0 was supposed to leave at 17:41:28 (1 second after entry) but only left 9 seconds later


17:42:07: 0 trying to enter
17:42:07: 1 trying to enter
17:42:07: 1 entered
17:42:07: 1 sleeping
17:42:07: 0 entered
17:42:07: 0 trying to leave
17:42:17: 1 woke up
17:42:17: 1 trying to leave
17:42:17: 0 left
17:42:17: 1 left
0 was supposed to leave at 17:42:08 (1 second after entry) but only left 9 seconds later


Again, thanks for looking at this issue and please let me know if you need any additional
information or support.

Best Regards,

J D

> Bug in double barrier
> ---------------------
>
>                 Key: CURATOR-233
>                 URL: https://issues.apache.org/jira/browse/CURATOR-233
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.8.0
>            Reporter: J D
>             Fix For: awaiting-response
>
>         Attachments: DoubleBarrierClient.java, DoubleBarrierTester.java
>
>
> Hi,
> I think I discovered a bug in the internalLeave method of the double barrier implementation.
> When a client is told to leave the barrier after maxWait it does not do so. A flag is
set but the client does not leave the barrier, instead it keeps iterating through the control
loop and drives CPU usage to 100%.
> I have attached an example.
> Best regards
> Lianro



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message