activemq-dev mailing list archives

From "Sam hendley (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AMQ-4512) MemoryUsage left in inconsistent state
Date Mon, 06 May 2013 22:30:17 GMT

     [ https://issues.apache.org/jira/browse/AMQ-4512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sam hendley updated AMQ-4512:
-----------------------------

    Description: 
There is a race condition in MemoryUsage that can leave it in an inconsistent state and thereby hang any clients in waitForSpace().

The core issue is in the following block of code:
{code:java}
    public void decreaseUsage(long value) {
        if (value == 0) {
            return;
        }
        int percentUsage;
        synchronized (usageMutex) {
            usage -= value;
            percentUsage = caclPercentUsage();
        }
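        // Race window: another thread can run this entire method here, so the
        // percentUsage computed above may already be stale by the time it is set below.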
        setPercentUsage(percentUsage);
        if (parent != null) {
            parent.decreaseUsage(value);
        }
    }
{code}

The bug occurs when multiple threads call increment/decrement at the same time. Since the field "usage" is protected by usageMutex, each writer/reader sees the correct and current value of usage and calculates the right value for percentUsage at that instant. "setPercentUsage" is also protected by the same usageMutex, so we re-synchronize on usageMutex to set the percentUsage field as well. The issue is that threads may enter the setPercentUsage synchronized block in a different order than they entered the percentUsage "calculating" block. Since percentUsage is carried between the two blocks, a reordering of threads can allow a stale percentUsage value to be set last.

Possible threading (imagine usage starts at 0 and limit is 100).
Thread #1 - usage += 150; percentUsage = 150;
Thread #1 - suspended before calling setPercentUsage
Thread #2 - usage -= 150; percentUsage = 0;
Thread #2 - setPercentUsage(0);
Thread #1 - resumed, can now call setPercentUsage
Thread #1 - setPercentUsage(150);

Final value = 150
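
For illustration, here is a minimal standalone sketch (not ActiveMQ code; class and field names are made up) that mimics the same two-block pattern and, given an unlucky interleaving, ends with usage back at 0 but percentUsage stuck at 150:
{code:java}
// Standalone sketch of the race: the percentage is computed under the lock but written
// back in a second, separate synchronized block, so a suspended thread can later
// overwrite a newer value. Names are illustrative only.
public class StalePercentUsageDemo {
    private static final Object usageMutex = new Object();
    private static final long limit = 100;
    private static long usage;
    private static int percentUsage;

    private static void changeUsage(long delta) {
        int percent;
        synchronized (usageMutex) {
            usage += delta;
            percent = (int) (usage * 100 / limit);   // stand-in for caclPercentUsage()
        }
        // window: the other thread may run both blocks here, as Thread #2 does in the trace
        synchronized (usageMutex) {
            percentUsage = percent;                  // stand-in for setPercentUsage(percent)
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Retry until the bad interleaving happens and leaves percentUsage > 100.
        while (percentUsage <= 100) {
            usage = 0;
            percentUsage = 0;
            Thread t1 = new Thread(() -> changeUsage(150));   // Thread #1 in the trace
            Thread t2 = new Thread(() -> changeUsage(-150));  // Thread #2 in the trace
            t1.start();
            t2.start();
            t1.join();
            t2.join();
        }
        System.out.println("usage=" + usage + " percentUsage=" + percentUsage);
    }
}
{code}
Run repeatedly, it only terminates once the stale write wins, printing usage=0 and percentUsage=150.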

This same pattern of synchronizing to calculate percentUsage and then setting the value later is repeated in all of the Usage objects I looked at. My guess is it was written this way to avoid holding locks while making calls out to "untrusted code", but it is a very dangerous way to do the calculations. The most surprising thing is that the locks are still explicitly held while calling fireEvent anyway.
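
One possible shape for a fix is sketched below (illustrative only, not a patch; it assumes the existing percentUsage field and a fireEvent(oldPercentUsage, newPercentUsage) style notification, and that waitForSpace() waits on usageMutex): compute and publish percentUsage while still holding usageMutex, so a stale value can never overwrite a newer one.
{code:java}
    // Sketch only: hold usageMutex across both the usage update and the percentUsage
    // write so the published value always reflects the latest usage.
    public void decreaseUsage(long value) {
        if (value == 0) {
            return;
        }
        int oldPercentUsage;
        int newPercentUsage;
        synchronized (usageMutex) {
            usage -= value;
            oldPercentUsage = percentUsage;
            newPercentUsage = caclPercentUsage();
            percentUsage = newPercentUsage;   // published while the lock is still held
            usageMutex.notifyAll();           // wake waitForSpace() waiters (assumes they wait on usageMutex)
        }
        if (oldPercentUsage != newPercentUsage) {
            fireEvent(oldPercentUsage, newPercentUsage);   // listeners notified after the state change, outside the lock
        }
        if (parent != null) {
            parent.decreaseUsage(value);
        }
    }
{code}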

I have attached two screenshots, taken with a debugger, of two threads that have both been stalled for multiple minutes in "waitForSpace" trying to publish to the same queue. Notice they both have a "usage" of 0 but a "percentUsage" > 100, which should be impossible. To get the system into this state I was using JmsTemplate and a CachingConnectionFactory to publish on 8 threads, with a single DefaultMessageListenerContainer pulling the messages off as fast as possible. The test publishes 100000 measurements, and roughly 75% of the time at least a few producers end up stalled in waitForSpace() even though the queue is ready for more messages. I can also reproduce these results using JmsTemplate and PooledConnectionFactory, so I don't believe it's an issue in the pooling implementations.
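
For reference, the reproduction setup has roughly the following shape (an illustrative sketch, not the exact test code; the broker URL, queue name and payload are placeholders):
{code:java}
import javax.jms.ConnectionFactory;
import javax.jms.MessageListener;
import org.apache.activemq.ActiveMQConnectionFactory;
import org.springframework.jms.connection.CachingConnectionFactory;
import org.springframework.jms.core.JmsTemplate;
import org.springframework.jms.listener.DefaultMessageListenerContainer;

public class WaitForSpaceRepro {
    public static void main(String[] args) throws Exception {
        ConnectionFactory cf =
            new CachingConnectionFactory(new ActiveMQConnectionFactory("tcp://localhost:61616"));

        // Single consumer draining the queue as fast as possible.
        DefaultMessageListenerContainer listener = new DefaultMessageListenerContainer();
        listener.setConnectionFactory(cf);
        listener.setDestinationName("test.measurements");
        listener.setMessageListener((MessageListener) message -> { /* discard */ });
        listener.afterPropertiesSet();
        listener.start();

        // 8 concurrent publishers pushing 100000 messages total through JmsTemplate.
        JmsTemplate template = new JmsTemplate(cf);
        Thread[] producers = new Thread[8];
        for (int i = 0; i < producers.length; i++) {
            producers[i] = new Thread(() -> {
                for (int n = 0; n < 100000 / 8; n++) {
                    template.convertAndSend("test.measurements", "measurement-" + n);
                }
            });
            producers[i].start();
        }
        for (Thread p : producers) {
            p.join();   // with the race present, some producers hang here in waitForSpace()
        }
        listener.stop();
        System.out.println("all producers finished");
    }
}
{code}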

  was:
There is a race condition in MemoryUsage that can leave it in an inconsistent state and thereby hang any clients in waitForSpace().

The core issue is in the following block of code:
{code:java}
    public void decreaseUsage(long value) {
        if (value == 0) {
            return;
        }
        int percentUsage;
        synchronized (usageMutex) {
            usage -= value;
            percentUsage = caclPercentUsage();
        }
        setPercentUsage(percentUsage);
        if (parent != null) {
            parent.decreaseUsage(value);
        }
    }
{code}

The bug occurs when multiple threads call increment/decrement at the same time. Since the field "usage" is protected by usageMutex, each writer/reader sees the correct and current value of usage and calculates the right value for percentUsage at that instant. "setPercentUsage" is also protected by the same usageMutex, so we re-synchronize on usageMutex to set the percentUsage field as well. The issue is that threads may enter the setPercentUsage synchronized block in a different order than they entered the percentUsage "calculating" block. Since percentUsage is carried between the two blocks, a reordering of threads can allow a stale percentUsage value to be set last.

Possible threading (imagine usage starts at 0 and limit is 100).
Thread #1 - usage += 150; percentUsage = 150;
Thread #1 - suspended before calling setPercentUsage
Thread #2 - usage -= 150; percentUsage = 0;
Thread #2 - setPercentUsage(0);
Thread #1 - resumed, can now call setPercentUsage
Thread #1 - setPercentUsage(150);

Final value = 150

This same pattern of synchronizing to calculate percentUsage and then setting the value later is repeated in all of the Usage objects I looked at. My guess is it was written this way to avoid holding locks while making calls out to "untrusted code", but it is a very dangerous way to do the calculations. The most disappointing thing is that the locks are still explicitly held while calling fireEvent anyway.

I will be attaching more details once the ticket is created.

    
> MemoryUsage left in inconsistent state
> --------------------------------------
>
>                 Key: AMQ-4512
>                 URL: https://issues.apache.org/jira/browse/AMQ-4512
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 5.6.0, 5.7.0
>            Reporter: Sam hendley
>         Attachments: MemUsage1.PNG, MemUsage2.PNG
>
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
