zookeeper-user mailing list archives

From Mike Solomon <ms...@dropbox.com>
Subject Re: outstandingChanges queue grows without bound
Date Mon, 24 Oct 2016 02:07:22 GMT
I've pulled this into a separate branch after incorporating some feedback.

https://github.com/msolo/zookeeper/commits/msolo-optimize-close-session

On Fri, Oct 14, 2016 at 12:03 AM, Mike Solomon <msolo@dropbox.com> wrote:
> Thanks for the comments - I'll incorporate them in a future fix. There
> is actually a flaw in this code as it's currently implemented - it
> does not match the original behavior and I need to think more
> carefully.
>
> Arshad, I think ZOOKEEPER-2570 is a somewhat different issue.  The
> root cause in both cases is that the ProcessRequestThread is
> overloaded, but large multi-op transactions are probably a degenerate
> case.
>
> On Thu, Oct 13, 2016 at 1:12 PM, Edward Ribeiro
> <edward.ribeiro@gmail.com> wrote:
>> Very interesting patch, Mike.
>>
>> I've left a couple of review comments (hope you don't mind) in the
>> https://github.com/msolo/zookeeper/commit/75da352d506c2e3b0001d28acc058c422b3c8f0c
>> commit. :)
>>
>> Cheers,
>> Eddie
>>
>>
>> On Thu, Oct 13, 2016 at 4:06 PM, Arshad Mohammad <arshad.mohammad.k@gmail.com> wrote:
>>
>>> Hi Mike
>>> I also faced the same issue. There is a test patch in ZOOKEEPER-2570 which
>>> can be used to quickly check the performance gains of each modification.
>>> I hope it is useful.
>>>
>>> -Arshad
>>>
>>> On Thu, Oct 13, 2016 at 1:27 AM, Mike Solomon <msolo@dropbox.com> wrote:
>>>
>>> > I've been performance testing 3.5.2 and hit an interesting unavailability
>>> > issue.
>>> >
>>> > When the server is very busy (64k connections, 16k writes per
>>> > second) the leader can get busy enough that connections get throttled.
>>> > Enough throttling causes sessions to expire. As sessions expire, the
>>> > CPU consumption rises and the quorum is effectively unavailable.
>>> > Interestingly, if you shut down all the clients, the quorum won't heal
>>> > for nearly 10 minutes.
>>> >
>>> > The issue is that the outstandingChanges queue has 250k items in it
>>> > and the closeSession code scans this linearly under a lock. Replacing
>>> > the linear scan with a hash table lookup improves this, but likely the
>>> > real solution is some backpressure on clients as a result of an
>>> > oversized outstandingChanges queue.
>>> >
>>> > Here is a sample fix:
>>> > https://github.com/msolo/zookeeper/commit/75da352d506c2e3b0001d28acc058c422b3c8f0c
>>> >
>>> > This results in the quorum healing about 30 seconds after the clients
>>> > disconnect.
>>> >
>>> > Is there a way to prevent runaway growth in this queue? I'm wondering
>>> > if changing the definition of "throttling" to take into account the
>>> > size of this queue might help mitigate this. The end goal is that some
>>> > stable amount of traffic is reached asymptotically without suffering a
>>> > collapse.
>>> >
>>> > Thanks,
>>> > -Mike
>>> >
>>>
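
Illustration of the idea in the original message (replacing a linear scan of
a large pending-change queue with a hash table lookup). This is a minimal
sketch, not the code from the linked commit and not ZooKeeper's actual data
structures: the class PendingChangeIndex, its Change type, and all method
names are hypothetical. It assumes each queued change carries the id of the
session that owns it, and it keeps a per-session index alongside the FIFO
queue so that a close-session lookup does not have to walk every entry.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a FIFO of pending changes plus a per-session index.
final class PendingChangeIndex {
    /** Hypothetical stand-in for a queued change: a path and its owning session. */
    static final class Change {
        final long ownerSessionId;
        final String path;

        Change(long ownerSessionId, String path) {
            this.ownerSessionId = ownerSessionId;
            this.path = path;
        }
    }

    private final ArrayDeque<Change> queue = new ArrayDeque<>();
    // sessionId -> (path -> number of queued changes for that path)
    private final Map<Long, Map<String, Integer>> pathsByOwner = new HashMap<>();

    /** Appends a change and records it in the per-session index. */
    synchronized void add(Change c) {
        queue.addLast(c);
        pathsByOwner.computeIfAbsent(c.ownerSessionId, k -> new HashMap<>())
                    .merge(c.path, 1, Integer::sum);
    }

    /** Removes the oldest change and keeps the index consistent with the queue. */
    synchronized Change poll() {
        Change c = queue.pollFirst();
        if (c == null) {
            return null;
        }
        Map<String, Integer> counts = pathsByOwner.get(c.ownerSessionId);
        if (counts != null) {
            // Decrement the count; returning null drops the path entry.
            counts.computeIfPresent(c.path, (p, n) -> n > 1 ? n - 1 : null);
            if (counts.isEmpty()) {
                pathsByOwner.remove(c.ownerSessionId);
            }
        }
        return c;
    }

    /** Indexed lookup: cost depends on this session's entries, not the queue length. */
    synchronized Set<String> pathsOwnedBy(long sessionId) {
        Map<String, Integer> counts = pathsByOwner.get(sessionId);
        return counts == null ? new HashSet<>() : new HashSet<>(counts.keySet());
    }

    /** Baseline for comparison: walks the whole queue while holding the lock. */
    synchronized Set<String> pathsOwnedByLinear(long sessionId) {
        Set<String> result = new HashSet<>();
        for (Change c : queue) {
            if (c.ownerSessionId == sessionId) {
                result.add(c.path);
            }
        }
        return result;
    }
}
```

With a queue of roughly 250k entries, as reported above, the linear variant
touches every entry for each closing session, while the indexed variant only
touches the entries that belong to that session. The trade-off is extra
bookkeeping on every add and poll, and the index does nothing to bound the
queue itself, so the backpressure question raised in the message still
stands.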
