helix-commits mailing list archives

From "Zhen Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HELIX-264) fix zkclient#close() bug
Date Wed, 02 Oct 2013 18:23:44 GMT
Zhen Zhang created HELIX-264:

             Summary: fix zkclient#close() bug
                 Key: HELIX-264
                 URL: https://issues.apache.org/jira/browse/HELIX-264
             Project: Apache Helix
          Issue Type: Bug
            Reporter: Zhen Zhang
            Assignee: Zhen Zhang
            Priority: Critical

When flapping is detected, we are in the zkclient event thread context, so we end up calling
zkclient.close() from its own event thread. Here is ZkClient#close():

    public void close() throws ZkInterruptedException {
        if (_connection == null) {
            return;
        }
        LOG.debug("Closing ZkClient...");
        getEventLock().lock();
        try {
            setShutdownTrigger(true);
            _eventThread.interrupt();  // <-- sets the interrupt status of _eventThread,
                                       //     which is in fact the current thread
            _eventThread.join(2000);   // <-- throws InterruptedException because the
                                       //     current thread has been interrupted
            _connection.close();       // <-- SKIPPED!!!
            _connection = null;
        } catch (InterruptedException e) {
            throw new ZkInterruptedException(e);
        } finally {
            getEventLock().unlock();
        }
        LOG.debug("Closing ZkClient...done");
    }

So if flapping happens, we call ZkHelixManager#disconnectInternal(), which always interrupts
ZkClient#_eventThread but never disconnects the zk connection. This is probably a zkclient
bug: zkclient.close() should never be called from its own event thread context.
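
The failure mode can be reproduced without ZooKeeper. The following is a minimal sketch, not zkclient or Helix code: the class SelfCloseDemo and the step labels are hypothetical, and a plain worker thread plays the role of _eventThread. It interrupts itself, calls join(2000) on itself, and records which steps actually ran.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the zkclient event thread; illustrative only.
class SelfCloseDemo {

    // Runs close-like logic on a worker thread that plays the role of the
    // event thread, and returns the steps that were actually executed.
    static List<String> runCloseFromEventThread() {
        final List<String> steps = new ArrayList<>();
        Thread eventThread = new Thread(() -> {
            Thread me = Thread.currentThread();
            try {
                me.interrupt();                 // like _eventThread.interrupt() on ourselves
                steps.add("interrupt");
                me.join(2000);                  // interrupt flag is already set, so this throws
                steps.add("connection.close");  // stands in for _connection.close(); never reached
            } catch (InterruptedException e) {
                steps.add("InterruptedException");
            }
        });
        eventThread.start();
        try {
            eventThread.join();                 // wait for the worker; also makes 'steps' visible here
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(runCloseFromEventThread());
    }
}
```

The returned list holds "interrupt" followed by "InterruptedException" and has no "connection.close" entry, which matches the skipped _connection.close() described above.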

Fix steps:
1) work around this bug
2) add test cases for flapping detection
3) explore the possibility of having the controller detect flapping participants and disable
them (possibly by querying zk-server JMX metrics)
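
One possible shape for the workaround in step 1 is sketched below. This is an assumption about how such a workaround could look, not the change that was committed for this issue. SafeCloseSketch, FakeConnection and closeFromEventThread are hypothetical names, and the fake connection refuses to close while the interrupt flag is set only to make the hazard visible. The idea is to skip the self-join, clear the pending interrupt before closing the connection, and restore the interrupt status afterwards.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; names are illustrative, not zkclient or Helix API.
class SafeCloseSketch {

    // Stand-in for the zk connection. Assumption: close() fails if the
    // calling thread is currently flagged as interrupted.
    static final class FakeConnection {
        final AtomicBoolean closed = new AtomicBoolean(false);

        void close() throws InterruptedException {
            if (Thread.currentThread().isInterrupted()) {
                throw new InterruptedException("close() interrupted");
            }
            closed.set(true);
        }
    }

    // Close that tolerates being called from the event thread itself.
    static void close(Thread eventThread, FakeConnection connection) {
        boolean interrupted = false;
        try {
            eventThread.interrupt();
            if (eventThread != Thread.currentThread()) {
                eventThread.join(2000);   // only wait when a different thread is closing
            }
        } catch (InterruptedException e) {
            interrupted = true;
        } finally {
            // Clear any pending interrupt so the connection close is not cut short.
            interrupted |= Thread.interrupted();
            try {
                connection.close();
            } catch (InterruptedException e) {
                interrupted = true;
            }
            if (interrupted) {
                Thread.currentThread().interrupt();   // restore the status for callers
            }
        }
    }

    // Calls close() from the "event thread" and reports whether the
    // connection really ended up closed.
    static boolean closeFromEventThread() {
        FakeConnection connection = new FakeConnection();
        Thread eventThread = new Thread(() -> close(Thread.currentThread(), connection));
        eventThread.start();
        try {
            eventThread.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return connection.closed.get();
    }

    public static void main(String[] args) {
        System.out.println("closed = " + closeFromEventThread());
    }
}
```

Restoring the interrupt status at the end keeps the shutdown signal visible to the event loop, so the event thread can still exit after the connection has been closed.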
