helix-commits mailing list archives

From "Jeff Solanas (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HELIX-264) fix zkclient#close() bug
Date Thu, 03 Oct 2013 19:48:43 GMT

     [ https://issues.apache.org/jira/browse/HELIX-264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Solanas updated HELIX-264:
-------------------------------

    Sprint: Sprint #4 10/2 - 10/16

> fix zkclient#close() bug
> ------------------------
>
>                 Key: HELIX-264
>                 URL: https://issues.apache.org/jira/browse/HELIX-264
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Zhen Zhang
>            Assignee: Zhen Zhang
>            Priority: Critical
>
> When flapping is detected, we are running in the zkclient event thread context, so
> zkclient.close() is called from its own event thread. Here is ZkClient#close():
>     public void close() throws ZkInterruptedException {
>         if (_connection == null) {
>             return;
>         }
>         LOG.debug("Closing ZkClient...");
>         getEventLock().lock();
>         try {
>             setShutdownTrigger(true);
>             _eventThread.interrupt();
>             _eventThread.join(2000);
>             _connection.close();
>             _connection = null;
>         } catch (InterruptedException e) {
>             throw new ZkInterruptedException(e);
>         } finally {
>             getEventLock().unlock();
>         }
>         LOG.debug("Closing ZkClient...done");
>     }
> _eventThread.interrupt(); <-- sets the interrupt status of _eventThread, which is in
> fact the current thread.
> _eventThread.join(2000); <-- throws InterruptedException because the current thread
> has been interrupted.
> _connection.close(); <-- SKIPPED!!!
> So if flapping happens, we call ZkHelixManager#disconnectInternal(), which always
> interrupts ZkClient#_eventThread but never closes the zk connection. This is probably
> a zkclient bug; either way, we should never call zkclient.close() from its own event
> thread context.
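
The failure sequence described above can be reproduced without ZooKeeper. The following is a minimal sketch, not Helix or zkclient code: the class name SelfJoinDemo, the thread name, and the AtomicBoolean standing in for _connection.close() are all assumptions made for illustration. It only shows that a thread which interrupts itself and then joins on itself never reaches the statement after the join.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class; it mimics the interrupt/join/close order of ZkClient#close().
class SelfJoinDemo {

    /**
     * Runs the close()-like sequence on a worker thread that plays the role of
     * the event thread. Returns true only if the "connection close" step ran.
     */
    static boolean closeFromOwnThread() {
        AtomicBoolean connectionClosed = new AtomicBoolean(false);

        Thread eventThread = new Thread(() -> {
            Thread self = Thread.currentThread();
            try {
                self.interrupt();            // sets this thread's own interrupt status
                self.join(2000);             // throws InterruptedException right away
                connectionClosed.set(true);  // stands in for _connection.close()
            } catch (InterruptedException e) {
                // The close step above was skipped, as the issue describes.
            }
        }, "demo-event-thread");

        eventThread.start();
        try {
            eventThread.join();              // wait for the worker to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return connectionClosed.get();
    }

    public static void main(String[] args) {
        System.out.println("connection closed: " + closeFromOwnThread());
    }
}
```

Running main should print "connection closed: false", which matches the SKIPPED annotation above.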
> Fix steps:
> 1) Work around this bug
> 2) Add test cases for flapping detection
> 3) Explore having the controller detect flapping participants and disable them
> (possibly by querying zk-server JMX metrics)
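
One possible shape for the workaround in step 1 is to skip the interrupt and join when close() is invoked on the event thread itself, and to release the connection in a finally block. This is only a sketch under stated assumptions: GuardedCloseSketch and its fields and helper methods are hypothetical names, the AtomicBoolean stands in for the real connection, and this is not the change that was actually committed to Helix or zkclient.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a client that owns an event thread and a connection.
class GuardedCloseSketch {
    private final AtomicBoolean connectionClosed = new AtomicBoolean(false);
    private volatile boolean shutdownTriggered = false;
    private volatile Thread eventThread;

    void setEventThread(Thread t) { eventThread = t; }
    boolean isConnectionClosed() { return connectionClosed.get(); }
    boolean isShutdownTriggered() { return shutdownTriggered; }

    void close() {
        shutdownTriggered = true;
        Thread t = eventThread;
        try {
            // Only interrupt and join when called from a different thread.
            if (t != null && t != Thread.currentThread()) {
                t.interrupt();
                t.join(2000);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the caller's interrupt status
        } finally {
            connectionClosed.set(true);          // stands in for _connection.close()
        }
    }

    // Calls close() on the event thread itself, as in the flapping case.
    static boolean closeFromEventThread() {
        GuardedCloseSketch client = new GuardedCloseSketch();
        Thread t = new Thread(client::close, "demo-event-thread");
        client.setEventThread(t);
        t.start();
        awaitQuietly(t);
        return client.isConnectionClosed();
    }

    // Calls close() from the calling thread while the event thread is idle.
    static boolean closeFromOtherThread() {
        GuardedCloseSketch client = new GuardedCloseSketch();
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(60_000);            // idle until interrupted by close()
            } catch (InterruptedException e) {
                // exit on interrupt
            }
        }, "demo-event-thread");
        client.setEventThread(t);
        t.start();
        client.close();
        awaitQuietly(t);
        return client.isConnectionClosed();
    }

    private static void awaitQuietly(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch both closeFromEventThread() and closeFromOtherThread() return true, because the close step no longer depends on the join completing normally. Whether the event thread should still be signalled in some other way when it closes itself is left open here.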



--
This message was sent by Atlassian JIRA
(v6.1#6144)
