Return-Path: X-Original-To: apmail-incubator-kafka-dev-archive@minotaur.apache.org Delivered-To: apmail-incubator-kafka-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id DFDA49143 for ; Thu, 6 Sep 2012 18:11:07 +0000 (UTC) Received: (qmail 46129 invoked by uid 500); 6 Sep 2012 18:11:07 -0000 Delivered-To: apmail-incubator-kafka-dev-archive@incubator.apache.org Received: (qmail 46096 invoked by uid 500); 6 Sep 2012 18:11:07 -0000 Mailing-List: contact kafka-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: kafka-dev@incubator.apache.org Delivered-To: mailing list kafka-dev@incubator.apache.org Received: (qmail 46084 invoked by uid 99); 6 Sep 2012 18:11:07 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 06 Sep 2012 18:11:07 +0000 Date: Fri, 7 Sep 2012 05:11:07 +1100 (NCT) From: "Neha Narkhede (JIRA)" To: kafka-dev@incubator.apache.org Message-ID: <399541382.45689.1346955067761.JavaMail.jiratomcat@arcas> In-Reply-To: <1931213865.45186.1346950388421.JavaMail.jiratomcat@arcas> Subject: [jira] [Updated] (KAFKA-498) Controller code has race conditions and synchronization bugs MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/KAFKA-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-498: -------------------------------- Attachment: kafka-498-v1.patch 1. Refactoring of current procedural controller code to functional style 1.1. Using switch case statements instead of if-else 1.2. Avoid explicit return statements from try-catch blocks 1.3. Use Option instead of null. This is very important, there were several places that handled only the non-null case, basically making way for NPE's in error cases. 1.4. Rename variables in all capital letters to camel case 1.5. Fixed logging since a lot of log statements were at info and were somewhat unclear 1.6. Fixed indentation, some places uses 4 spaces, others used 2 spaces. 1.7. ZkUtils.scala 1.7.1 Refactored readDataMaybeNull to return an Option instead of null. This allows all usages of that API that query an ephemeral path to handle the case when the value in zookeeper does not exist anymore 1.7.2. Refactored getBrokerInfoFromIds to handle only a single broker id. The reason is that most usages of that API used it to query for a single broker id. 1.8 Many places in the code did not wrap the lines correctly. Fixed this as much as I could. 2. KafkaController.scala 2.1/ Renamed deliverLeaderAndISRFromZookeeper to readLeaderAndIsrFromZookeeper 2.2 There is a race condition in KafkaController where it doesn't synchronize access to the controller's data structures while creating a new session. Basically, controllerRegisterOrFailover is a private API that modifies almost all the internal controller data structures that require synchronization. Since this API is not synchronizing on the controller lock, all usages of this API need to do this correctly. Fixed handleNewSession to synchronize the controllerRegisterOrFailover API, since that can be concurrently executed with the startup procedure. 2.3 In onBrokerChange, defaulting to empty list instead of null. This lets us avoid null checks 2.4 Refactored onBrokerChange() API and moved the leader election logic to a separate API. After starting on this path, I figured that this code is going to need a major refactoring that I'd like to fix in a separate patch. Filed KAFKA-499 to cover that. For now, we can keep this although it will look complete only after KAFKA-499 is in. 2.5 onBrokerChange() and initLeaders() APIs are very similar and duplicate quite a lot of code. I refactored onBrokerChange() but realized later that initLeaders would have to refactored too. To keep the changes in this patch small enough, I will fix this as part of KAFKA-499. 3. ControllerChannelManager & KafkaController 3.1 There are 3 types of information that the controller maintains per broker - request thread, message queue and socket channel. Currently, they are maintained in 3 separate variables and we need to ensure all 3 are synchronized correctly. To simplify this, I created a case class to wrap this state in one object. 3.2 There is a lock object whose purpose is to synchronize access to the broker cache. Currently, the lock doesn't seem to protect all access to these caches, which looks like a synchronization bug. Fixed this to have startup API synchronize access to the broker cache. 3.3. The addBroker logic was duplicated in 2 places. Refactored the constructors to add an auxilary constructor to call the addBroker API that handles creating controller state info for a new broker and the appropriate synchronization. Currently, the constructor duplicates code to add a new broker, probably since it is not right to invoke an API from inside the primary constructor 3.4 There are 2 ways to remove a broker from the controller's broker cache - during shutdown of the channel manager or when a broker change listener fires. Since removeBroker is a public API, it is synchronized using the brokerLock. The shutdown API calls removeBroker internally and ends up acquiring and releasing the lock multiple times. Ideally, it is sufficient to acquire the brokerLock just once for the entire shutdown API. Refactored to move the common removeBroker logic to a private API that doesn't synchronize on the brokerLock. Changed removeBroker to acquire lock and call removeExistingBroker, changed shutdown to acquire lock once and call removeExistingBroker. 4. Renamed LeaderAndISR to LeaderAndIsr 5. Renamed BrokerNotExistException to BrokerNotAvailableException to remain consistent with other exceptions of the same type (LeaderNotAvailableException, ReplicaNotAvailableException) NOTE: Apache svn seems to hang at the time of uploading this patch. It will apply cleanly on revision 1380945 > Controller code has race conditions and synchronization bugs > ------------------------------------------------------------ > > Key: KAFKA-498 > URL: https://issues.apache.org/jira/browse/KAFKA-498 > Project: Kafka > Issue Type: Bug > Affects Versions: 0.8 > Reporter: Neha Narkhede > Assignee: Neha Narkhede > Labels: bugs > Attachments: kafka-498-v1.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > The controller maintains some internal data structures that are updated by state changes triggered by zookeeper listeners. There are race conditions in the controller channel manager and the controller state machine. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira