Date: Tue, 1 Nov 2016 22:25:58 +0000 (UTC)
From: "Guozhang Wang (JIRA)"
To: dev@kafka.apache.org
Subject: [jira] [Commented] (KAFKA-4360) Controller may deadLock when autoLeaderRebalance encounter zk expired

    [ https://issues.apache.org/jira/browse/KAFKA-4360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15626869#comment-15626869 ]

Guozhang Wang commented on KAFKA-4360:
--------------------------------------

Thanks for the find [~Json Tu].
And I agree with Jiangjie that we could consider moving {{onControllerResignation}} out of the lock itself.

> Controller may deadLock when autoLeaderRebalance encounter zk expired
> ---------------------------------------------------------------------
>
>                 Key: KAFKA-4360
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4360
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>            Reporter: Json Tu
>              Labels: bugfix
>         Attachments: deadlock_patch, yf-mafka2-common02_jstack.txt
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When the controller has a checkAndTriggerPartitionRebalance task in autoRebalanceScheduler and the zk session expires at that moment, the controller runs into a deadlock.
> The scenario can be reproduced as follows: when the zk session expires, the zk thread calls handleNewSession, which is defined in SessionExpirationListener, and acquires controllerContext.controllerLock. It then calls autoRebalanceScheduler.shutdown(), which has to wait for all tasks in autoRebalanceScheduler to complete. But the thread pool's task also needs controllerContext.controllerLock, which is already owned by the zk callback thread, so the two threads deadlock.
> This causes at least two problems. First, the broker's id cannot be registered in ZooKeeper, so the new controller considers the broker dead. Second, the process cannot be stopped by kafka-server-stop.sh, because the shutdown function cannot get controllerContext.controllerLock either; Kafka can only be shut down with kill -9.
> In my attachment I uploaded a jstack file, which was created when my Kafka process could not be shut down by kafka-server-stop.sh.
> I have hit this situation several times; I think this may be an unresolved bug in Kafka.
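
For readers following the thread, the lock ordering described in the report can be sketched in a few lines of plain Java. This is only an illustration under assumptions, not the controller code: the class ControllerLockSketch and the method resign are hypothetical names, a ReentrantLock stands in for controllerContext.controllerLock, a single-thread executor stands in for autoRebalanceScheduler, and a short timeout replaces the long wait so the sketch returns instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in; none of these names are taken from the Kafka code base.
class ControllerLockSketch {

    /**
     * Simulates a resignation racing with a pending rebalance task.
     * Returns true if the scheduler terminated within timeoutMs.
     * shutdownInsideLock = true mimics the ordering described in the report.
     */
    static boolean resign(boolean shutdownInsideLock, long timeoutMs) {
        ReentrantLock controllerLock = new ReentrantLock();              // plays controllerLock
        ExecutorService scheduler = Executors.newSingleThreadExecutor(); // plays autoRebalanceScheduler
        CountDownLatch taskStarted = new CountDownLatch(1);

        // The calling thread plays the zk callback thread and takes the lock first.
        controllerLock.lock();
        boolean locked = true;
        try {
            // The rebalance task needs the same lock before it can do any work.
            scheduler.execute(() -> {
                taskStarted.countDown();
                controllerLock.lock();
                try {
                    // rebalance work would go here
                } finally {
                    controllerLock.unlock();
                }
            });
            taskStarted.await();

            if (!shutdownInsideLock) {
                // Proposed ordering: release the lock before waiting on the scheduler.
                controllerLock.unlock();
                locked = false;
            }
            scheduler.shutdown();
            // With the lock still held, the task can never finish, so this times out.
            return scheduler.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            if (locked) {
                controllerLock.unlock(); // lets the blocked task finish so the sketch exits
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("shutdown inside lock terminated:  " + resign(true, 200));
        System.out.println("shutdown outside lock terminated: " + resign(false, 2000));
    }
}
```

With the shutdown inside the lock, the wait can only end by timeout; with the lock released first, the pending task runs and the scheduler terminates. That is the effect moving the resignation work out of the lock is meant to achieve, though the real change would also have to keep the controller state consistent once the lock is dropped.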