Return-Path: X-Original-To: apmail-helix-user-archive@minotaur.apache.org Delivered-To: apmail-helix-user-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 5372D10C18 for ; Mon, 16 Feb 2015 18:28:41 +0000 (UTC) Received: (qmail 99201 invoked by uid 500); 16 Feb 2015 18:28:41 -0000 Delivered-To: apmail-helix-user-archive@helix.apache.org Received: (qmail 99156 invoked by uid 500); 16 Feb 2015 18:28:41 -0000 Mailing-List: contact user-help@helix.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@helix.apache.org Delivered-To: mailing list user@helix.apache.org Received: (qmail 99132 invoked by uid 99); 16 Feb 2015 18:28:41 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 16 Feb 2015 18:28:40 +0000 X-ASF-Spam-Status: No, hits=2.5 required=5.0 tests=FREEMAIL_REPLY,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of nehzgnahz@gmail.com designates 209.85.223.179 as permitted sender) Received: from [209.85.223.179] (HELO mail-ie0-f179.google.com) (209.85.223.179) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 16 Feb 2015 18:28:37 +0000 Received: by ierx19 with SMTP id x19so35789424ier.3; Mon, 16 Feb 2015 10:28:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=rSzlfyzeR0mzP2oW3jOqzi5b/9W1+zOR8Q6WcYfW8es=; b=WaRoLBee5uefRhzMS0Clgy5P5PKU2LVzMueHrtyG+JeJtb3vgbVP2tiJDA6rM1Rrkf 9ZskYdKcmFFKwDIiUK/e/RqW8yHk3xCkCW0oEgKHX7z/QjHAWs7EC5LHzpzusyXR5Qp1 WwYQrOY+9f73O+PE5gvzHRXhOTpPdBj1x4gKEInHBfAcTrnkhYjXJ+CliqPmGi+p2mu4 zX20nW3vV7y8a8UsppWhK0hsZuOc7Tr76XZM3uy5/zglrTXNLE4yiGpbBYrL4dLF1aH+ bdWQZ4IfrCfNs+PGtnSGliE66bQZdDWF2kCH2wtrdE7m4f+Ob8eIBZzWZkwvFvEofVxH 1T2Q== MIME-Version: 1.0 X-Received: by 10.107.12.167 with SMTP id 39mr9905964iom.71.1424111296528; Mon, 16 Feb 2015 10:28:16 -0800 (PST) Received: by 10.43.157.72 with HTTP; Mon, 16 Feb 2015 10:28:16 -0800 (PST) In-Reply-To: References: Date: Mon, 16 Feb 2015 10:28:16 -0800 Message-ID: Subject: Re: NPE during start up From: Zhen Zhang To: "user@helix.apache.org" Cc: "dev@helix.apache.org" Content-Type: multipart/alternative; boundary=001a113fb866f347a2050f38bfed X-Virus-Checked: Checked by ClamAV on apache.org --001a113fb866f347a2050f38bfed Content-Type: text/plain; charset=UTF-8 There is a timing issue in ZkHelixParticipant#setupMsgHandler(). We should hook up ZK callback (line 347 in https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/manager/zk/ZkHelixParticipant.java) after all message handler registrations are done (line 354 in https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/manager/zk/ZkHelixParticipant.java). Fix is to move adding ZK callback to the end. Will add a test case that can reliably reproduce this issue. Thanks, Zhen On Sun, Feb 15, 2015 at 11:45 PM, Zhen Zhang wrote: > might be some race conditions. need to double check this. > On Feb 15, 2015 11:38 PM, "Steph Meslin-Weber" > wrote: > >> Hi Kishore, >> >> That's right, the node doesn't process any state transitions. They should >> have been logged in the first set of logs had they occurred. >> >> Thanks, >> Steph >> On 16 Feb 2015 07:28, "kishore g" wrote: >> >>> Hi Steph, >>> >>> When the NPE occurs, do you get the state transition callbacks? >>> >>> thanks, >>> Kishore G >>> >>> >>> >>> On Sun, Feb 15, 2015 at 11:23 PM, Steph Meslin-Weber < >>> steph@tangency.co.uk> wrote: >>> >>>> Unfortunately it appears that when the NPE occurs, dropping the >>>> participant no longer cleans up the related INSTANCE node. Perhaps some >>>> state is lost? >>>> >>>> Thanks, >>>> Steph >>>> On 16 Feb 2015 06:52, "Zhen Zhang" wrote: >>>> >>>>> I think the NPE is not fatal. It happens when no message handler >>>>> factory is registered for this message type. The message will not be >>>>> removed and remain in UNREAD state. Later when the message handler factory >>>>> is registered via: >>>>> DefaultMessagingService#registerMessageHandlerFactory, we will send a >>>>> NOP message, which will in turn trigger HelixTaskExecutor to process all >>>>> UNREAD messages. We should definitely fix this by logging a warning message >>>>> instead of throwing an NPE. >>>>> >>>>> Thanks, >>>>> Jason >>>>> >>>>> >>>>> On Sun, Feb 15, 2015 at 7:30 PM, kishore g >>>>> wrote: >>>>> >>>>>> Controller assuming the state transition occurred is even more >>>>>> dangerous. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Sun, Feb 15, 2015 at 7:18 PM, vlad.gm@gmail.com >>>>> > wrote: >>>>>> >>>>>>> In my experience it was fatal. The callback would jot be called but >>>>>>> the >>>>>>> controller would somehow assume the state transition occurred. >>>>>>> On Feb 15, 2015 7:13 PM, "kishore g" wrote: >>>>>>> >>>>>>> > Thanks Vlad. That explains the problem. That also explains how >>>>>>> adding >>>>>>> > sleep of 3seconds work. >>>>>>> > >>>>>>> > Jason, is this exception fatal?. Will the message be processed >>>>>>> again after >>>>>>> > the handler is added. >>>>>>> > >>>>>>> > thanks, >>>>>>> > Kishore G >>>>>>> > >>>>>>> > On Sun, Feb 15, 2015 at 6:41 PM, vlad.gm@gmail.com < >>>>>>> vlad.gm@gmail.com> >>>>>>> > wrote: >>>>>>> > >>>>>>> >> https://issues.apache.org/jira/browse/HELIX-548 >>>>>>> >> On Feb 15, 2015 6:38 PM, "kishore g" wrote: >>>>>>> >> >>>>>>> >> > Hi Vlad, >>>>>>> >> > >>>>>>> >> > Was there any jira associated with it? >>>>>>> >> > >>>>>>> >> > thanks. >>>>>>> >> > Kishore G >>>>>>> >> > >>>>>>> >> > On Sun, Feb 15, 2015 at 4:36 PM, vlad.gm@gmail.com < >>>>>>> vlad.gm@gmail.com> >>>>>>> >> > wrote: >>>>>>> >> > >>>>>>> >> >> Looks like the same problem we encountered recently. >>>>>>> >> >> >>>>>>> >> >> Regards, >>>>>>> >> >> Vlad >>>>>>> >> >> On Feb 15, 2015 4:35 PM, "kishore g" >>>>>>> wrote: >>>>>>> >> >> >>>>>>> >> >> > Steph described this problem on IRC. >>>>>>> >> >> > >>>>>>> >> >> > He is using 0.7.1. On connecting to cluster he gets this NPE >>>>>>> >> >> > >>>>>>> >> >> > http://pastebin.com/YE3fwK5i >>>>>>> >> >> > >>>>>>> >> >> > java.lang.NullPointerException >>>>>>> >> >> > at >>>>>>> >> >> > >>>>>>> >> >> >>>>>>> >> >>>>>>> org.apache.helix.messaging.handling.HelixTaskExecutor.createMessageHandler(HelixTaskExecutor.java:661) >>>>>>> >> >> > at >>>>>>> >> >> > >>>>>>> >> >> >>>>>>> >> >>>>>>> org.apache.helix.messaging.handling.HelixTaskExecutor.onMessage(HelixTaskExecutor.java:581) >>>>>>> >> >> > at >>>>>>> >> >> > >>>>>>> >> >> >>>>>>> >> >>>>>>> org.apache.helix.manager.zk.ZkCallbackHandler.invoke(ZkCallbackHandler.java:202) >>>>>>> >> >> > at >>>>>>> >> >> > >>>>>>> >> >> >>>>>>> >> >>>>>>> org.apache.helix.manager.zk.ZkCallbackHandler.init(ZkCallbackHandler.java:336) >>>>>>> >> >> > at >>>>>>> >> >> > >>>>>>> >> >> >>>>>>> >> >>>>>>> org.apache.helix.manager.zk.ZkCallbackHandler.(ZkCallbackHandler.java:130) >>>>>>> >> >> > at >>>>>>> >> >> > >>>>>>> >> >> >>>>>>> >> >>>>>>> org.apache.helix.manager.zk.ZkHelixConnection.addListener(ZkHelixConnection.java:533) >>>>>>> >> >> > at >>>>>>> >> >> > >>>>>>> >> >> >>>>>>> >> >>>>>>> org.apache.helix.manager.zk.ZkHelixConnection.addMessageListener(ZkHelixConnection.java:267) >>>>>>> >> >> > at >>>>>>> >> >> > >>>>>>> >> >> >>>>>>> >> >>>>>>> org.apache.helix.manager.zk.ZkHelixParticipant.setupMsgHandler(ZkHelixParticipant.java:347) >>>>>>> >> >> > at >>>>>>> >> >> > >>>>>>> >> >> >>>>>>> >> >>>>>>> org.apache.helix.manager.zk.ZkHelixParticipant.init(ZkHelixParticipant.java:383) >>>>>>> >> >> > at >>>>>>> >> >> > >>>>>>> >> >> >>>>>>> >> >>>>>>> org.apache.helix.manager.zk.ZkHelixParticipant.onConnected(ZkHelixParticipant.java:401) >>>>>>> >> >> > at >>>>>>> >> >> > >>>>>>> >> >> >>>>>>> >> >>>>>>> org.apache.helix.manager.zk.ZkHelixParticipant.start(ZkHelixParticipant.java:428) >>>>>>> >> >> > at >>>>>>> >> >> > >>>>>>> >> >> >>>>>>> >> >>>>>>> com.example.ProtostuffServerNode.spinUpParticipant(ProtostuffServerNode.java:134) >>>>>>> >> >> > >>>>>>> >> >> > >>>>>>> >> >> > Here is his connection code. >>>>>>> >> >> > >>>>>>> >> >> > http://pastebin.com/QRfVU1tc >>>>>>> >> >> > >>>>>>> >> >> > private static HelixParticipant spinUpParticipant(HelixAdmin >>>>>>> admin, >>>>>>> >> >> > ParticipantId participantId) { >>>>>>> >> >> > LOGGER.info("Starting up "+participantId); >>>>>>> >> >> > HelixConnection connection = new >>>>>>> ZkHelixConnection( >>>>>>> >> >> > ZK_ADDRESS); >>>>>>> >> >> > connection.connect(); >>>>>>> >> >> > HelixParticipant participant = connection. >>>>>>> >> >> > createParticipant(CLUSTER_ID, participantId); >>>>>>> >> >> > StateMachineEngine stateMach = participant. >>>>>>> >> >> > getStateMachineEngine(); >>>>>>> >> >> > >>>>>>> >> >> > >>>>>>> StateTransitionHandlerFactory >>>>>>> >> >> > transitionHandlerFactory = new OnlineOfflineHandlerFactory(); >>>>>>> >> >> > >>>>>>> stateMach.registerStateModelFactory(STATE_MODEL_NAME, >>>>>>> >> >> > transitionHandlerFactory); >>>>>>> >> >> > participant.start(); >>>>>>> >> >> > >>>>>>> >> >> > admin.enableInstance(CLUSTER_NAME, >>>>>>> >> >> participantId.toString( >>>>>>> >> >> > ), true); >>>>>>> >> >> > >>>>>>> >> >> > return participant; >>>>>>> >> >> > } >>>>>>> >> >> > >>>>>>> >> >> > Adding 3s sleep after registerStateModelFactory works. Any >>>>>>> idea what >>>>>>> >> is >>>>>>> >> >> > happening. >>>>>>> >> >> > >>>>>>> >> >> > thanks, >>>>>>> >> >> > Kishore G >>>>>>> >> >> > >>>>>>> >> >> > >>>>>>> >> >> > >>>>>>> >> >> > >>>>>>> >> >> >>>>>>> >> > >>>>>>> >> > >>>>>>> >> >>>>>>> > >>>>>>> > >>>>>>> >>>>>> >>>>>> >>>>> >>> --001a113fb866f347a2050f38bfed Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
There is a timing issue in ZkHelixParticipant#setupMsgHand= ler(). We should hook up ZK callback (line 347 in https://github.com/apache/helix/blob/maste= r/helix-core/src/main/java/org/apache/helix/manager/zk/ZkHelixParticipant.j= ava) after all message handler registrations are done (line 354 in https://github.com/ap= ache/helix/blob/master/helix-core/src/main/java/org/apache/helix/manager/zk= /ZkHelixParticipant.java). Fix is to move adding ZK callback to the end= . Will add a test case that can reliably reproduce this issue.

Thanks,
Zhen


On Sun, Feb 15, 2015 at 11:45 PM, Zh= en Zhang <nehzgnahz@gmail.com> wrote:

might be some race conditions. need to double = check this.

On Feb 15, 2015 11:38 PM, "Steph Meslin-Web= er" <step= h@tangency.co.uk> wrote:

Hi Kishore,

That's right, the node doesn't process any state tra= nsitions. They should have been logged in the first set of logs had they oc= curred.

Thanks,
Steph

On 16 Feb 2015 07:28, "kishore g" <= g.kishore@gmail.co= m> wrote:
Hi Steph,

When the NPE occurs, do you get = the state transition callbacks?

thanks,
= Kishore G



On Sun, Feb 15, 2015 at 11:23 PM, Steph M= eslin-Weber <steph@tangency.co.uk> wrote:

Unfortunately it appears that when the NP= E occurs,=C2=A0 dropping the participant no longer cleans up the related IN= STANCE node. Perhaps some state is lost?

Thanks,
Steph

On 16 Feb 2015 06:52, "Zhen Zhang" <= ;nehzgnahz@gmail.c= om> wrote:
I think the NPE is not fatal. It happens when no message han= dler factory is registered for this message type. The message will not be r= emoved and remain in UNREAD state. Later when the message handler factory i= s registered via:
DefaultMessagingService#registerMessageHandlerFactory= , we will send a NOP message, which will in turn trigger HelixTaskExecutor = to process all UNREAD messages. We should definitely fix this by logging a = warning message instead of throwing an NPE.

Thanks= ,
Jason


=
On Sun, Feb 15, 2015 at 7:30 PM, kishore g <g= .kishore@gmail.com> wrote:
=
Controller assuming the state transition occurred is = even more dangerous.





On Sun, Feb 15, 2015 at 7:18 PM, vlad.gm@gmail.com <vlad.gm@gmail.com> wrote:
In my experience it was fata= l. The callback would jot be called but the
controller would somehow assume the state transition occurred.
On Feb 15, 2015 7:13 PM, "kishore g" <g.kishore@gmail.com> wrot= e:

> Thanks Vlad. That explains the problem. That also explains how adding<= br> > sleep of 3seconds work.
>
> Jason, is this exception fatal?. Will the message be processed again a= fter
> the handler is added.
>
> thanks,
> Kishore G
>
> On Sun, Feb 15, 2015 at 6:41 PM, vlad.gm@gmail.com <vlad.gm@gmail.com>
> wrote:
>
>> https://issues.apache.org/jira/browse/HELIX-548
>> On Feb 15, 2015 6:38 PM, "kishore g" <g.kishore@gmail.com> wrote= :
>>
>> > Hi Vlad,
>> >
>> > Was there any jira associated with it?
>> >
>> > thanks.
>> > Kishore G
>> >
>> > On Sun, Feb 15, 2015 at 4:36 PM, vlad.gm@gmail.com <vlad.gm@gmail.com>
>> > wrote:
>> >
>> >> Looks like the same problem we encountered recently.
>> >>
>> >> Regards,
>> >> Vlad
>> >> On Feb 15, 2015 4:35 PM, "kishore g" <g.kishore@gmail.com&= gt; wrote:
>> >>
>> >> > Steph described this problem on IRC.
>> >> >
>> >> > He is using 0.7.1. On connecting to cluster he gets = this NPE
>> >> >
>> >> > http://pastebin.com/YE3fwK5i
>> >> >
>> >> > java.lang.NullPointerException
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0at
>> >> >
>> >>
>> org.apache.helix.messaging.handling.HelixTaskExecutor.createMessag= eHandler(HelixTaskExecutor.java:661)
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0at
>> >> >
>> >>
>> org.apache.helix.messaging.handling.HelixTaskExecutor.onMessage(He= lixTaskExecutor.java:581)
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0at
>> >> >
>> >>
>> org.apache.helix.manager.zk.ZkCallbackHandler.invoke(ZkCallbackHan= dler.java:202)
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0at
>> >> >
>> >>
>> org.apache.helix.manager.zk.ZkCallbackHandler.init(ZkCallbackHandl= er.java:336)
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0at
>> >> >
>> >>
>> org.apache.helix.manager.zk.ZkCallbackHandler.<init>(ZkCallb= ackHandler.java:130)
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0at
>> >> >
>> >>
>> org.apache.helix.manager.zk.ZkHelixConnection.addListener(ZkHelixC= onnection.java:533)
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0at
>> >> >
>> >>
>> org.apache.helix.manager.zk.ZkHelixConnection.addMessageListener(Z= kHelixConnection.java:267)
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0at
>> >> >
>> >>
>> org.apache.helix.manager.zk.ZkHelixParticipant.setupMsgHandler(ZkH= elixParticipant.java:347)
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0at
>> >> >
>> >>
>> org.apache.helix.manager.zk.ZkHelixParticipant.init(ZkHelixPartici= pant.java:383)
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0at
>> >> >
>> >>
>> org.apache.helix.manager.zk.ZkHelixParticipant.onConnected(ZkHelix= Participant.java:401)
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0at
>> >> >
>> >>
>> org.apache.helix.manager.zk.ZkHelixParticipant.start(ZkHelixPartic= ipant.java:428)
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0at
>> >> >
>> >>
>> com.example.ProtostuffServerNode.spinUpParticipant(ProtostuffServe= rNode.java:134)
>> >> >
>> >> >
>> >> > Here is his connection code.
>> >> >
>> >> > http://pastebin.com/QRfVU1tc
>> >> >
>> >> > private static HelixParticipant spinUpParticipant(He= lixAdmin admin,
>> >> > ParticipantId participantId) {
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0LOGGER.info("Starting up "+participantId);
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0HelixConnection connection =3D new ZkHelixConnection(
>> >> > ZK_ADDRESS);
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0connection.connect();
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0HelixParticipant participant =3D connection.
>> >> > createParticipant(CLUSTER_ID, participantId);
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0StateMachineEngine stateMach =3D participant.
>> >> > getStateMachineEngine();
>> >> >
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0StateTransitionHandlerFactory<LocalTransitionHandler>
>> >> > transitionHandlerFactory =3D new OnlineOfflineHandle= rFactory();
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0stateMach.registerStateModelFactory(STATE_MODEL_NAME,
>> >> > transitionHandlerFactory);
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0participant.start();
>> >> >
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0admin.enableInstance(CLUSTER_NAME,
>> >> participantId.toString(
>> >> > ), true);
>> >> >
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0return participant;
>> >> >=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}
>> >> >
>> >> > Adding 3s sleep after registerStateModelFactory work= s. Any idea what
>> is
>> >> > happening.
>> >> >
>> >> > thanks,
>> >> > Kishore G
>> >> >
>> >> >
>> >> >
>> >> >
>> >>
>> >
>> >
>>
>
>




--001a113fb866f347a2050f38bfed--