helix-user mailing list archives

From Zhen Zhang <nehzgn...@gmail.com>
Subject Re: NPE during start up
Date Mon, 16 Feb 2015 18:28:16 GMT
There is a timing issue in ZkHelixParticipant#setupMsgHandler(). We should
hook up the ZK callback (line 347 in
https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/manager/zk/ZkHelixParticipant.java)
only after all message handler registrations are done (line 354 in
https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/manager/zk/ZkHelixParticipant.java).
The fix is to move the ZK callback registration to the end. I will add a test
case that can reliably reproduce this issue.
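
To illustrate the ordering problem, here is a minimal, self-contained sketch.
It is not the Helix code: Executor, Connection, and setup() are hypothetical
stand-ins, and the only behavior borrowed from the thread is that adding the
message listener immediately invokes the callback (as the stack trace shows
ZkCallbackHandler.<init> calling init, invoke, and onMessage). The sketch also
assumes the executor logs a warning and leaves the message unread when no
handler is registered, rather than throwing an NPE.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class CallbackOrderingSketch {

    /** Hypothetical stand-in for a message executor; not HelixTaskExecutor. */
    static final class Executor {
        private final Map<String, Consumer<String>> handlers = new HashMap<>();
        final List<String> handled = new ArrayList<>();
        final List<String> unread = new ArrayList<>();

        void registerHandler(String type, Consumer<String> handler) {
            handlers.put(type, handler);
        }

        // Invoked by the listener callback. A missing handler is reported
        // with a warning and the message stays unread, instead of the
        // lookup result being dereferenced while null.
        void onMessage(String type, String payload) {
            Consumer<String> handler = handlers.get(type);
            if (handler == null) {
                System.err.println("WARN: no handler registered for " + type
                        + "; leaving message unread");
                unread.add(payload);
                return;
            }
            handler.accept(payload);
            handled.add(payload);
        }
    }

    /** Hypothetical connection: adding a listener delivers pending messages at once. */
    static final class Connection {
        private final List<String[]> pending = new ArrayList<>();

        void enqueue(String type, String payload) {
            pending.add(new String[] {type, payload});
        }

        void addMessageListener(Executor executor) {
            for (String[] message : pending) {
                executor.onMessage(message[0], message[1]);
            }
            pending.clear();
        }
    }

    /** Runs the setup with the listener hooked up either first (racy) or last (fixed). */
    static Executor setup(boolean listenerLast) {
        Connection connection = new Connection();
        connection.enqueue("STATE_TRANSITION", "OFFLINE->ONLINE");
        Executor executor = new Executor();

        if (!listenerLast) {
            connection.addMessageListener(executor); // callback fires before any handler exists
        }
        executor.registerHandler("STATE_TRANSITION", payload -> { });
        if (listenerLast) {
            connection.addMessageListener(executor); // callback fires after registration
        }
        return executor;
    }

    public static void main(String[] args) {
        System.out.println("listener first: unread=" + setup(false).unread);
        System.out.println("listener last:  handled=" + setup(true).handled);
    }
}
```

With the listener added first, the pending message arrives before its handler
exists and is left unread. With the listener added last, the same message is
handled on the first callback.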

Thanks,
Zhen


On Sun, Feb 15, 2015 at 11:45 PM, Zhen Zhang <nehzgnahz@gmail.com> wrote:

> There might be some race conditions. I need to double-check this.
> On Feb 15, 2015 11:38 PM, "Steph Meslin-Weber" <steph@tangency.co.uk>
> wrote:
>
>> Hi Kishore,
>>
>> That's right, the node doesn't process any state transitions. They should
>> have been logged in the first set of logs had they occurred.
>>
>> Thanks,
>> Steph
>> On 16 Feb 2015 07:28, "kishore g" <g.kishore@gmail.com> wrote:
>>
>>> Hi Steph,
>>>
>>> When the NPE occurs, do you get the state transition callbacks?
>>>
>>> thanks,
>>> Kishore G
>>>
>>>
>>>
>>> On Sun, Feb 15, 2015 at 11:23 PM, Steph Meslin-Weber <
>>> steph@tangency.co.uk> wrote:
>>>
>>>> Unfortunately it appears that when the NPE occurs, dropping the
>>>> participant no longer cleans up the related INSTANCE node. Perhaps some
>>>> state is lost?
>>>>
>>>> Thanks,
>>>> Steph
>>>> On 16 Feb 2015 06:52, "Zhen Zhang" <nehzgnahz@gmail.com> wrote:
>>>>
>>>>> I think the NPE is not fatal. It happens when no message handler
>>>>> factory is registered for this message type. The message will not be
>>>>> removed and will remain in the UNREAD state. Later, when the message
>>>>> handler factory is registered via
>>>>> DefaultMessagingService#registerMessageHandlerFactory, we will send a
>>>>> NOP message, which will in turn trigger HelixTaskExecutor to process
>>>>> all UNREAD messages. We should definitely fix this by logging a warning
>>>>> message instead of throwing an NPE.
>>>>>
>>>>> Thanks,
>>>>> Jason
>>>>>
>>>>>
>>>>> On Sun, Feb 15, 2015 at 7:30 PM, kishore g <g.kishore@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Controller assuming the state transition occurred is even more
>>>>>> dangerous.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Feb 15, 2015 at 7:18 PM, vlad.gm@gmail.com <vlad.gm@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> In my experience it was fatal. The callback would not be called, but
>>>>>>> the controller would somehow assume the state transition occurred.
>>>>>>> On Feb 15, 2015 7:13 PM, "kishore g" <g.kishore@gmail.com> wrote:
>>>>>>>
>>>>>>> > Thanks Vlad. That explains the problem. That also explains why
>>>>>>> > adding a sleep of 3 seconds works.
>>>>>>> >
>>>>>>> > Jason, is this exception fatal? Will the message be processed again
>>>>>>> > after the handler is added?
>>>>>>> >
>>>>>>> > thanks,
>>>>>>> > Kishore G
>>>>>>> >
>>>>>>> > On Sun, Feb 15, 2015 at 6:41 PM, vlad.gm@gmail.com <vlad.gm@gmail.com>
>>>>>>> > wrote:
>>>>>>> >
>>>>>>> >> https://issues.apache.org/jira/browse/HELIX-548
>>>>>>> >> On Feb 15, 2015 6:38 PM, "kishore g" <g.kishore@gmail.com> wrote:
>>>>>>> >>
>>>>>>> >> > Hi Vlad,
>>>>>>> >> >
>>>>>>> >> > Was there any jira associated with it?
>>>>>>> >> >
>>>>>>> >> > thanks.
>>>>>>> >> > Kishore G
>>>>>>> >> >
>>>>>>> >> > On Sun, Feb 15, 2015 at 4:36 PM, vlad.gm@gmail.com <vlad.gm@gmail.com>
>>>>>>> >> > wrote:
>>>>>>> >> >
>>>>>>> >> >> Looks like the same problem we encountered recently.
>>>>>>> >> >>
>>>>>>> >> >> Regards,
>>>>>>> >> >> Vlad
>>>>>>> >> >> On Feb 15, 2015 4:35 PM, "kishore g" <g.kishore@gmail.com> wrote:
>>>>>>> >> >>
>>>>>>> >> >> > Steph described this problem on IRC.
>>>>>>> >> >> >
>>>>>>> >> >> > He is using 0.7.1. On connecting to cluster he gets this NPE
>>>>>>> >> >> >
>>>>>>> >> >> > http://pastebin.com/YE3fwK5i
>>>>>>> >> >> >
>>>>>>> >> >> > java.lang.NullPointerException
>>>>>>> >> >> >         at org.apache.helix.messaging.handling.HelixTaskExecutor.createMessageHandler(HelixTaskExecutor.java:661)
>>>>>>> >> >> >         at org.apache.helix.messaging.handling.HelixTaskExecutor.onMessage(HelixTaskExecutor.java:581)
>>>>>>> >> >> >         at org.apache.helix.manager.zk.ZkCallbackHandler.invoke(ZkCallbackHandler.java:202)
>>>>>>> >> >> >         at org.apache.helix.manager.zk.ZkCallbackHandler.init(ZkCallbackHandler.java:336)
>>>>>>> >> >> >         at org.apache.helix.manager.zk.ZkCallbackHandler.<init>(ZkCallbackHandler.java:130)
>>>>>>> >> >> >         at org.apache.helix.manager.zk.ZkHelixConnection.addListener(ZkHelixConnection.java:533)
>>>>>>> >> >> >         at org.apache.helix.manager.zk.ZkHelixConnection.addMessageListener(ZkHelixConnection.java:267)
>>>>>>> >> >> >         at org.apache.helix.manager.zk.ZkHelixParticipant.setupMsgHandler(ZkHelixParticipant.java:347)
>>>>>>> >> >> >         at org.apache.helix.manager.zk.ZkHelixParticipant.init(ZkHelixParticipant.java:383)
>>>>>>> >> >> >         at org.apache.helix.manager.zk.ZkHelixParticipant.onConnected(ZkHelixParticipant.java:401)
>>>>>>> >> >> >         at org.apache.helix.manager.zk.ZkHelixParticipant.start(ZkHelixParticipant.java:428)
>>>>>>> >> >> >         at com.example.ProtostuffServerNode.spinUpParticipant(ProtostuffServerNode.java:134)
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >> > Here is his connection code.
>>>>>>> >> >> >
>>>>>>> >> >> > http://pastebin.com/QRfVU1tc
>>>>>>> >> >> >
>>>>>>> >> >> > private static HelixParticipant spinUpParticipant(HelixAdmin admin,
>>>>>>> >> >> >         ParticipantId participantId) {
>>>>>>> >> >> >     LOGGER.info("Starting up " + participantId);
>>>>>>> >> >> >     HelixConnection connection = new ZkHelixConnection(ZK_ADDRESS);
>>>>>>> >> >> >     connection.connect();
>>>>>>> >> >> >     HelixParticipant participant =
>>>>>>> >> >> >             connection.createParticipant(CLUSTER_ID, participantId);
>>>>>>> >> >> >     StateMachineEngine stateMach = participant.getStateMachineEngine();
>>>>>>> >> >> >
>>>>>>> >> >> >     StateTransitionHandlerFactory<LocalTransitionHandler> transitionHandlerFactory =
>>>>>>> >> >> >             new OnlineOfflineHandlerFactory();
>>>>>>> >> >> >     stateMach.registerStateModelFactory(STATE_MODEL_NAME, transitionHandlerFactory);
>>>>>>> >> >> >     participant.start();
>>>>>>> >> >> >
>>>>>>> >> >> >     admin.enableInstance(CLUSTER_NAME, participantId.toString(), true);
>>>>>>> >> >> >
>>>>>>> >> >> >     return participant;
>>>>>>> >> >> > }
>>>>>>> >> >> >
>>>>>>> >> >> > Adding a 3s sleep after registerStateModelFactory works. Any idea
>>>>>>> >> >> > what is happening?
>>>>>>> >> >> >
>>>>>>> >> >> > thanks,
>>>>>>> >> >> > Kishore G
