helix-user mailing list archives

From Zhen Zhang <nehzgn...@gmail.com>
Subject Re: NPE during start up
Date Mon, 16 Feb 2015 22:13:05 GMT
I don't think it's fatal. When the NPE happens, the messages are marked as
UNPROCESSABLE and removed. All state transitions should still happen once
the message handler factory is registered later; the controller will resend
all transitions. The error messages are harmless.
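
To illustrate the ordering issue discussed in the quoted messages below, here is a minimal sketch in plain Java. It is a hypothetical model, not Helix code: the Executor class, its register and onMessage methods, and the message strings are all stand-ins invented for this example. It only shows why a pending message delivered before any handler is registered cannot be handled, and why registering first avoids that.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class HandlerOrderingSketch {

    /** Hypothetical stand-in for a message executor; this is NOT the Helix class. */
    public static class Executor {
        private final Map<String, Consumer<String>> handlers = new HashMap<>();
        public final List<String> handled = new ArrayList<>();
        public final List<String> unprocessable = new ArrayList<>();

        /** Registers a handler for one message type. */
        public void register(String type) {
            handlers.put(type, payload -> handled.add(type + ":" + payload));
        }

        /** Invoked for each pending message once the listener is attached. */
        public void onMessage(String type, String payload) {
            Consumer<String> handler = handlers.get(type);
            if (handler == null) {
                // Record the message instead of dereferencing a null handler.
                unprocessable.add(type + ":" + payload);
                return;
            }
            handler.accept(payload);
        }
    }

    /** Delivers one pending message either after or before handler registration. */
    public static Executor run(boolean registerFirst) {
        Executor executor = new Executor();
        if (registerFirst) {
            executor.register("STATE_TRANSITION");
            executor.onMessage("STATE_TRANSITION", "OFFLINE->ONLINE");
        } else {
            executor.onMessage("STATE_TRANSITION", "OFFLINE->ONLINE");
            executor.register("STATE_TRANSITION");
        }
        return executor;
    }

    public static void main(String[] args) {
        System.out.println("listener first: unprocessable=" + run(false).unprocessable);
        System.out.println("register first: handled=" + run(true).handled);
    }
}
```

In this model the "listener first" run leaves the message unprocessable, while the "register first" run handles it. The resend by the controller is not modeled here.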

I also tried dropping an instance, and it seems to work fine. When dropping
an instance, remember to first disable the instance and then stop it;
otherwise, some state may remain in ZooKeeper.
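
The disable, stop, drop order can be sketched as a small state check in plain Java. This is a hypothetical model only: the Instance class and its methods are invented for illustration and do not reflect how Helix itself enforces (or does not enforce) the order.

```java
public class DropOrderSketch {

    /** Hypothetical model of an instance lifecycle; NOT a Helix class. */
    public static class Instance {
        private boolean enabled = true;
        private boolean running = true;
        private boolean dropped = false;

        /** Step 1: disable the instance. */
        public void disable() {
            enabled = false;
        }

        /** Step 2: stop the instance, only after it has been disabled. */
        public void stop() {
            if (enabled) {
                throw new IllegalStateException("disable the instance before stopping it");
            }
            running = false;
        }

        /** Step 3: drop the instance, only once it is disabled and stopped. */
        public void drop() {
            if (enabled || running) {
                throw new IllegalStateException("disable and stop the instance before dropping it");
            }
            dropped = true;
        }

        public boolean isDropped() {
            return dropped;
        }
    }

    /** Runs the three steps in the recommended order. */
    public static boolean dropCleanly() {
        Instance instance = new Instance();
        instance.disable();
        instance.stop();
        instance.drop();
        return instance.isDropped();
    }

    public static void main(String[] args) {
        System.out.println("dropped cleanly: " + dropCleanly());
    }
}
```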


On Feb 16, 2015 11:36 AM, "kishore g" <g.kishore@gmail.com> wrote:

> Is there any workaround for this, and is this fatal as Vlad mentioned?
>
> On Mon, Feb 16, 2015 at 10:28 AM, Zhen Zhang <nehzgnahz@gmail.com> wrote:
>
> > There is a timing issue in ZkHelixParticipant#setupMsgHandler(). We should
> > hook up the ZK callback (line 347 in
> > https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/manager/zk/ZkHelixParticipant.java)
> > after all message handler registrations are done (line 354 in
> > https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/manager/zk/ZkHelixParticipant.java).
> > The fix is to move adding the ZK callback to the end. I will add a test
> > case that can reliably reproduce this issue.
> >
> > Thanks,
> > Zhen
> >
> >
> > On Sun, Feb 15, 2015 at 11:45 PM, Zhen Zhang <nehzgnahz@gmail.com> wrote:
> >
> >> There might be some race conditions. I need to double-check this.
> >> On Feb 15, 2015 11:38 PM, "Steph Meslin-Weber" <steph@tangency.co.uk>
> >> wrote:
> >>
> >>> Hi Kishore,
> >>>
> >>> That's right, the node doesn't process any state transitions. They
> >>> should have been logged in the first set of logs had they occurred.
> >>>
> >>> Thanks,
> >>> Steph
> >>> On 16 Feb 2015 07:28, "kishore g" <g.kishore@gmail.com> wrote:
> >>>
> >>>> Hi Steph,
> >>>>
> >>>> When the NPE occurs, do you get the state transition callbacks?
> >>>>
> >>>> thanks,
> >>>> Kishore G
> >>>>
> >>>>
> >>>>
> >>>> On Sun, Feb 15, 2015 at 11:23 PM, Steph Meslin-Weber <
> >>>> steph@tangency.co.uk> wrote:
> >>>>
> >>>>> Unfortunately it appears that when the NPE occurs, dropping the
> >>>>> participant no longer cleans up the related INSTANCE node. Perhaps
> >>>>> some state is lost?
> >>>>>
> >>>>> Thanks,
> >>>>> Steph
> >>>>> On 16 Feb 2015 06:52, "Zhen Zhang" <nehzgnahz@gmail.com> wrote:
> >>>>>
> >>>>>> I think the NPE is not fatal. It happens when no message handler
> >>>>>> factory is registered for this message type. The message will not be
> >>>>>> removed and will remain in the UNREAD state. Later, when the message
> >>>>>> handler factory is registered via
> >>>>>> DefaultMessagingService#registerMessageHandlerFactory, we will send a
> >>>>>> NOP message, which will in turn trigger HelixTaskExecutor to process
> >>>>>> all UNREAD messages. We should definitely fix this by logging a
> >>>>>> warning message instead of throwing an NPE.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Jason
> >>>>>>
> >>>>>>
> >>>>>> On Sun, Feb 15, 2015 at 7:30 PM, kishore g <g.kishore@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Controller assuming the state transition occurred is even more
> >>>>>>> dangerous.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Sun, Feb 15, 2015 at 7:18 PM, vlad.gm@gmail.com <
> >>>>>>> vlad.gm@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> In my experience it was fatal. The callback would not be called, but
> >>>>>>>> the controller would somehow assume the state transition occurred.
> >>>>>>>> On Feb 15, 2015 7:13 PM, "kishore g" <g.kishore@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> > Thanks Vlad. That explains the problem. That also explains why
> >>>>>>>> > adding a sleep of 3 seconds works.
> >>>>>>>> >
> >>>>>>>> > Jason, is this exception fatal? Will the message be processed
> >>>>>>>> > again after the handler is added?
> >>>>>>>> >
> >>>>>>>> > thanks,
> >>>>>>>> > Kishore G
> >>>>>>>> >
> >>>>>>>> > On Sun, Feb 15, 2015 at 6:41 PM, vlad.gm@gmail.com <vlad.gm@gmail.com> wrote:
> >>>>>>>> >
> >>>>>>>> >> https://issues.apache.org/jira/browse/HELIX-548
> >>>>>>>> >> On Feb 15, 2015 6:38 PM, "kishore g" <g.kishore@gmail.com> wrote:
> >>>>>>>> >>
> >>>>>>>> >> > Hi Vlad,
> >>>>>>>> >> >
> >>>>>>>> >> > Was there any jira associated with it?
> >>>>>>>> >> >
> >>>>>>>> >> > thanks.
> >>>>>>>> >> > Kishore G
> >>>>>>>> >> >
> >>>>>>>> >> > On Sun, Feb 15, 2015 at 4:36 PM, vlad.gm@gmail.com <vlad.gm@gmail.com> wrote:
> >>>>>>>> >> >
> >>>>>>>> >> >> Looks like the same problem we encountered recently.
> >>>>>>>> >> >>
> >>>>>>>> >> >> Regards,
> >>>>>>>> >> >> Vlad
> >>>>>>>> >> >> On Feb 15, 2015 4:35 PM, "kishore g" <g.kishore@gmail.com> wrote:
> >>>>>>>> >> >>
> >>>>>>>> >> >> > Steph described this problem on IRC.
> >>>>>>>> >> >> >
> >>>>>>>> >> >> > He is using 0.7.1. On connecting to cluster he gets this NPE
> >>>>>>>> >> >> >
> >>>>>>>> >> >> > http://pastebin.com/YE3fwK5i
> >>>>>>>> >> >> >
> >>>>>>>> >> >> > java.lang.NullPointerException
> >>>>>>>> >> >> >         at org.apache.helix.messaging.handling.HelixTaskExecutor.createMessageHandler(HelixTaskExecutor.java:661)
> >>>>>>>> >> >> >         at org.apache.helix.messaging.handling.HelixTaskExecutor.onMessage(HelixTaskExecutor.java:581)
> >>>>>>>> >> >> >         at org.apache.helix.manager.zk.ZkCallbackHandler.invoke(ZkCallbackHandler.java:202)
> >>>>>>>> >> >> >         at org.apache.helix.manager.zk.ZkCallbackHandler.init(ZkCallbackHandler.java:336)
> >>>>>>>> >> >> >         at org.apache.helix.manager.zk.ZkCallbackHandler.<init>(ZkCallbackHandler.java:130)
> >>>>>>>> >> >> >         at org.apache.helix.manager.zk.ZkHelixConnection.addListener(ZkHelixConnection.java:533)
> >>>>>>>> >> >> >         at org.apache.helix.manager.zk.ZkHelixConnection.addMessageListener(ZkHelixConnection.java:267)
> >>>>>>>> >> >> >         at org.apache.helix.manager.zk.ZkHelixParticipant.setupMsgHandler(ZkHelixParticipant.java:347)
> >>>>>>>> >> >> >         at org.apache.helix.manager.zk.ZkHelixParticipant.init(ZkHelixParticipant.java:383)
> >>>>>>>> >> >> >         at org.apache.helix.manager.zk.ZkHelixParticipant.onConnected(ZkHelixParticipant.java:401)
> >>>>>>>> >> >> >         at org.apache.helix.manager.zk.ZkHelixParticipant.start(ZkHelixParticipant.java:428)
> >>>>>>>> >> >> >         at com.example.ProtostuffServerNode.spinUpParticipant(ProtostuffServerNode.java:134)
> >>>>>>>> >> >> >
> >>>>>>>> >> >> >
> >>>>>>>> >> >> > Here is his connection code.
> >>>>>>>> >> >> >
> >>>>>>>> >> >> > http://pastebin.com/QRfVU1tc
> >>>>>>>> >> >> >
> >>>>>>>> >> >> > private static HelixParticipant spinUpParticipant(HelixAdmin admin, ParticipantId participantId) {
> >>>>>>>> >> >> >     LOGGER.info("Starting up " + participantId);
> >>>>>>>> >> >> >     HelixConnection connection = new ZkHelixConnection(ZK_ADDRESS);
> >>>>>>>> >> >> >     connection.connect();
> >>>>>>>> >> >> >     HelixParticipant participant = connection.createParticipant(CLUSTER_ID, participantId);
> >>>>>>>> >> >> >     StateMachineEngine stateMach = participant.getStateMachineEngine();
> >>>>>>>> >> >> >
> >>>>>>>> >> >> >     StateTransitionHandlerFactory<LocalTransitionHandler> transitionHandlerFactory = new OnlineOfflineHandlerFactory();
> >>>>>>>> >> >> >     stateMach.registerStateModelFactory(STATE_MODEL_NAME, transitionHandlerFactory);
> >>>>>>>> >> >> >     participant.start();
> >>>>>>>> >> >> >
> >>>>>>>> >> >> >     admin.enableInstance(CLUSTER_NAME, participantId.toString(), true);
> >>>>>>>> >> >> >
> >>>>>>>> >> >> >     return participant;
> >>>>>>>> >> >> > }
> >>>>>>>> >> >> >
> >>>>>>>> >> >> > Adding a 3s sleep after registerStateModelFactory works. Any
> >>>>>>>> >> >> > idea what is happening?
> >>>>>>>> >> >> >
> >>>>>>>> >> >> > thanks,
> >>>>>>>> >> >> > Kishore G
