helix-user mailing list archives

From Varun Sharma <va...@pinterest.com>
Subject Re: Helix issue - External View out of sync
Date Tue, 18 Nov 2014 01:08:13 GMT
I looked at the logs and GC was fine, as the system was processing other
events around the same time.

Is there anything else specifically I should look for in the logs? Is there
a way to find out whether a node was removed from the cluster due to a ZK
issue?

Thanks !
Varun

On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <varun@pinterest.com> wrote:

> I am wondering how come a partition was in the online state for a resource
> that was newly created.
>
> Thanks
> Varun
>
> On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <varun@pinterest.com> wrote:
>
>> I am using 0.6.4. In this case, I created a resource and set its ideal
>> state, and the partitions onlined themselves. It seems that this node
>> opened a whole bunch of other partitions at around the same time (~30 or
>> so) but failed to open 3-4 partitions. This was for a brand new resource I
>> created.
>>
>> Thanks!
>> Varun
>>
>> On Mon, Nov 17, 2014 at 4:24 PM, kishore g <g.kishore@gmail.com> wrote:
>>
>>> One suggestion is to check for GC pauses on the nodes. Nodes lose their
>>> cluster membership if they go into a long GC pause or start flapping. That
>>> might be the cause of the state mismatch. However, the external view must
>>> be up to date. It might help if you can attach the controller logs and
>>> node logs.
>>>
>>> On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <varun@pinterest.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am seeing the following issue for many partitions in Helix using a
>>>> simple Online->Offline state model factory. The external view says that
>>>> the partition has been assigned to 3 hosts. However, when I look at the
>>>> hosts, only 1 of them executed the OFFLINE --> ONLINE transition.
>>>>
>>>> On the hosts that did not execute the transition, I see the following:
>>>>
>>>> 2014-11-13 09:29:54,394 [pool-3-thread-11]
>>>> (HelixStateTransitionHandler.java:206) WARN  *Force CurrentState on Zk
>>>> to be stateModel's CurrentState*. *partitionKey: 490*, currentState:
>>>> ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db,
>>>> {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
>>>> EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013,
>>>> FROM_STATE=OFFLINE, MSG_ID=*12690ce8-8098-46b1-a93d-279604f0e3db*,
>>>> MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
>>>> READ_TIMESTAMP=1415870993787,
>>>> RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201,
>>>> SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7,
>>>> STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT,
>>>> TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013,
>>>> TO_STATE=ONLINE}{}{}
>>>>
>>>> When I grep the message ID in the controller, I see the following:
>>>>
>>>> 2014-11-14 09:34:56,265 [StatusDumpTimerTask]
>>>> (ZKPathDataDumpTask.java:155) INFO  {
>>>>
>>>>   "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
>>>>
>>>>   "mapFields" : {
>>>>
>>>>     "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION
>>>> c1193025-b416-49d7-adc2-10afe2389141" : {
>>>>
>>>>       "AdditionalInfo" : "Message execution failed. msgId:
>>>> 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg:
>>>> org.apache.helix.messaging.handling.
>>>> *HelixStateTransitionHandler$HelixStateMismatchException*: Current
>>>> state of stateModel does not match the fromState in Message, Current
>>>> State:ONLINE, message expected:OFFLINE, partition: 490, from:
>>>> hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",
>>>>
>>>>       "Class" : "class
>>>> org.apache.helix.messaging.handling.HelixStateTransitionHandler",
>>>>
>>>>       "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
>>>>
>>>>       "Message state" : "READ"
>>>>
>>>>     },
>>>>
>>>>
>>>> When I restart the node, the error disappears (meaning that the node is
>>>> able to perform the state transition). What could be causing this state
>>>> mismatch?
>>>>
>>>>
>>>> Thanks
>>>>
>>>> Varun
>>>>
>>>
>>>
>>
>
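
When correlating participant and controller logs for a message like the one quoted above, it can help to pull the key=value fields out of the WARN line and turn the millisecond timestamps into readable UTC times. The following is a minimal sketch, not part of Helix: the helper names (parse_fields, to_utc, summarize) are hypothetical, the field string is abbreviated from the log line in this thread, and the reading of EXE_SESSION_ID and TGT_SESSION_ID as "session that executed the message" and "session the message was addressed to" is an assumption based on the field names.

```python
import re
from datetime import datetime, timedelta, timezone

# Fields copied from the WARN line quoted in this thread (abbreviated).
LOG_FIELDS = (
    "{CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange, "
    "EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013, "
    "FROM_STATE=OFFLINE, MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, "
    "PARTITION_NAME=490, READ_TIMESTAMP=1415870993787, "
    "SRC_SESSION_ID=147a7beb2dd8ed7, TGT_SESSION_ID=149a14ada0d0013, "
    "TO_STATE=ONLINE}"
)


def parse_fields(text):
    """Return the KEY=value pairs found in a Helix message dump as a dict."""
    return dict(re.findall(r"(\w+)=([^,{}]+)", text))


def to_utc(millis):
    """Convert an epoch-milliseconds string to an aware UTC datetime."""
    ms = int(millis)
    # Integer arithmetic avoids float rounding in the millisecond part.
    base = datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
    return base + timedelta(milliseconds=ms % 1000)


def summarize(fields):
    """Collect the values that are useful when lining up two logs."""
    created = to_utc(fields["CREATE_TIMESTAMP"])
    started = to_utc(fields["EXECUTE_START_TIMESTAMP"])
    return {
        "partition": fields["PARTITION_NAME"],
        "transition": fields["FROM_STATE"] + " -> " + fields["TO_STATE"],
        "created_utc": created.isoformat(),
        "read_utc": to_utc(fields["READ_TIMESTAMP"]).isoformat(),
        "execute_start_utc": started.isoformat(),
        "create_to_execute": started - created,
        # Assumption: equal IDs mean the message ran under the session it
        # was addressed to.
        "same_session": fields["EXE_SESSION_ID"] == fields["TGT_SESSION_ID"],
    }


if __name__ == "__main__":
    for key, value in summarize(parse_fields(LOG_FIELDS)).items():
        print(key, "=", value)
```

For the line in this thread, the two session IDs are identical and the message was created, read, and executed within about a second. That is consistent with (though it does not prove) the participant keeping the same ZooKeeper session over that interval. A session change would more likely show up as differing IDs, or as a new session ID in later messages for the same host.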
