helix-user mailing list archives

From kishore g <g.kish...@gmail.com>
Subject Re: Helix issue - External View out of sync
Date Tue, 18 Nov 2014 20:56:30 GMT
Did you try Dropbox or any other public file-sharing service?

On Tue, Nov 18, 2014 at 10:57 AM, Varun Sharma <varun@pinterest.com> wrote:

> Hi Zhen,
>
> My logs are over 10 MB and JIRA does not allow me to attach them. Also, Gmail
> is not allowing me to send them, as it flags them as "blocked for
> security reasons" - link here
> <https://support.google.com/mail/answer/6590?hl=en> - Do you have any
> other options for sending the file? I created HELIX-551 for this issue.
>
> Thanks
> Varun
>
> On Mon, Nov 17, 2014 at 6:49 PM, Zhen Zhang <zzhang@linkedin.com> wrote:
>
>>  Hi Varun, I missed the conversation on IRC. You could create a jira at:
>> https://issues.apache.org/jira/browse/HELIX
>>
>> And attach the zk log in the jira. We will be able to figure it out.
>>
>> Thanks,
>> Zhen
>>
>>  ------------------------------
>> *From:* Zhen Zhang [zzhang@linkedin.com]
>> *Sent:* Monday, November 17, 2014 5:16 PM
>> *To:* user@helix.apache.org
>> *Subject:* RE: Helix issue - External View out of sync
>>
>>   Hi, Varun, you can join us on freenode IRC:
>> http://helix.apache.org/IRC.html
>>
>> Thanks,
>> Zhen
>>
>>  ------------------------------
>> *From:* Varun Sharma [varun@pinterest.com]
>> *Sent:* Monday, November 17, 2014 5:08 PM
>> *To:* user@helix.apache.org
>> *Subject:* Re: Helix issue - External View out of sync
>>
>>   I looked at the logs and gc was fine as the system was processing
>> other events around the same time.
>>
>>  Is there anything else specifically I should look for in the logs? Is
>> there a way to find out whether a node was removed from the cluster due to
>> a ZK issue?
>>
>>  Thanks!
>> Varun
>>
>> On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <varun@pinterest.com>
>> wrote:
>>
>>> I am wondering how come a partition was in the online state for a
>>> resource that was newly created.
>>>
>>>  Thanks
>>>  Varun
>>>
>>> On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <varun@pinterest.com>
>>> wrote:
>>>
>>>> I am using 0.6.4. In this case, I created a resource and set its ideal
>>>> state, and the partitions onlined themselves. It seems that this node
>>>> opened a whole bunch of other partitions at around the same time (~30 or
>>>> so) but failed to open 3-4 partitions. This was for a brand new resource
>>>> I created.
>>>>
>>>>  Thanks!
>>>>  Varun
>>>>
>>>> On Mon, Nov 17, 2014 at 4:24 PM, kishore g <g.kishore@gmail.com> wrote:
>>>>
>>>>> One suggestion is to check for GC pauses on the nodes. Nodes lose their
>>>>> cluster membership if they go into a long GC pause or start flapping.
>>>>> That might be the cause of the state mismatch. However, the external
>>>>> view must be up to date. It might help if you can attach the controller
>>>>> logs and node logs.
>>>>>
>>>>> On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <varun@pinterest.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>  I am seeing the following issue for many partitions in Helix using
>>>>>> a simple Online->Offline state model factory. The external view says
>>>>>> that the partition has been assigned to 3 hosts. However, when I look
>>>>>> at the hosts, only 1 of them executed the OFFLINE --> ONLINE transition.
>>>>>>
>>>>>>  On the hosts that did not execute the transition, I see the
>>>>>> following:
>>>>>>
>>>>>>  2014-11-13 09:29:54,394 [pool-3-thread-11]
>>>>>> (HelixStateTransitionHandler.java:206) WARN  *Force CurrentState on
>>>>>> Zk to be stateModel's CurrentState*. *partitionKey: 490*,
>>>>>> currentState: ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db,
>>>>>> {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
>>>>>> EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013,
>>>>>> FROM_STATE=OFFLINE, MSG_ID=*12690ce8-8098-46b1-a93d-279604f0e3db*,
>>>>>> MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
>>>>>> READ_TIMESTAMP=1415870993787,
>>>>>> RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201,
>>>>>> SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7,
>>>>>> STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT,
>>>>>> TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013,
>>>>>> TO_STATE=ONLINE}{}{}
>>>>>>
>>>>>>  When I grep for the message ID in the controller log, I see the following:
>>>>>>
>>>>>>  2014-11-14 09:34:56,265 [StatusDumpTimerTask]
>>>>>> (ZKPathDataDumpTask.java:155) INFO  {
>>>>>>
>>>>>>   "id" :
>>>>>> "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
>>>>>>
>>>>>>   "mapFields" : {
>>>>>>
>>>>>>     "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION
>>>>>> c1193025-b416-49d7-adc2-10afe2389141" : {
>>>>>>
>>>>>>       "AdditionalInfo" : "Message execution failed. msgId:
>>>>>> 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg:
>>>>>> org.apache.helix.messaging.handling.
>>>>>> *HelixStateTransitionHandler$HelixStateMismatchException*: Current
>>>>>> state of stateModel does not match the fromState in Message, Current
>>>>>> State:ONLINE, message expected:OFFLINE, partition: 490, from:
>>>>>> hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",
>>>>>>
>>>>>>       "Class" : "class
>>>>>> org.apache.helix.messaging.handling.HelixStateTransitionHandler",
>>>>>>
>>>>>>       "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
>>>>>>
>>>>>>       "Message state" : "READ"
>>>>>>
>>>>>>     },
>>>>>>
>>>>>>
>>>>>>  When I restart the node, the error disappears (meaning that the
>>>>>> node is able to perform the state transition). What could be causing
>>>>>> this state mismatch?
>>>>>>
>>>>>>
>>>>>>  Thanks
>>>>>>
>>>>>> Varun
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
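
The WARN line quoted in the original report carries its message fields as KEY=value pairs inside braces, with timestamps in epoch milliseconds. When comparing participant and controller logs, it can help to decode those fields. The following is only a minimal sketch in Python 3: the helper names parse_fields and to_utc are hypothetical and are not part of Helix, the sample string is an abbreviated copy of the fields from the quoted log line, and the comma split assumes that no value contains a comma.

```python
import re
from datetime import datetime, timedelta, timezone

# Abbreviated copy of the field block from the WARN line quoted in this thread.
SAMPLE = (
    "{CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange, "
    "EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013, "
    "FROM_STATE=OFFLINE, MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, "
    "PARTITION_NAME=490, READ_TIMESTAMP=1415870993787, "
    "TGT_SESSION_ID=149a14ada0d0013, TO_STATE=ONLINE}{}{}"
)


def parse_fields(text):
    """Return the KEY=value pairs of the first {...} group as a dict.

    Hypothetical helper; assumes values contain neither commas nor braces.
    """
    match = re.search(r"\{([^{}]*)\}", text)
    if match is None:
        return {}
    fields = {}
    for pair in match.group(1).split(","):
        key, sep, value = pair.strip().partition("=")
        if sep:
            fields[key] = value
    return fields


def to_utc(epoch_ms):
    """Convert an epoch-milliseconds string or int to an aware UTC datetime."""
    ms = int(epoch_ms)
    base = datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
    return base + timedelta(milliseconds=ms % 1000)


if __name__ == "__main__":
    fields = parse_fields(SAMPLE)
    for key in ("CREATE_TIMESTAMP", "READ_TIMESTAMP", "EXECUTE_START_TIMESTAMP"):
        print(key, to_utc(fields[key]).isoformat())
    # Delay between the message being created and being read by the participant.
    print("read delay (ms):",
          int(fields["READ_TIMESTAMP"]) - int(fields["CREATE_TIMESTAMP"]))
    # True when the message was executed under the session it was addressed to.
    print("same session:", fields["EXE_SESSION_ID"] == fields["TGT_SESSION_ID"])
```

For the quoted message, the execution session and the target session are the same ID (149a14ada0d0013), so this particular message was not handled under a different ZK session from the one it was addressed to. That observation alone does not rule out an earlier session expiry on the node.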
