From Zhen Zhang <zzh...@linkedin.com>
Subject RE: Helix issue - External View out of sync
Date Tue, 18 Nov 2014 23:06:59 GMT
Hi Varun,

Here is the problem. You are using the ONLINE-OFFLINE state model for multiple resources.
In this case, when you register the state model factory, you need to use your resource name (e.g.
$terrapin$data$meta_pin_join$1415866960201) as the factory name instead of the default
factory name (which is "DEFAULT"), something like this:

HelixManager#getStateMachineEngine#registerStateModelFactory("ONLINEOFFLINE", factory, "$terrapin$data$meta_pin_join$1415866960201")

Otherwise, Helix can't distinguish the state model factories for the two different resources
using the same state model and the same factory name. To confirm, you should have the following
message in your participant log:

WARN: "stateModelFactory for " + stateModelName + " using factoryName DEFAULT has already
been registered."
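
To illustrate the collision, here is a minimal sketch. It is a toy model, not Helix code: it assumes the registry is keyed by the pair (state model name, factory name) and that a second registration under the same pair is rejected with the warning above. The class FactoryRegistrySketch, its methods, and the name "resourceB" are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: this is NOT Helix code. It imitates a registry keyed by
// (state model name, factory name) to show why two registrations under
// "DEFAULT" collide.
final class FactoryRegistrySketch {
    // stateModelName -> (factoryName -> factory)
    private final Map<String, Map<String, Object>> factories = new HashMap<>();

    // Returns false and keeps the earlier factory when the key pair is taken.
    boolean register(String stateModelName, Object factory, String factoryName) {
        Map<String, Object> byFactoryName =
            factories.computeIfAbsent(stateModelName, k -> new HashMap<>());
        if (byFactoryName.containsKey(factoryName)) {
            System.out.println("WARN: stateModelFactory for " + stateModelName
                + " using factoryName " + factoryName + " has already been registered.");
            return false;
        }
        byFactoryName.put(factoryName, factory);
        return true;
    }

    // Returns the registered factory, or null if none matches.
    Object lookup(String stateModelName, String factoryName) {
        Map<String, Object> byFactoryName = factories.get(stateModelName);
        return byFactoryName == null ? null : byFactoryName.get(factoryName);
    }

    public static void main(String[] args) {
        FactoryRegistrySketch registry = new FactoryRegistrySketch();
        Object factoryA = new Object(); // stands in for resource A's factory
        Object factoryB = new Object(); // stands in for resource B's factory

        registry.register("ONLINEOFFLINE", factoryA, "DEFAULT");
        // Same key pair: rejected with the warning, factoryA stays in place.
        boolean accepted = registry.register("ONLINEOFFLINE", factoryB, "DEFAULT");
        System.out.println("second DEFAULT registration accepted: " + accepted);

        // A distinct factory name (here a hypothetical resource name) avoids the clash.
        registry.register("ONLINEOFFLINE", factoryB, "resourceB");
        System.out.println(registry.lookup("ONLINEOFFLINE", "resourceB") == factoryB);
    }
}
```

In this model, the second resource's factory is never stored under "DEFAULT", so any lookup for that pair keeps returning the first factory. Registering under the resource name gives each resource its own entry.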

Let us know if this solves the problem.


From: Varun Sharma
Sent: Tuesday, November 18, 2014 12:59 PM
To: user@helix.apache.org
Subject: Re: Helix issue - External View out of sync

I shared the logs with Zhen using Google Drive.

On Tue, Nov 18, 2014 at 12:56 PM, kishore g wrote:
Did you try Dropbox or any other public file-sharing service?

On Tue, Nov 18, 2014 at 10:57 AM, Varun Sharma wrote:
Hi Zhen,

My logs are > 10 MB and JIRA does not allow me to attach them. Gmail is also not letting
me send them, as it flags them as "blocked for security reasons" (see https://support.google.com/mail/answer/6590?hl=en).
Do you have any other options for sending the file? I created HELIX-551 for this issue.


On Mon, Nov 17, 2014 at 6:49 PM, Zhen Zhang wrote:
Hi Varun, I missed the conversation on IRC. You could create a jira at:

And attach the zk log in the jira. We will be able to figure it out.


From: Zhen Zhang
Sent: Monday, November 17, 2014 5:16 PM
To: user@helix.apache.org
Subject: RE: Helix issue - External View out of sync

Hi, Varun, you can join us on freenode IRC: http://helix.apache.org/IRC.html


From: Varun Sharma
Sent: Monday, November 17, 2014 5:08 PM
To: user@helix.apache.org
Subject: Re: Helix issue - External View out of sync

I looked at the logs and GC was fine, as the system was processing other events around the
same time.

Is there anything else specific I should look for in the logs? Is there a way to find
out whether a node was removed from the cluster due to a ZK issue?

Thanks!

On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma wrote:
I am wondering how come a partition was in the online state for a resource that was newly


On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma wrote:
I am using 0.6.4. In this case, I created a resource and set its ideal state, and the partitions
came online by themselves. It seems that this node opened a whole bunch of other partitions at
around the same time (~30 or so) but failed to open 3-4 partitions. This was for a brand
new resource I created.

Thanks!

On Mon, Nov 17, 2014 at 4:24 PM, kishore g wrote:
One suggestion is to check for GC pauses on the nodes. Nodes lose their cluster membership
if they go into a long GC pause or start flapping. That might be the cause of the state mismatch.
However, the external view must be up to date. It might help if you can attach the controller logs and
node logs.

On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma wrote:

I am seeing the following issue for many partitions in Helix using a simple Online->Offline
state model factory. The external view says that the partition has been assigned to 3 hosts.
However, when I look at the hosts, only 1 of them executed the OFFLINE --> ONLINE transition.

On the hosts that did not execute the transition, I see the following:

2014-11-13 09:29:54,394 [pool-3-thread-11] (HelixStateTransitionHandler.java:206) WARN  Force
CurrentState on Zk to be stateModel's CurrentState. partitionKey: 490, currentState: ONLINE,
message: 12690ce8-8098-46b1-a93d-279604f0e3db, {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
MSG_ID=12690ce8-8098-46b1-a93d-279604f0e3db, MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
READ_TIMESTAMP=1415870993787, RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201, SRC_NAME=hdfsterrapin-a-namenode001_9090,
TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013, TO_STATE=ONLINE}{}{}

When I grep the message ID in the controller, I see the following:

2014-11-14 09:34:56,265 [StatusDumpTimerTask] (ZKPathDataDumpTask.java:155) INFO  {
  "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
  "mapFields" : {
    "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION c1193025-b416-49d7-adc2-10afe2389141" : {
      "AdditionalInfo" : "Message execution failed. msgId: 12690ce8-8098-46b1-a93d-279604f0e3db,
errorMsg: org.apache.helix.messaging.handling.HelixStateTransitionHandler$HelixStateMismatchException:
Current state of stateModel does not match the fromState in Message, Current State:ONLINE,
message expected:OFFLINE, partition: 490, from: hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",
      "Class" : "class org.apache.helix.messaging.handling.HelixStateTransitionHandler",
      "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
      "Message state" : "READ"


When I restart the node, the error disappears (meaning that the node is able to perform
the state transition). What could be causing this state mismatch?



