Mailing-List: contact user-help@helix.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@helix.apache.org
Delivered-To: mailing list user@helix.apache.org
MIME-Version: 1.0
Date: Mon, 17 Nov 2014 17:08:13 -0800
Subject: Re: Helix issue - External View out of sync
From: Varun Sharma <varun@pinterest.com>
To: user@helix.apache.org
Content-Type: multipart/alternative; boundary=001a1134cb0ec09c67050817bafa

--001a1134cb0ec09c67050817bafa
Content-Type: text/plain; charset=UTF-8

I looked at the logs and GC was fine, as the system was processing other
events around the same time.

Is there anything else specific I should look for in the logs? Is there a
way to find out whether a node was removed from the cluster due to a ZK
issue?

Thanks!
Varun

On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <varun@pinterest.com> wrote:

> I am wondering how a partition came to be in the ONLINE state for a
> resource that was newly created.
>
> Thanks
> Varun
>
> On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <varun@pinterest.com> wrote:
>
>> I am using 0.6.4. In this case, I created a resource and set its ideal
>> state, and the partitions onlined themselves. It seems that this node
>> opened a whole bunch of other partitions at around the same time (~30 or
>> so) but failed to open 3-4 partitions. This was for a brand new resource
>> I created.
>>
>> Thanks!
>> Varun
>>
>> On Mon, Nov 17, 2014 at 4:24 PM, kishore g <g.kishore@gmail.com> wrote:
>>
>>> One suggestion is to check for GC pauses on the nodes. Nodes lose their
>>> cluster membership if they get into a long GC pause or start flapping.
>>> That might be the cause of the state mismatch. However, the external
>>> view must be up to date. It might help if you can attach the controller
>>> logs and node logs.
>>>
>>> On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <varun@pinterest.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am seeing the following issue for many partitions in Helix using a
>>>> simple Online->Offline state model factory. The external view says
>>>> that the partition has been assigned to 3 hosts. However, when I look
>>>> at the hosts, only 1 of them executed the OFFLINE --> ONLINE
>>>> transition.
>>>>
>>>> On the hosts that did not execute the transition, I see the following:
>>>>
>>>> 2014-11-13 09:29:54,394 [pool-3-thread-11]
>>>> (HelixStateTransitionHandler.java:206) WARN Force CurrentState on Zk
>>>> to be stateModel's CurrentState. partitionKey: 490, currentState:
>>>> ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db,
>>>> {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
>>>> EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013,
>>>> FROM_STATE=OFFLINE, MSG_ID=12690ce8-8098-46b1-a93d-279604f0e3db,
>>>> MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
>>>> READ_TIMESTAMP=1415870993787,
>>>> RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201,
>>>> SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7,
>>>> STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT,
>>>> TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013,
>>>> TO_STATE=ONLINE}{}{}
>>>>
>>>> When I grep for the message ID in the controller log, I see the
>>>> following:
>>>>
>>>> 2014-11-14 09:34:56,265 [StatusDumpTimerTask]
>>>> (ZKPathDataDumpTask.java:155) INFO {
>>>>   "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
>>>>   "mapFields" : {
>>>>     "HELIX_ERROR 20141113-092954.000419 STATE_TRANSITION
>>>> c1193025-b416-49d7-adc2-10afe2389141" : {
>>>>       "AdditionalInfo" : "Message execution failed. msgId:
>>>> 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg:
>>>> org.apache.helix.messaging.handling.HelixStateTransitionHandler$HelixStateMismatchException:
>>>> Current state of stateModel does not match the fromState in Message,
>>>> Current State:ONLINE, message expected:OFFLINE, partition: 490, from:
>>>> hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",
>>>>       "Class" : "class
>>>> org.apache.helix.messaging.handling.HelixStateTransitionHandler",
>>>>       "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
>>>>       "Message state" : "READ"
>>>>     },
>>>>
>>>> When I restart the node, the error disappears (meaning that the node
>>>> is able to perform the state transition). What could be causing this
>>>> state mismatch?
>>>>
>>>> Thanks
>>>>
>>>> Varun
>>>>
>>>
>>
>

--001a1134cb0ec09c67050817bafa
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--001a1134cb0ec09c67050817bafa--
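
For anyone comparing several of these warnings, the participant-side WARN quoted in the thread can be picked apart mechanically. The following is a minimal sketch in Python, assuming the log line keeps exactly the `key=value, key=value` layout shown in the thread; the function names (`parse_transition_warning`, `millis_to_utc`, `summarize`) are made up for illustration and are not part of Helix. It only reports what the line itself says: whether the reported currentState differs from the message's FROM_STATE, whether EXE_SESSION_ID equals TGT_SESSION_ID, and how the millisecond timestamps line up. What those facts mean for the root cause is left to someone who knows the Helix internals.

```python
import re
from datetime import datetime, timedelta, timezone

# The participant-side WARN from the thread, joined onto one line with the
# mail client's line wrapping removed.
LOG_LINE = (
    "2014-11-13 09:29:54,394 [pool-3-thread-11] "
    "(HelixStateTransitionHandler.java:206) WARN Force CurrentState on Zk "
    "to be stateModel's CurrentState. partitionKey: 490, currentState: "
    "ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db, "
    "{CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange, "
    "EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013, "
    "FROM_STATE=OFFLINE, MSG_ID=12690ce8-8098-46b1-a93d-279604f0e3db, "
    "MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490, "
    "READ_TIMESTAMP=1415870993787, "
    "RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201, "
    "SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7, "
    "STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT, "
    "TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013, "
    "TO_STATE=ONLINE}{}{}"
)


def parse_transition_warning(line):
    """Return (reported currentState, dict of the message's key=value fields)."""
    current = re.search(r"currentState: (\w+)", line)
    blob = re.search(r"\{(.*?)\}", line)  # first {...} group holds the fields
    if current is None or blob is None:
        raise ValueError("line does not match the layout assumed here")
    fields = dict(pair.split("=", 1) for pair in blob.group(1).split(", "))
    return current.group(1), fields


def millis_to_utc(millis):
    """Convert epoch milliseconds (string or int) to an aware UTC datetime."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(milliseconds=int(millis))


def summarize(line):
    """Collect the facts that can be read directly off one WARN line."""
    current, fields = parse_transition_warning(line)
    return {
        "partition": fields["PARTITION_NAME"],
        "state_mismatch": current != fields["FROM_STATE"],
        "same_session": fields["EXE_SESSION_ID"] == fields["TGT_SESSION_ID"],
        "created_utc": millis_to_utc(fields["CREATE_TIMESTAMP"]).isoformat(),
        "create_to_execute_ms": int(fields["EXECUTE_START_TIMESTAMP"])
        - int(fields["CREATE_TIMESTAMP"]),
    }


if __name__ == "__main__":
    for key, value in summarize(LOG_LINE).items():
        print(f"{key}: {value}")
```

Running `summarize` over every such WARN in a participant log and grouping by `partition` would show whether the mismatch is confined to the few partitions that failed to open.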