From: Varun Sharma
To: user@helix.apache.org
Reply-To: user@helix.apache.org
Date: Mon, 17 Nov 2014 16:31:24 -0800
Subject: Re: Helix issue - External View out of sync
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a11348aaa08d16e0508173784

--001a11348aaa08d16e0508173784
Content-Type: text/plain; charset=UTF-8

I am using 0.6.4. In this case, I created a resource and set its ideal
state, and the partitions came online on their own. That node seems to have
opened a whole bunch of other partitions at around the same time (~30 or
so) but failed to open 3-4 partitions. This was for a brand new resource I
created.

Thanks!
Varun

On Mon, Nov 17, 2014 at 4:24 PM, kishore g wrote:

> One suggestion is to check for GC pauses on the nodes. Nodes lose their
> cluster membership if they go into a long GC pause or start flapping.
> That might be the cause of the state mismatch. However, the external view
> must be up to date. It might help if you can attach the controller logs
> and node logs.
>
> On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma wrote:
>
>> Hi,
>>
>> I am seeing the following issue for many partitions in Helix using a
>> simple Online->Offline state model factory. The external view says that
>> the partition has been assigned to 3 hosts. However, when I look at the
>> hosts, only 1 of them executed the OFFLINE --> ONLINE transition.
>>
>> On the hosts that did not execute the transition, I see the following:
>>
>> 2014-11-13 09:29:54,394 [pool-3-thread-11]
>> (HelixStateTransitionHandler.java:206) WARN *Force CurrentState on Zk
>> to be stateModel's CurrentState*. *partitionKey: 490*, currentState:
>> ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db,
>> {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
>> EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013,
>> FROM_STATE=OFFLINE, MSG_ID=*12690ce8-8098-46b1-a93d-279604f0e3db*,
>> MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
>> READ_TIMESTAMP=1415870993787,
>> RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201,
>> SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7,
>> STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT,
>> TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013,
>> TO_STATE=ONLINE}{}{}
>>
>> When I grep for the message ID in the controller log, I see the
>> following:
>>
>> 2014-11-14 09:34:56,265 [StatusDumpTimerTask]
>> (ZKPathDataDumpTask.java:155) INFO {
>>   "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
>>   "mapFields" : {
>>     "HELIX_ERROR 20141113-092954.000419 STATE_TRANSITION
>> c1193025-b416-49d7-adc2-10afe2389141" : {
>>       "AdditionalInfo" : "Message execution failed. msgId:
>> 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg:
>> org.apache.helix.messaging.handling.
>> *HelixStateTransitionHandler$HelixStateMismatchException*: Current state
>> of stateModel does not match the fromState in Message, Current
>> State:ONLINE, message expected:OFFLINE, partition: 490, from:
>> hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",
>>       "Class" : "class
>> org.apache.helix.messaging.handling.HelixStateTransitionHandler",
>>       "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
>>       "Message state" : "READ"
>>     },
>>
>> What could be causing this state mismatch? When I restart the node, the
>> error disappears (meaning that the node is able to perform the state
>> transition).
>>
>> Thanks
>>
>> Varun
>>
>

--001a11348aaa08d16e0508173784
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--001a11348aaa08d16e0508173784--
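
The check behind the HelixStateMismatchException in the quoted log can be illustrated with a toy model. This is only a sketch and is not Helix code: ToyParticipant, StateMismatchError, and handle are hypothetical names, and the idea that a node keeps a per-partition in-memory state that must equal the message's FROM_STATE is an assumption inferred from the error text ("Current state of stateModel does not match the fromState in Message"). It shows one way such a mismatch could arise in the toy model, not what actually happened on the node.

```python
class StateMismatchError(Exception):
    """Toy stand-in for the HelixStateMismatchException named in the log."""


class ToyParticipant:
    """Hypothetical model of one node's in-memory state per partition."""

    def __init__(self):
        # partition name -> state; a missing entry is treated as OFFLINE
        self.current_state = {}

    def handle(self, partition, from_state, to_state):
        """Apply a transition only if the message's from_state matches."""
        current = self.current_state.get(partition, "OFFLINE")
        if current != from_state:
            raise StateMismatchError(
                f"Current State:{current}, message expected:{from_state}, "
                f"partition: {partition}"
            )
        self.current_state[partition] = to_state


node = ToyParticipant()
node.handle("490", "OFFLINE", "ONLINE")      # first message is applied

try:
    node.handle("490", "OFFLINE", "ONLINE")  # same transition again is rejected
    outcome = "applied"
except StateMismatchError as err:
    outcome = str(err)

print(outcome)
```

In this model, a fresh ToyParticipant (analogous to the restarted node) starts with every partition OFFLINE, so the same message would be accepted again.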