From: Aaron Morton
To: Cassandra User
Reply-To: user@cassandra.apache.org
Subject: ApplicationState Schema has drifted from DatabaseDescriptor
Date: Wed, 09 Feb 2011 06:08:02 GMT
Message-id: <4749b17b-5fd4-62fa-88b5-eba40a61726f@me.com>

I noticed this after I upgraded one node in a five-node 0.7 cluster to the latest stable 0.7 build, "2011-02-08_20-41-25" (the upgraded node is jb-cass1 below). This is a long email; you can jump to the end and help me out by checking something on your 0.7 cluster.

This is the value of o.a.c.gms.FailureDetector.AllEndpointStates on jb-cass05 (114.67):

/192.168.114.63   X3:2011-02-08_20-41-25   SCHEMA:2f555eb0-3332-11e0-9e8d-c4f8bbf76455   LOAD:2.84182972E8   STATUS:NORMAL,0
/192.168.114.64   SCHEMA:2f555eb0-3332-11e0-9e8d-c4f8bbf76455   LOAD:2.84354156E8   STATUS:NORMAL,34028236692093846346337460743176821145
/192.168.114.66   SCHEMA:075cbd1f-3316-11e0-9e8d-c4f8bbf76455   LOAD:2.59171601E8   STATUS:NORMAL,102084710076281539039012382229530463435
/192.168.114.65   SCHEMA:075cbd1f-3316-11e0-9e8d-c4f8bbf76455   LOAD:2.70907168E8   STATUS:NORMAL,68056473384187692692674921486353642290
jb08.wetafx.co.nz/192.168.114.67   SCHEMA:075cbd1f-3316-11e0-9e8d-c4f8bbf76455   LOAD:1.155260665E9   STATUS:NORMAL,136112946768375385385349842972707284580

Notice that the schema for nodes 63 and 64 starts with 2f55, while for 65, 66 and 67 it starts with 075.
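The split is easier to see if the gossip state is grouped by schema version. Here is a minimal Python sketch, assuming the AllEndpointStates value has been copied out as text in the layout shown above (one endpoint per line, whitespace-separated KEY:VALUE fields). The function name group_by_schema is made up for illustration and is not part of Cassandra or pycassa.

```python
from collections import defaultdict


def group_by_schema(all_endpoint_states):
    """Map schema version -> sorted list of endpoint IPs, from AllEndpointStates text."""
    groups = defaultdict(list)
    for line in all_endpoint_states.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The first field is "hostname/ip" or "/ip"; keep only the IP part.
        ip = fields[0].rsplit("/", 1)[-1]
        for field in fields[1:]:
            key, _, value = field.partition(":")
            if key == "SCHEMA":
                groups[value].append(ip)
    return {version: sorted(ips) for version, ips in groups.items()}


# Three of the five lines from the output above, as sample input.
states = """\
/192.168.114.63   X3:2011-02-08_20-41-25   SCHEMA:2f555eb0-3332-11e0-9e8d-c4f8bbf76455   LOAD:2.84182972E8   STATUS:NORMAL,0
/192.168.114.65   SCHEMA:075cbd1f-3316-11e0-9e8d-c4f8bbf76455   LOAD:2.70907168E8   STATUS:NORMAL,68056473384187692692674921486353642290
jb08.wetafx.co.nz/192.168.114.67   SCHEMA:075cbd1f-3316-11e0-9e8d-c4f8bbf76455   LOAD:1.155260665E9   STATUS:NORMAL,136112946768375385385349842972707284580
"""

for version, ips in group_by_schema(states).items():
    print(version, ips)
```

More than one key in the result means gossip is advertising more than one schema version.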

This is the output from pycassa calling describe_schema_versions; it is the same whether connected to the 63 (jb-cass1) or the 67 (jb-cass5) node:

In [34]: sys.describe_schema_versions()
Out[34]: 
{'2f555eb0-3332-11e0-9e8d-c4f8bbf76455': ['192.168.114.63',
                                          '192.168.114.64',
                                          '192.168.114.65',
                                          '192.168.114.66',
                                          '192.168.114.67']}

It's reporting all nodes on the 2f55 schema. The SchemaCheckVerbHandler gets its value from DatabaseDescriptor, while the FailureDetector MBean gets its values from Gossiper.endpointStateMap. Requests are working though, so the CF IDs must be matching up.

Commit https://github.com/apache/cassandra/commit/ecbd71f6b4bb004d26e585ca8a7e642436a5c1a4 added code to the HintedHandOffManager on the 0.7 branch to check the schema versions of the nodes it has hints for. This check is now failing on the newly upgraded node as follows:

ERROR [HintedHandoff:1] 2011-02-09 16:11:23,559 AbstractCassandraDaemon.java (line org.apache.cassandra.service.AbstractCassandraDaemon$1.uncaughtException(AbstractCassandraDaemon.java:114)) Fatal exception in thread Thread[HintedHandoff:1,1,main]
java.lang.RuntimeException: java.lang.RuntimeException: Could not reach schema agreement with /192.168.114.64 in 60000ms
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.RuntimeException: Could not reach schema agreement with /192.168.114.64 in 60000ms
        at org.apache.cassandra.db.HintedHandOffManager.waitForSchemaAgreement(HintedHandOffManager.java:256)
        at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:267)
        at org.apache.cassandra.db.HintedHandOffManager.access$100(HintedHandOffManager.java:88)
        at org.apache.cassandra.db.HintedHandOffManager$2.runMayThrow(HintedHandOffManager.java:391)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
        ... 3 more

(The nodes can all see each other; I checked with nodetool during the 60 seconds.)
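For anyone not familiar with this kind of check, the error message suggests a poll-until-equal loop with a timeout. The Python sketch below shows only that general pattern. It is not the Cassandra implementation, and wait_for_schema_agreement, get_local_version and get_remote_version are hypothetical names chosen for illustration.

```python
import time


def wait_for_schema_agreement(get_local_version, get_remote_version,
                              timeout_s=60.0, poll_s=1.0,
                              clock=time.monotonic, sleep=time.sleep):
    """Poll until both callables return the same version, or raise on timeout."""
    deadline = clock() + timeout_s
    while True:
        if get_local_version() == get_remote_version():
            return
        if clock() >= deadline:
            raise RuntimeError(
                "Could not reach schema agreement in %dms" % int(timeout_s * 1000))
        # Versions still differ; wait before asking again.
        sleep(poll_s)
```

If the two sources never converge, a loop of this shape always ends in the timeout, however healthy the network is.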

If I restart one of the nodes with the 075 schema (without upgrading it), it reads the schema from the system tables and goes back to the 2f55 schema. For example, the 64 node was also on the 075 schema; I restarted it, and it moved to 2f55 and logged appropriately. While writing this email I checked again with the 65 node, and the schema it was reporting to other nodes changed from 075 to 2f55 after a restart:

INFO [main] 2011-02-09 17:17:11,457 DatabaseDescriptor.java (line org.apache.cassandra.config.DatabaseDescriptor) Loading schema version 2f555eb0-3332-11e0-9e8d-c4f8bbf76455

I've been reading the code for migrations and gossip, but I don't have a theory as to what is going on.


REQUEST FOR HELP:

If you have a 0.7 cluster, can you please check whether this has happened to you, so I can tell if this is a real problem or just an Aaron problem? You can check by:
- getting the values from o.a.c.gms.FailureDetector.AllEndpointStates
- running describe_schema_versions via the API; here is how to do it via pycassa: http://pycassa.github.com/pycassa/api/pycassa/system_manager.html?highlight=describe_schema_versions
- checking that the schema IDs from the failure detector match the result from describe_schema_versions()
- if they do not match, can you also include some info on what sort of schema changes have happened on the box.
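
The comparison in the steps above can be sketched in Python. This assumes the gossip view has already been reduced to a dict of IP to schema version (by hand from the JMX value, for example), and that the second argument has the shape returned by describe_schema_versions(), which maps a schema version to a list of IPs as in the output earlier in this email. find_schema_drift is a hypothetical helper, not part of pycassa.

```python
def find_schema_drift(gossip_versions, api_versions):
    """Return {ip: (gossip_version, api_version)} for endpoints where the two disagree."""
    # Invert the describe_schema_versions() shape: version -> [ips] becomes ip -> version.
    api_by_ip = {ip: version for version, ips in api_versions.items() for ip in ips}
    drift = {}
    for ip, gossip_version in gossip_versions.items():
        api_version = api_by_ip.get(ip)  # None if the API did not list this endpoint
        if api_version != gossip_version:
            drift[ip] = (gossip_version, api_version)
    return drift


# Two endpoints taken from the output earlier in this email.
gossip = {
    "192.168.114.63": "2f555eb0-3332-11e0-9e8d-c4f8bbf76455",
    "192.168.114.66": "075cbd1f-3316-11e0-9e8d-c4f8bbf76455",
}
api = {"2f555eb0-3332-11e0-9e8d-c4f8bbf76455": ["192.168.114.63", "192.168.114.66"]}

print(find_schema_drift(gossip, api))
```

With pycassa, the second argument could come from pycassa.system_manager.SystemManager('localhost:9160').describe_schema_versions(), where the host and port are placeholders for one of your nodes. An empty result means the two views agree.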

Thanks
Aaron
