incubator-cassandra-user mailing list archives

From graham sanderson <gra...@vast.com>
Subject Re: Strange slow schema agreement on 2.0.9 ... anyone seen this? - knowsVersion may get stuck as false?
Date Sun, 10 Aug 2014 18:14:22 GMT
We saw this problem again today, so it certainly seems reasonable that it was introduced by the
upgrade from 2.0.5 to 2.0.9 (we had never seen it before then).
I think this must be related to https://issues.apache.org/jira/browse/CASSANDRA-6695 or https://issues.apache.org/jira/browse/CASSANDRA-6700
which were both implemented in 2.0.6
The reason I think the problem is a node choosing not to do the schema push is that a “trace”
of a manual table create on the affected nodes (where the problem occurs) shows no messages being
sent to some of the other nodes, whereas the same table creation done from another node may send
messages to all nodes.
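
For anyone else chasing this, the quickest check is “nodetool describecluster”, which lists the
schema version each node is on; the periodic job could also poll the system tables itself and only
proceed once every node reports the same version, rather than assuming agreement happens inside a
fixed timeout. Roughly the sketch below (DataStax Java driver; the contact point and the lack of
error handling are just placeholders):

import java.util.UUID;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

// Compare the schema_version the local node reports against what it has recorded
// for each of its peers; a lingering mismatch is a schema disagreement.
public class SchemaVersionCheck
{
    public static void main(String[] args)
    {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        UUID local = session.execute("SELECT schema_version FROM system.local")
                            .one().getUUID("schema_version");
        System.out.println("local: " + local);

        for (Row row : session.execute("SELECT peer, schema_version FROM system.peers"))
        {
            UUID peerVersion = row.getUUID("schema_version");
            String note = local.equals(peerVersion) ? "" : "   <-- disagrees";
            System.out.println(row.getInet("peer") + ": " + peerVersion + note);
        }

        cluster.close();
    }
}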

Not quite sure exactly what is (or might be) going on; it seems like it could be a race of some
kind (note we have ALWAYS been on 2.0.x in this environment, so it isn’t an upgrade-from-1.x issue)
that leaves the affected node with incorrect state about the other node’s version.
I’m going to add logging in that code path. Gossip still seems to indicate that everything is up -
it is certainly possible that a node appeared to be down earlier due to GC, but whatever state that
leaves behind does not seem to resolve itself later; i.e. even though the schema change is
eventually propagated, future schema changes hit the same problem. Note that whatever the
conditions are, it seems to be a one-way thing: A skips the push to B, but B then happily pulls
from A.
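
Side note for anyone else hitting this: in 2.0 the existing debug logging around schema push/pull
can be switched on in conf/log4j-server.properties with something like

log4j.logger.org.apache.cassandra.service.MigrationManager=DEBUG
log4j.logger.org.apache.cassandra.service.MigrationTask=DEBUG

(I’m assuming MigrationManager/MigrationTask are still the relevant classes in 2.0.9); the extra
logging I add myself will go alongside that.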

Other than schema changes, nothing else seemed to be affected (if nodes thought several other
nodes were down, we’d likely see LOCAL_QUORUM operations fail)… this again points to the new
“getRawVersion” change, which is only used by the schema push/pull (“getVersion” assumes the
current version if no version info is known)… so there must be some sequence of events that
causes node A to (permanently-ish?) lose version information for node B.
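
To spell out what I mean, here is a simplified sketch of my reading of the 2.0.6 change - NOT the
actual MessagingService/MigrationManager code; the version constant and the exact comparison in
shouldPushSchemaTo are my assumptions:

import java.net.InetAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// The point: getVersion() falls back to the current version when nothing is known,
// while getRawVersion() returns null, and the push side only pushes to nodes whose
// raw version is known and matches.
public class VersionSketch
{
    static final int CURRENT_VERSION = 7; // stand-in for MessagingService.current_version
    static final Map<InetAddress, Integer> knownVersions = new ConcurrentHashMap<InetAddress, Integer>();

    static Integer getRawVersion(InetAddress ep)
    {
        return knownVersions.get(ep); // null if we never learned (or somehow lost) the version
    }

    static int getVersion(InetAddress ep)
    {
        Integer raw = getRawVersion(ep);
        return raw == null ? CURRENT_VERSION : raw; // old behaviour: assume current
    }

    static boolean shouldPushSchemaTo(InetAddress ep)
    {
        Integer raw = getRawVersion(ep); // my paraphrase of the push-side check
        return raw != null && raw == CURRENT_VERSION;
    }

    public static void main(String[] args) throws Exception
    {
        InetAddress b = InetAddress.getByName("10.0.0.2"); // "node B"
        // A has no (or has lost) version info for B:
        System.out.println("getVersion(B)         = " + getVersion(b));         // 7 - everything looks normal
        System.out.println("getRawVersion(B)      = " + getRawVersion(b));      // null
        System.out.println("shouldPushSchemaTo(B) = " + shouldPushSchemaTo(b)); // false - B silently skipped
    }
}

If A gets stuck in that “raw version unknown” state for B, everything that goes through getVersion()
still looks healthy, which would match what we are seeing: only the schema push silently skips B.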

On Aug 8, 2014, at 5:06 PM, graham sanderson <graham@vast.com> wrote:

> Actually I think it is a different issue (or a freak issue)… the invocation in InternalResponseStage
> is part of the “schema pull” mechanism this ticket relates to, and in my case it is actually
> repairing (thank you) the schema disagreement once the disagreement is eventually noticed via
> gossip. For whatever reason, the “schema push” mechanism got broken for some nodes. Strange,
> as I say, since the push code looks for live nodes according to gossip, and all nodes were up
> according to gossip info at the time. So, sadly, the new debug logging in the pull path won’t
> help… if it happens again, I’ll have more context to dig deeper before just working around the
> problem by restarting the nodes, which is what I did today.
> 
> On Aug 8, 2014, at 4:37 PM, graham sanderson <graham@vast.com> wrote:
> 
>> Ok thanks - I guess I can at least enable the debug logging added for that issue to see if it
>> is deliberately choosing not to pull the schema… no repro case, but it may happen again!
>> 
>> On Aug 8, 2014, at 4:21 PM, Robert Coli <rcoli@eventbrite.com> wrote:
>> 
>>> On Fri, Aug 8, 2014 at 1:45 PM, graham sanderson <graham@vast.com> wrote:
>>> We have some data that is partitioned into tables created periodically (once a day). This
>>> morning, that automated process timed out because the schema did not reach agreement quickly
>>> enough after we created a new empty table.
>>> 
>>> I have seen this on 1.2.16, but it was supposed to be fixed in 1.2.18 and 2.0.7.
>>> 
>>> https://issues.apache.org/jira/browse/CASSANDRA-6971
>>> 
>>> If you can repro on 2.0.9, I would file a JIRA with repro steps and link it in a reply to
>>> this thread.
>>> 
>>> =Rob 
>> 
> 

