cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-6476) Assertion error in MessagingService.addCallback
Date Thu, 12 Dec 2013 11:40:07 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846256#comment-13846256 ]

Sylvain Lebresne commented on CASSANDRA-6476:
---------------------------------------------

MessagingService ain't the native transport (fyi, the native transport code doesn't leak outside
the org.apache.cassandra.transport package), it's the intra-cluster messaging. In fact, the
stack trace shows that the write that triggers it doesn't even come from the native protocol
but from thrift (which means you either use thrift for some things or something is whack).

But truth is, given the stack trace, where the write comes from doesn't matter. The assertion
that fails is the line
{noformat}
assert previous == null;
{noformat}
in MessagingService.addCallback. And that's where things stop making sense to me. This means
that we tried to add a new message to the callback map but there was already one with the same
messageId. Except that messageId is very straightforwardly generated by an {{incrementAndGet}}
on a static AtomicInteger. And as far as I can tell, no other code inserts into the callback
map without grabbing a new messageId this way (except setCallbackForTests, but that is only
used in a unit test).
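
To make that concrete, here is a minimal, self-contained sketch of the mechanism as described
above, using plain JDK types (a ConcurrentHashMap stands in for the expiring callback map; this
is an illustration, not the actual MessagingService code):
{noformat}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative model of the id generation / callback registration described above.
class CallbackRegistry
{
    // every caller gets a fresh id from a single static counter
    private static final AtomicInteger idGenerator = new AtomicInteger(0);

    // in Cassandra the callbacks live in an expiring map (entries dropped after the
    // rpc timeout); a plain concurrent map stands in for it here
    private final Map<Integer, Runnable> callbacks = new ConcurrentHashMap<>();

    int addCallback(Runnable callback)
    {
        int messageId = idGenerator.incrementAndGet();
        Runnable previous = callbacks.put(messageId, callback);
        // the assertion that fires in the stack trace: a callback with the same messageId
        // was already registered, which should only be possible if the counter wrapped
        // around within the expiry window
        assert previous == null;
        return messageId;
    }
}
{noformat}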

Therefore, it seems the only way such a messageId conflict could happen is if we've gone full
circle on the AtomicInteger and hit the same id again. But entries in callbacks expire after
the rpc timeout, so that implies > 4 billion requests in about 10 seconds. Sounds pretty
unlikely to me.
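
Spelling out that estimate (assuming the roughly 10 second expiry mentioned above):
{noformat}
2^32 possible int messageIds     = 4,294,967,296
wrap within a ~10s expiry window = 4,294,967,296 / 10 ≈ 430 million requests per second
{noformat}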

But I might be missing something obvious: [~jbellis], I believe you might be more familiar
with MessagingService, any idea?


> Assertion error in MessagingService.addCallback
> -----------------------------------------------
>
>                 Key: CASSANDRA-6476
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6476
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra 2.0.2 DCE
>            Reporter: Theo Hultberg
>            Assignee: Sylvain Lebresne
>
> Two of the three Cassandra nodes in one of our clusters just started behaving very strangely
about an hour ago. Within a minute of each other they started logging AssertionErrors (see
stack traces here: https://gist.github.com/iconara/7917438) over and over again. The client
lost connection with the nodes at roughly the same time. The nodes were still up, and even
though no clients were connected to them they continued logging the same errors over and over.
> The errors are in the native transport (specifically MessagingService.addCallback), which
makes me suspect that it has something to do with a test we started running this afternoon.
I've just implemented support for frame compression in my CQL driver, cql-rb. About two hours
before this happened I deployed a version of the application that enabled Snappy compression
on all frames larger than 64 bytes. It's not impossible that there is a bug somewhere in the
driver or compression library that caused this -- but at the same time, it feels like it shouldn't
be possible to make C* a zombie with a bad frame.
> Restarting seems to have gotten them running again, but I suspect they will go down
again sooner or later.
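
For context on the compression change mentioned in the report, the scheme described (compress
only frames above a 64-byte threshold) amounts to something like the following; a generic Java
sketch using snappy-java, not the actual cql-rb driver code, which is Ruby, and the class and
method names here are illustrative:
{noformat}
import java.io.IOException;
import org.xerial.snappy.Snappy;

// Generic illustration of threshold-based frame compression as described in the report.
final class FrameBodyCompressor
{
    private static final int THRESHOLD_BYTES = 64;

    static byte[] maybeCompress(byte[] frameBody) throws IOException
    {
        if (frameBody.length <= THRESHOLD_BYTES)
            return frameBody;              // small frames go out uncompressed
        return Snappy.compress(frameBody); // larger frames get a Snappy-compressed body
    }
}
{noformat}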



