cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-5725) Silently failing messages in case of schema not fully propagated
Date Fri, 05 Jul 2013 09:31:48 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-5725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700563#comment-13700563 ]

Sylvain Lebresne commented on CASSANDRA-5725:
---------------------------------------------

bq. the fix here is to special case UnknownColumnFamilyException so that it gets logged at
INFO or WARN instead of being swallowed by the default IOException handler

Agreed, at least it's by far the simplest fix and it's probably good enough in practice.
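
A minimal sketch of what that special-casing could look like, assuming UnknownColumnFamilyException is a subclass of IOException (which is why the generic handler swallows it). The class, interface, and method names below are hypothetical stand-ins for illustration, not Cassandra's actual receive loop, and the method returns the would-be log line instead of calling a logger so the behaviour is easy to inspect:

```java
import java.io.IOException;
import java.util.UUID;

public class ReceiveLoopSketch {

    /** Hypothetical stand-in for UnknownColumnFamilyException (assumed to extend IOException). */
    static class UnknownTableException extends IOException {
        final UUID cfId;

        UnknownTableException(UUID cfId) {
            super("Couldn't find cfId=" + cfId);
            this.cfId = cfId;
        }
    }

    /** Hypothetical abstraction over "read and deserialize one message from the socket". */
    interface MessageReader {
        void read() throws IOException;
    }

    /** Returns the log line that would be emitted for one read attempt. */
    static String handle(MessageReader reader) {
        try {
            reader.read();
            return "OK";
        } catch (UnknownTableException e) {
            // The more specific catch must come before the IOException catch,
            // otherwise the generic handler below would swallow it.
            return "WARN " + e.getMessage() + "; schema may not be fully propagated yet";
        } catch (IOException e) {
            // Generic handler: anything else is treated as a socket-level problem.
            return "DEBUG IOException reading from socket; closing";
        }
    }
}
```

With this ordering, a reader that throws the stand-in exception yields a line starting with WARN, while any other IOException still goes through the generic path.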

I don't think getting more fancy is worth the complexity that it would add. "you perform your
schema change, then you check for schema agreement before starting to write to the new table"
is not that hard a rule to follow, and a good client driver will do that for you under the
hood anyway :) 
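
For illustration, the "check for schema agreement" step can be sketched as a simple polling loop. This is only a sketch under stated assumptions: the versionSource supplier is hypothetical and stands for whatever returns the schema version currently reported by each node (for instance, values read from the schema_version columns of the system tables), and the timeout and poll interval are arbitrary:

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.UUID;
import java.util.function.Supplier;

public class SchemaAgreementSketch {

    /** True when at least one version was reported and all reported versions are identical. */
    static boolean inAgreement(Collection<UUID> versions) {
        return !versions.isEmpty() && new HashSet<>(versions).size() == 1;
    }

    /**
     * Polls versionSource until every node reports the same schema version
     * or the timeout elapses. Returns true on agreement, false otherwise.
     */
    static boolean awaitAgreement(Supplier<Collection<UUID>> versionSource,
                                  long timeoutMillis,
                                  long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (inAgreement(versionSource.get())) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A client would call awaitAgreement right after the schema change and only start writing to the new table once it returns true.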
                
> Silently failing messages in case of schema not fully propagated
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-5725
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5725
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.2.6
>            Reporter: Sergio Bossa
>
> When a new keyspace and/or column family is created on a multi-node cluster (at least
> three nodes), and a mutation is then executed on the new column family, the operation sometimes
> silently fails by timing out.
> I tracked this down to the schema not being fully propagated to all nodes. Here's what
> happens:
> 1) Node 1 receives the create keyspace/column family request.
> 2) The same node receives a mutation request at CL.QUORUM and forwards it to the other nodes.
> 3) Upon receiving the mutation request, the other nodes try to deserialize it and fail
> if the schema is not fully propagated, i.e. because they don't find the mutated column
> family.
> 4) The connection between node 1 and the failed node is dropped, and the request on the
> former hangs until it times out.
> Here is the underlying exception; I had to tweak several log levels to get it:
> {noformat}
> INFO 13:11:39,441 IOException reading from socket; closing
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=a31c7604-0e40-393b-82d7-ba3d910ad50a
> 	at org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:184)
> 	at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:94)
> 	at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:397)
> 	at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:407)
> 	at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:367)
> 	at org.apache.cassandra.net.MessageIn.read(MessageIn.java:94)
> 	at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:207)
> 	at org.apache.cassandra.net.IncomingTcpConnection.handleModernVersion(IncomingTcpConnection.java:139)
> 	at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:82)
> {noformat}
> Finally, there's probably a related failure happening during repairs of newly created/mutated
> column families, causing the repair process to hang forever as follows:
> {noformat}
> "AntiEntropySessions:1" daemon prio=5 tid=7fe981148000 nid=0x11abea000 in Object.wait() [11abe9000]
>    java.lang.Thread.State: WAITING (on object monitor)
> 	at java.lang.Object.wait(Native Method)
> 	- waiting on <7c6200840> (a org.apache.cassandra.utils.SimpleCondition)
> 	at java.lang.Object.wait(Object.java:485)
> 	at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:34)
> 	- locked <7c6200840> (a org.apache.cassandra.utils.SimpleCondition)
> 	at org.apache.cassandra.service.AntiEntropyService$RepairSession.runMayThrow(AntiEntropyService.java:695)
> 	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> 	at java.lang.Thread.run(Thread.java:680)
> "http-8983-1" daemon prio=5 tid=7fe97d24d000 nid=0x11a5c8000 in Object.wait() [11a5c6000]
>    java.lang.Thread.State: WAITING (on object monitor)
> 	at java.lang.Object.wait(Native Method)
> 	- waiting on <7c620db58> (a org.apache.cassandra.utils.SimpleCondition)
> 	at java.lang.Object.wait(Object.java:485)
> 	at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:34)
> 	- locked <7c620db58> (a org.apache.cassandra.utils.SimpleCondition)
> 	at org.apache.cassandra.service.StorageService$4.runMayThrow(StorageService.java:2442)
> 	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 	at org.apache.cassandra.service.StorageService.forceTableRepairRange(StorageService.java:2409)
> 	at org.apache.cassandra.service.StorageService.forceTableRepair(StorageService.java:2387)
> 	at com.datastax.bdp.cassandra.index.solr.SolrCoreResourceManager.repairResources(SolrCoreResourceManager.java:693)
> 	at com.datastax.bdp.cassandra.index.solr.SolrCoreResourceManager.createCore(SolrCoreResourceManager.java:255)
> 	at com.datastax.bdp.cassandra.index.solr.CassandraCoreAdminHandler.handleCreateAction(CassandraCoreAdminHandler.java:121)
> 	at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:144)
> 	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> 	at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:615)
> 	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:206)
> {noformat}
> I wasn't able to track down any exception, as I can't reproduce it reliably enough, but I believe
> it's related to schema propagation: based on the log messages, the merkle tree request on
> node 1 happens concurrently with schema installation on the other nodes.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
