cassandra-commits mailing list archives

From "Jacek Furmankiewicz (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-8259) Add column family name when reporting OutOfMemory errors
Date Wed, 05 Nov 2014 20:30:34 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-8259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199010#comment-14199010 ]

Jacek Furmankiewicz commented on CASSANDRA-8259:
------------------------------------------------

Well, the problem is that there is TOO much log info.

We have, let's say, 10 app servers. They all started throwing an exception at the same time.

It is very hard to figure out which one of these exceptions was the root cause and which ones
were just caused by Cassandra going down.

And honestly, the "building Thrift response" code should be smart enough to figure out it
is about to bring the whole server down.

A simple check on the number and size of the rows and columns returned, followed by a regular
exception when they are too large, would have been a much better option than crashing the entire server.
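
As a rough illustration of that kind of check (a minimal sketch, not Cassandra's actual code):
the class ResponseSizeGuard, the exception ResponseTooLargeException, and the maxBytes limit
below are all hypothetical names. The idea is to total the column value sizes before the
response is serialized and to fail only that request with an ordinary exception that names
the column family.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;

/** Hypothetical guard for illustration only; not part of Cassandra or Thrift. */
final class ResponseSizeGuard {

    /** Ordinary (non-fatal) exception raised instead of letting the heap fill up. */
    static final class ResponseTooLargeException extends RuntimeException {
        ResponseTooLargeException(String message) {
            super(message);
        }
    }

    private final long maxBytes; // assumed per-response limit chosen by the operator

    ResponseSizeGuard(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /**
     * Sums the sizes of all column values (row key -> column values) and throws
     * as soon as the running total exceeds the limit. Returns the total otherwise.
     */
    long check(String columnFamily, Map<String, List<ByteBuffer>> rows) {
        long total = 0;
        for (Map.Entry<String, List<ByteBuffer>> row : rows.entrySet()) {
            for (ByteBuffer value : row.getValue()) {
                total += value.remaining();
                if (total > maxBytes) {
                    throw new ResponseTooLargeException(
                        "Response for column family '" + columnFamily
                        + "' exceeds " + maxBytes + " bytes (row count: "
                        + rows.size() + ")");
                }
            }
        }
        return total;
    }
}
```

With something along these lines, the client that issued the oversized query would get an
error naming the column family, and the other app servers would not see the node go down.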

I am not sure if the Cassandra team understands how a new technology like this
looks in the eyes of a conservative customer if a single query can crash it.

And the worst part is that this query works fine for other customers. It's very data-size
dependent, and data size varies greatly between customers.
So it's not obvious that a particular query is a threat to the stability of the underlying
DB.

We've had cases where multiple queries from servers at the same time brought down the whole cluster
(not just a single node).

Telling the customer that it is much more stable because it is a distributed DB is a difficult
argument to make after an event like that...

> Add column family name when reporting OutOfMemory errors
> --------------------------------------------------------
>
>                 Key: CASSANDRA-8259
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8259
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jacek Furmankiewicz
>
> When we get a Thrift error like this which causes a server crash:
> {noformat}
> ERROR [Thrift:33] 2014-11-05 17:36:07,486 CassandraDaemon.java (line 196) Exception in thread Thread[Thrift:33,5,main]
> java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.thrift.transport.TFramedTransport.write(TFramedTransport.java:146)
>         at org.apache.thrift.protocol.TBinaryProtocol.writeBinary(TBinaryProtocol.java:183)
>         at org.apache.cassandra.thrift.Column$ColumnStandardScheme.write(Column.java:678)
>         at org.apache.cassandra.thrift.Column$ColumnStandardScheme.write(Column.java:611)
>         at org.apache.cassandra.thrift.Column.write(Column.java:538)
>         at org.apache.cassandra.thrift.ColumnOrSuperColumn$ColumnOrSuperColumnStandardScheme.write(ColumnOrSuperColumn.java:673)
>         at org.apache.cassandra.thrift.ColumnOrSuperColumn$ColumnOrSuperColumnStandardScheme.write(ColumnOrSuperColumn.java:607)
>         at org.apache.cassandra.thrift.ColumnOrSuperColumn.write(ColumnOrSuperColumn.java:517)
>         at org.apache.cassandra.thrift.Cassandra$multiget_slice_result$multiget_slice_resultStandardScheme.write(Cassandra.java:14559)
>         at org.apache.cassandra.thrift.Cassandra$multiget_slice_result$multiget_slice_resultStandardScheme.write(Cassandra.java:14463)
>         at org.apache.cassandra.thrift.Cassandra$multiget_slice_result.write(Cassandra.java:14393)
>         at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:53)
>         at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>         at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:194)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
>  INFO [StorageServiceShutdownHook] 2014-11-05 17:36:07,488 ThriftServer.java (line 141) Stop listening to thrift clients
> {noformat}
> we have no clue as to which column family was being queried. That makes it extremely difficult to troubleshoot which query in a complex code base caused this error.
> We have multiple servers and they all throw a NoAvailableHostException in Astyanax at the same time, all in different parts of the code...so figuring out the root cause is an exercise in frustration that takes many hours.
> At least listing the column family in this message would save us COUNTLESS hours of troubleshooting.
> We're on 2.0.8, JDK 1.7, RHEL 6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
