cassandra-commits mailing list archives

From "Paul Pham (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-6255) Exception count not incremented on OutOfMemoryError (HSHA)
Date Thu, 16 Jan 2014 22:49:22 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-6255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874096#comment-13874096 ]

Paul Pham commented on CASSANDRA-6255:
--------------------------------------

We've experienced almost precisely the same issue: Cassandra stopped listening on 9160
due to an apparent out-of-memory problem, and we also saw the "invalid frame size of 0"
error. We have observed that error to a lesser degree in the past (though it has been
trending up lately) and have not been able to root-cause it.
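As background on that error (a sketch of the wire format, not Cassandra code): Thrift's framed transport prefixes every message with a 4-byte big-endian length, so a client that writes without framing can hand the server bytes whose first four octets decode to 0 - one plausible source of the "invalid frame size of 0" message.

```java
import java.nio.ByteBuffer;

/**
 * Illustration of the framed-transport length prefix. Class and method
 * names are ours, not Cassandra's or Thrift's.
 */
public class FrameDemo {
    // Encode a payload with the 4-byte big-endian length prefix a framed client sends.
    static byte[] frame(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(payload.length);  // frame size
        buf.put(payload);
        return buf.array();
    }

    // What the server reads as the frame size from the first 4 bytes.
    static int readFrameSize(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    public static void main(String[] args) {
        byte[] payload = "hello".getBytes();
        System.out.println(readFrameSize(frame(payload))); // 5: properly framed
        System.out.println(readFrameSize(new byte[8]));    // 0: unframed/zeroed bytes
    }
}
```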

One other interesting data point is that all 3 nodes in our main datacenter hit this
same issue at the same time. Our working theory is that starting many applications
simultaneously was too much for the current cluster to handle (with both reads and
writes using a consistency level of TWO).

Relevant log snippet:
{code}
INFO [ScheduledTasks:1] 2014-01-16 19:01:45,208 GCInspector.java (line 119) GC for ConcurrentMarkSweep:
921 ms for 1 collections, 1389728032 used; max is 2105540608
ERROR [Selector-Thread-0] 2014-01-16 19:01:48,813 CustomTHsHaServer.java (line 187) Uncaught
Exception:
java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.<init>(Unknown Source)
        at java.nio.ByteBuffer.allocate(Unknown Source)
        at org.apache.thrift.server.TNonblockingServer$FrameBuffer.read(TNonblockingServer.java:491)
        at org.apache.cassandra.thrift.CustomTHsHaServer$SelectorThread.handleRead(CustomTHsHaServer.java:273)
        at org.apache.cassandra.thrift.CustomTHsHaServer$SelectorThread.select(CustomTHsHaServer.java:224)
        at org.apache.cassandra.thrift.CustomTHsHaServer$SelectorThread.run(CustomTHsHaServer.java:182)
 INFO [ScheduledTasks:1] 2014-01-16 19:01:48,822 GCInspector.java (line 119) GC for ConcurrentMarkSweep:
3125 ms for 2 collections, 1390753448 used; max is 2105540608
{code}

Strangely, you can see that we're not really close to our max heap limit, although usage
was inching toward our configured CMSInitiatingOccupancyFraction, which at 75% works out
to roughly 1.5 GB of our 2 GB heap.
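For reference, that threshold is set through the JVM options in cassandra-env.sh; a typical CMS configuration looks something like the following (the values shown are our assumption - check your own env file):

```shell
# cassandra-env.sh (illustrative values, not necessarily our exact settings)
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
```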

We were ready to chalk this up to increased load on the cassandra cluster (and a need for
beefier or more nodes). But I'd also echo Dan's concern that the health reporting is neither
sufficient nor accurate for this particular issue (which I think is the main ask of this
bug). We were only able to detect the problem because a monit check was watching for the
expected listening ports.
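On that main ask, the kind of fix the ticket describes could be sketched like this (hypothetical names, not Cassandra's actual classes): an uncaught-exception handler that bumps the exception counter for every Throwable, OutOfMemoryError included, so the reported count cannot stay at 0 after an OOM.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Minimal sketch only; Cassandra's real handler and counter live elsewhere. */
public class ExceptionCounter {
    static final AtomicLong exceptions = new AtomicLong();

    // Handler meant to be installed via Thread.setDefaultUncaughtExceptionHandler.
    static final Thread.UncaughtExceptionHandler HANDLER = (thread, t) -> {
        // Count Errors (OutOfMemoryError included) as well as Exceptions.
        exceptions.incrementAndGet();
        System.err.println("Uncaught exception on " + thread.getName() + ": " + t);
    };

    public static void main(String[] args) throws InterruptedException {
        Thread.setDefaultUncaughtExceptionHandler(HANDLER);
        // Simulate the failure mode; a real OOM would be delivered the same way.
        Thread worker = new Thread(() -> { throw new OutOfMemoryError("simulated"); });
        worker.start();
        worker.join();
        System.out.println("exception count = " + exceptions.get()); // 1
    }
}
```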



> Exception count not incremented on OutOfMemoryError (HSHA)
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-6255
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6255
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Oracle java version "1.7.0_15"
> rpc_server_type: hsha
>            Reporter: Dan Hendry
>            Priority: Minor
>
> One of our nodes decided to stop listening on 9160 (netstat -l was showing nothing and
telnet was reporting connection refused). Nodetool status showed no hosts down and on the
offending node nodetool info gave the following:
> {noformat}
> nodetool info
> Token            : (invoke with -T/--tokens to see all 256 tokens)
> ID               : (removed)
> Gossip active    : true
> Thrift active    : true
> Native Transport active: false
> Load             : 2.05 TB
> Generation No    : 1382536528
> Uptime (seconds) : 432970
> Heap Memory (MB) : 8098.05 / 14131.25
> Data Center      : DC1
> Rack             : RAC2
> Exceptions       : 0
> Key Cache        : size 536854996 (bytes), capacity 536870912 (bytes), 41383646 hits,
1710831591 requests, 0.024 recent hit rate, 0 save period in seconds
> Row Cache        : size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN recent
hit rate, 0 save period in seconds
> {noformat}
> After looking at the cassandra log, I saw a bunch of the following:
> {noformat}
> ERROR [Selector-Thread-16] 2013-10-27 17:36:00,370 CustomTHsHaServer.java (line 187)
Uncaught Exception: 
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:691)
>         at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
>         at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.execute(DebuggableThreadPoolExecutor.java:145)
>         at org.apache.cassandra.thrift.CustomTHsHaServer.requestInvoke(CustomTHsHaServer.java:337)
>         at org.apache.cassandra.thrift.CustomTHsHaServer$SelectorThread.handleRead(CustomTHsHaServer.java:281)
>         at org.apache.cassandra.thrift.CustomTHsHaServer$SelectorThread.select(CustomTHsHaServer.java:224)
>         at org.apache.cassandra.thrift.CustomTHsHaServer$SelectorThread.run(CustomTHsHaServer.java:182)
> ERROR [Selector-Thread-7] 2013-10-27 17:36:00,370 CustomTHsHaServer.java (line 187) Uncaught
Exception: 
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:691)
>         at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
>         at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.execute(DebuggableThreadPoolExecutor.java:145)
>         at org.apache.cassandra.thrift.CustomTHsHaServer.requestInvoke(CustomTHsHaServer.java:337)
>         at org.apache.cassandra.thrift.CustomTHsHaServer$SelectorThread.handleRead(CustomTHsHaServer.java:281)
>         at org.apache.cassandra.thrift.CustomTHsHaServer$SelectorThread.select(CustomTHsHaServer.java:224)
>         at org.apache.cassandra.thrift.CustomTHsHaServer$SelectorThread.run(CustomTHsHaServer.java:182)
> {noformat}
> There wasn't anything else overtly suspicious in the logs except for the occasional 
> {noformat}
> ERROR [Selector-Thread-0] 2013-10-27 17:35:58,662 TNonblockingServer.java (line 468)
Read an invalid frame size of 0. Are you using TFramedTransport on the client side?
> {noformat}
> but that comes up periodically - I have looked into it before and it has never seemed
to have any serious impact.
> This ticket is not about *why* an OutOfMemoryError occurred - which is bad but I don't
think I have enough information to reproduce or speculate on a cause. This ticket is about
the fact that an OutOfMemoryError occurred and nodetool info was reporting Thrift active :
true and Exceptions : 0. 
> Our monitoring systems and investigation processes are both starting to rely on the
exception count. The fact that it was not accurate here is disconcerting.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
