cassandra-commits mailing list archives

From "Joshua McKenzie (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-7507) OOM creates unreliable state - die instantly better
Date Wed, 08 Oct 2014 14:23:34 GMT


Joshua McKenzie commented on CASSANDRA-7507:

Fair enough on the build process - the XML changes aren't that bad, but if something goes south
with it, having to muck around with a Python script adds quite a bit of complexity to the
process. I'll follow up on the CI checking separately.

For points 1 and 2, CASSANDRA-7579 and CASSANDRA-7927 both depend on this ticket, so the
generic approach was future-proofing against the other efforts we already have waiting on
deck. Otherwise I'd totally agree with you on that count.

I'll take care of the logger.fatal change on commit.

> OOM creates unreliable state - die instantly better
> ---------------------------------------------------
>                 Key: CASSANDRA-7507
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Karl Mueller
>            Assignee: Joshua McKenzie
>            Priority: Minor
>             Fix For: 2.1.1
>         Attachments: 7507_v1.txt, 7507_v2.txt, 7507_v3_build.txt, 7507_v3_java.txt, exceptionHandlingResults.txt,
> I had a Cassandra node run OOM. My heap had enough headroom, there was just something
> which either was a bug or some unfortunate amount of short-term memory utilization. This resulted
> in the following error:
>  WARN [StorageServiceShutdownHook] 2014-06-30 09:38:38,251 (line 1713)
> Some hints were not written before shutdown.  This is not supposed to happen.  You should
> (a) run repair, and (b) file a bug report
> There are no other messages of relevance besides the OOM error about 90 minutes earlier.
> My (limited) understanding of the JVM and Cassandra says that when it goes OOM, it will
> attempt to signal Cassandra to shut down "cleanly". The problem, in my view, is that with
> an OOM situation, nothing is guaranteed anymore. I believe it's impossible to reliably "cleanly
> shut down" at this point, and therefore it's wrong to even try.
> Yes, ideally things could be written out, flushed to disk, memory messages written, other
> nodes notified, etc. but why is there any reason to believe any of those steps could happen?
> Would happen? Couldn't bad data be written at this point to disk rather than good data? Some
> network messages delivered, but not others?
> I think Cassandra should have the option (and possibly the default) to kill itself immediately,
> in a hard way, when the OOM condition happens, and not rely on the Java-based clean shutdown
> process. Cassandra already handles recovery from unclean shutdown, and it's not a big deal.
> My node, for example, was kept in a sort-of alive state for 90 minutes where who knows what it
> was doing or not doing.
> I don't know enough about the JVM and options for it to know the best exact implementation
> of "die instantly on OOM", but it should be something that's possible either with some flags
> or a C library (which doesn't rely on Java memory to do something which it may not be able
> to get!)
> Short version: a kill -9 of all C* processes in that instance without needing more Java
> memory, when OOM is raised
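
The "die instantly on OOM" behaviour requested above could be sketched in Java roughly as
follows. This is a minimal illustration under stated assumptions, not the patch attached to
this ticket: the class name OomHalt, its method names, and the exit code are all hypothetical.
It relies on Runtime.halt(), which terminates the JVM without running shutdown hooks, so
nothing like StorageServiceShutdownHook gets a chance to run against a possibly corrupt heap.

```java
// Hypothetical sketch; the names here are illustrative and are not Cassandra's API.
final class OomHalt {
    private OomHalt() {}

    /** Returns true if t, or any throwable in its cause chain, is an OutOfMemoryError. */
    static boolean isOutOfMemory(Throwable t) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 64; depth++, t = t.getCause()) {
            if (t instanceof OutOfMemoryError) {
                return true;
            }
        }
        return false;
    }

    /** Runs killer if t indicates an OOM; returns whether killer was run. */
    static boolean inspect(Throwable t, Runnable killer) {
        if (!isOutOfMemory(t)) {
            return false;
        }
        killer.run();
        return true;
    }

    /** Installs a default handler that halts the JVM, skipping shutdown hooks, on OOM. */
    static void install() {
        // Build the action up front to keep allocation in the handler path to a minimum.
        // The exit code 100 is an arbitrary choice for this sketch.
        final Runnable halt = () -> Runtime.getRuntime().halt(100);
        Thread.setDefaultUncaughtExceptionHandler((thread, err) -> inspect(err, halt));
    }
}
```

Calling OomHalt.install() once at startup would cover errors that escape a thread uncaught;
an OutOfMemoryError that is caught and swallowed elsewhere would need an explicit call to
inspect() at that site. For the flag-based route the reporter mentions, HotSpot also offers
-XX:OnOutOfMemoryError="kill -9 %p", which runs the given command when an OOM is first thrown.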

This message was sent by Atlassian JIRA