cassandra-commits mailing list archives

From "Delaney Manders (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CASSANDRA-4225) EC2 nodes randomly hard-crash the machine on newest EC2 Linux AMI
Date Tue, 08 May 2012 16:25:48 GMT
Delaney Manders created CASSANDRA-4225:
------------------------------------------

             Summary: EC2 nodes randomly hard-crash the machine on newest EC2 Linux AMI
                 Key: CASSANDRA-4225
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4225
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.1.0
         Environment: Amazon Linux AMI release 2012.03
3.2.12-3.2.4.amzn1.x86_64
m1.xlarge

Nodes have:
Cassandra built and installed from source.
Ant binary (apache-ant-1.8.3-bin.tar.gz), automake(1.11.1), autoconf(2.64), libtool(2.2.10)
installed from AWS repository.
Sun Java:

> java -version
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)

Only system changes are:
echo "root soft memlock unlimited" | sudo tee -a /etc/security/limits.conf
echo "root hard memlock unlimited" | sudo tee -a /etc/security/limits.conf
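
One way to confirm that the memlock change actually reached the running JVM is to read the limits the kernel reports for that process. The sketch below is illustrative only: `memlock_limits` is a hypothetical helper name, and the `pgrep` pattern in the usage comment assumes a single Cassandra JVM whose command line contains `CassandraDaemon`. Note that the entries above name root, so they would only be expected to apply to sessions started as root.

```shell
# Hypothetical helper: print the soft and hard "Max locked memory" limits
# from a /proc/<pid>/limits-style file passed as $1.
memlock_limits() {
    awk '/^Max locked memory/ { print $4, $5 }' "$1"
}

# Example usage (assumes one Cassandra JVM, located via pgrep; adjust as needed):
# memlock_limits "/proc/$(pgrep -f CassandraDaemon | head -n 1)/limits"
```

If the limits took effect, both values should read `unlimited`; a numeric value would suggest the process was started outside a session that picked up limits.conf.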

Setup scripts available.

The Cassandra cluster has two datacenters: DC1 has 8 nodes and DC2 has 4, with DC2
reserved for Hadoop jobs.  DC2 nodes hard-crash less often than DC1 nodes, though
they have crashed as well.

Storage is set up with 4 ephemeral drives in a RAID array for the commit log and 4 EBS drives in a RAID array for data.

The workload is exclusively writes.  All mutations are done as batch mutations, where each batch
mutation adds or modifies a set of columns on a single key.  There are ~2000 threads streaming
batch mutations from a web edge of varying size, distributed across DC1.  The client is Hector (1.0-5)
with DynamicLoadBalancing.

In an effort to mitigate this issue, I've removed jna.jar & platform.jar from $CASSANDRA_HOME/lib
and set disk_access_mode: standard in $CASSANDRA_HOME/conf/cassandra.yaml.  Neither change has
helped.
            Reporter: Delaney Manders


At fairly random intervals, about once/day, one of my Cassandra nodes does a hard crash (kernel
panic).  
  
I can find no errors in the system logs (/var/log/*), and none in the Cassandra logs
either.
  
On one machine I was watching as it went down, and caught the following message:
> Message from syslogd@domU-12-31-38-00-64-31 at May  3 18:24:17 ...
>  kernel:[252906.019808] Oops: 0002 [#1] SMP

An AWS support guy found one entry in the console logs:
> [30178.298308] Pid: 2238, comm: java Not tainted 3.2.12-3.2.4.amzn1.x86_64 #1
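
Since these messages are easy to miss on a live terminal, it may help to scan any saved console output or kernel log for them after a crash. The following is a minimal sketch, not an established tool: `find_oops` is a hypothetical helper name, and the patterns are modeled only on the two lines quoted above, so other kernel versions may format these lines differently.

```shell
# Hypothetical helper: print kernel "Oops" lines and the "Pid: ..., comm: ..."
# line that names the faulting process. Arguments are passed straight to grep,
# so they can be file names or extra grep options.
find_oops() {
    grep -E 'Oops: [0-9a-f]+ \[#[0-9]+\]|Pid: [0-9]+, comm: ' "$@"
}

# Example usage (file names are assumptions; use wherever the output was saved):
# find_oops /var/log/messages
# find_oops saved-console-output.txt
```

Matching on the `comm:` field makes it easier to see whether the faulting process is `java` on every crash, as it was in the entry above.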

I've replaced two of the nodes with new instances, but all are showing the same behaviour.

It's very reproducible on my system, though it takes a little waiting.  Leaving it running
for another day or so is no big deal; I just need to restart Cassandra every once in a while
when I get alerted.

I'm open to any additional requested debugging steps before bailing and going back to 1.0.9.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
