cassandra-commits mailing list archives

From "Delaney Manders (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-4225) EC2 nodes randomly hard-crash the machine on newest EC2 Linux AMI
Date Tue, 08 May 2012 16:29:48 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270582#comment-13270582
] 

Delaney Manders commented on CASSANDRA-4225:
--------------------------------------------

Would be happy to provide login credentials for the most recently crashed machine to an active
contributor who wants to see the environment first-hand.
                
> EC2 nodes randomly hard-crash the machine on newest EC2 Linux AMI
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-4225
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4225
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.0
>         Environment: Amazon Linux AMI release 2012.03
> 3.2.12-3.2.4.amzn1.x86_64
> m1.xlarge
> Nodes have:
> Cassandra built and installed from source.
> Ant binary (apache-ant-1.8.3-bin.tar.gz), automake(1.11.1), autoconf(2.64), libtool(2.2.10)
installed from AWS repository.
> Sun Java:
> > java -version
> java version "1.6.0_31"
> Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
> Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)
> Only system changes are:
> echo "root soft memlock unlimited" | sudo tee -a /etc/security/limits.conf
> echo "root hard memlock unlimited" | sudo tee -a /etc/security/limits.conf
> Setup scripts available.
> Cassandra cluster has two datacenters, with DC1 having 8 nodes and DC2 having 4, DC2
being reserved for Hadoop jobs.  DC2 nodes have not had the same frequency of hard crashes,
though it has happened.
> Storage is set up with 4 ephemeral drives raided for commit, 4 EBS drives raided for
storage.
> Usage is exclusively write, with all mutations being done in batch mutations, where each
batch mutation has a set of columns added/modified to a single key.  There are ~2000 threads
streaming batch mutations from a web edge of varying size, distributed across DC1.  Client
is Hector(1.0-5) w/ DynamicLoadBalancing.
> In an effort to mitigate this issue, I've removed jna.jar & platform.jar from $CASSANDRA_HOME/lib,
and set disk_access_mode: standard in $CASSANDRA_HOME/conf/cassandra.yaml.  Neither has seemed
to help.
>            Reporter: Delaney Manders
>
> At fairly random intervals, about once/day, one of my Cassandra nodes does a hard crash
(kernel panic).  
>   
> I can find no system logs (/var/log/*) which have any errors.  No cassandra logs have
any errors.  
>   
> On one machine I was watching as it went down, and caught the following message:  
> > Message from syslogd@domU-12-31-38-00-64-31 at May  3 18:24:17 ...
> >  kernel:[252906.019808] Oops: 0002 [#1] SMP
> An AWS support guy found one entry in the console logs:
> > [30178.298308] Pid: 2238, comm: java Not tainted 3.2.12-3.2.4.amzn1.x86_64 #1
> I've replaced two of the nodes with new instances, but all are showing the same behaviour.
> It's very reproducible on my system, though it takes a little waiting.  Leaving it running
is no big deal for another day or so; I just need to restart Cassandra every once in a while
when I get alerted.  
> I'm open to any additional requested debugging steps before bailing and going back to
1.0.9.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
