cassandra-commits mailing list archives

From "Sam Tunnicliffe (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload
Date Fri, 03 Apr 2015 12:40:53 GMT


Sam Tunnicliffe commented on CASSANDRA-9092:

What consistency level are you writing at? 
How are your clients performing the writes, thrift or native protocol?
How do your clients balance requests? Are they simply sending them round-robin, or using
token-aware routing? Are you writing to only one DC or to both?
Are there errors or warnings in the logs of the nodes which don't fail? 
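The consistency level matters here because it decides whether writes keep succeeding while DC2 replicas are down, and every such write leaves hints behind for those replicas. The sketch below counts the replica acknowledgements each level needs. The layout of RF 3 in each DC is a hypothetical assumption (the ticket does not give the keyspace's replication settings), and the functions are illustrative only, not a driver API.

```python
# Hypothetical layout: RF 3 in each DC (the ticket does not state the real RF).
RF_PER_DC = {"DC1": 3, "DC2": 3}


def quorum(n: int) -> int:
    """A strict majority of n replicas."""
    return n // 2 + 1


def acks_required(cl: str, local_dc: str, rf: dict) -> int:
    """Total replica acknowledgements a write needs at consistency level cl."""
    total = sum(rf.values())
    if cl == "ANY":
        return 0  # a hint stored by the coordinator is enough
    if cl in ("ONE", "LOCAL_ONE"):
        return 1
    if cl == "QUORUM":
        return quorum(total)  # majority across all DCs combined
    if cl == "LOCAL_QUORUM":
        return quorum(rf[local_dc])  # majority in the coordinator's DC only
    if cl == "EACH_QUORUM":
        return sum(quorum(n) for n in rf.values())  # a majority in every DC
    if cl == "ALL":
        return total
    raise ValueError(f"unsupported consistency level: {cl}")


def write_can_succeed(cl: str, local_dc: str, rf: dict, up: dict) -> bool:
    """Whether enough replicas are reachable (up[dc] per DC) for a write at cl."""
    if cl == "EACH_QUORUM":
        return all(up[dc] >= quorum(n) for dc, n in rf.items())
    if cl in ("LOCAL_ONE", "LOCAL_QUORUM"):
        return up[local_dc] >= acks_required(cl, local_dc, rf)
    return sum(up.values()) >= acks_required(cl, local_dc, rf)


if __name__ == "__main__":
    dc2_down = {"DC1": 3, "DC2": 0}  # all DC2 replicas unreachable
    for cl in ("ANY", "ONE", "LOCAL_QUORUM", "QUORUM", "EACH_QUORUM", "ALL"):
        print(cl, acks_required(cl, "DC1", RF_PER_DC),
              write_can_succeed(cl, "DC1", RF_PER_DC, dc2_down))
```

Under these assumptions, with all DC2 replicas unreachable, writes coordinated in DC1 at ANY, ONE or LOCAL_QUORUM still succeed and store hints for DC2, whereas QUORUM, EACH_QUORUM and ALL are rejected up front with an Unavailable error.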

Also, I don't think the schema you posted is complete, as the primary key includes a {{chunk}}
column that is not in the table definition.

If this isn't your regular workload (i.e. it's a periodic bulk load) and you expect the
normal usage pattern to be different, temporarily disabling hinted handoff may be a reasonable
workaround, provided you aren't relying on CL.ANY and your clients handle {{UnavailableException}}
sanely. You'll also need to run repair after the load completes.
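Handling {{UnavailableException}} sanely could look something like the following sketch: retry a bounded number of times with jittered exponential backoff, then report the failure so the bulk loader can re-queue that blob rather than drop it. {{UnavailableError}} and the {{write_fn}} callable are hypothetical stand-ins; a real client would catch its driver's own exception (for example {{cassandra.Unavailable}} in the DataStax Python driver).

```python
import random
import time
from typing import Callable


class UnavailableError(Exception):
    """Hypothetical stand-in for a driver's Unavailable exception: the
    coordinator knew too few replicas were alive to satisfy the CL."""


def write_with_backoff(
    write_fn: Callable[[], None],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call write_fn up to max_attempts times.

    Returns True on success, or False if replicas stayed unavailable,
    so the caller can queue the key for a later re-load.
    """
    for attempt in range(max_attempts):
        try:
            write_fn()
            return True
        except UnavailableError:
            if attempt == max_attempts - 1:
                break
            # Exponential backoff with full jitter gives struggling
            # replicas time to recover instead of hammering them.
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            sleep(random.uniform(0, delay))
    return False


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky_write() -> None:
        # Hypothetical write that is rejected twice, then accepted.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise UnavailableError("not enough replicas alive")

    ok = write_with_backoff(flaky_write, sleep=lambda _s: None)
    print(ok, attempts["n"])
```

Returning False instead of raising keeps the loader in control. That matters with hinted handoff disabled: an Unavailable write was never sent to any replica, so neither hints nor the later repair will recover it, and only the loader can retry it.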
If disabling hinted handoff isn't an option, increasing the number of hint delivery threads
({{max_hints_delivery_threads}}) and raising the throttle ({{hinted_handoff_throttle_in_kb}}) might
prevent a huge buildup of hints if you have sufficient bandwidth and CPU. I doubt it will help much,
though: the nodes or the network are clearly already overwhelmed, otherwise there wouldn't be so many
hints being written in the first place.
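To get a feel for why the defaults struggle with a large backlog, here is a rough drain-time estimate for one coordinator. It assumes, following the comments in the 2.1 {{cassandra.yaml}}, that {{hinted_handoff_throttle_in_kb}} applies per delivery thread and is divided by (cluster size - 1), and that each busy thread replays hints for a different target node. The 35,000 MB backlog (10,000 hinted 3.5 MB blobs) is purely hypothetical.

```python
def hint_drain_hours(
    backlog_mb: float,
    throttle_kb: int = 1024,  # hinted_handoff_throttle_in_kb default in 2.1
    threads: int = 2,         # max_hints_delivery_threads default in 2.1
    cluster_nodes: int = 8,   # 4 in DC1 + 4 in DC2, per the report
    target_nodes: int = 3,    # DC2 nodes that went down, per the report
) -> float:
    """Rough hours for one coordinator to replay backlog_mb of hints.

    Assumption: the throttle is per delivery thread, divided by
    (cluster_nodes - 1), and each busy thread serves a different target.
    """
    per_thread_kb_s = throttle_kb / max(1, cluster_nodes - 1)
    busy_threads = min(threads, target_nodes)  # at most one thread per target
    rate_kb_s = per_thread_kb_s * busy_threads
    return backlog_mb * 1024 / rate_kb_s / 3600


if __name__ == "__main__":
    backlog_mb = 10_000 * 3.5  # hypothetical: 10,000 hinted 3.5 MB blobs
    print(round(hint_drain_hours(backlog_mb), 1))
    print(round(hint_drain_hours(backlog_mb, throttle_kb=10240, threads=8), 1))
```

Under these assumptions the 2.1 defaults need about 34 hours for that backlog, while a 10240 KB throttle with 8 threads brings it down to about 2.3 hours. The faster replay only materialises if the recovering DC2 nodes and the inter-DC link can absorb it on top of the live write load, which is the caveat above.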

> Nodes in DC2 die during and after huge write workload
> -----------------------------------------------------
>                 Key: CASSANDRA-9092
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: CentOS 6.2 64-bit, Cassandra 2.1.2, 
> java version "1.7.0_71"
> Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
>            Reporter: Sergey Maznichenko
>            Assignee: Sam Tunnicliffe
>             Fix For: 2.1.5
>         Attachments: cassandra_crash1.txt
> Hello,
> We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
> Each node is a VM with 8 CPUs and 32 GB RAM.
> During a heavy write workload (loading several million blobs of ~3.5 MB each), 1 node in DC2
> stopped, and after some time the next 2 nodes in DC2 also stopped.
> Now 2 of the nodes in DC2 do not work; they stop 5-10 minutes after starting. I see many
> files in the system.hints table, and the error appears 2-3 minutes after starting system.hints auto
> "Stops" means:
> ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 - Exception in thread Thread[CompactionExecutor:1,1,main]
> java.lang.OutOfMemoryError: Java heap space
> ERROR [HintedHandoff:1] 2015-04-01 23:33:44,456 - Exception in thread Thread[HintedHandoff:1,1,main]
> java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
> The full error listing is attached in cassandra_crash1.txt.
> The problem exists only in DC2. We have 1GbE between DC1 and DC2.
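A back-of-the-envelope check of the figures in the report shows how easily the 1GbE link can become the bottleneck. The sketch assumes one copy of each blob crosses the link (a coordinator forwards a write to one replica in the remote DC, which relays it to the other replicas there) and ignores protocol overhead. The 2 million blobs (for "several millions") and the loader rate of 100 blobs/s are hypothetical.

```python
LINK_BITS_PER_S = 1e9  # 1GbE between DC1 and DC2 (from the report)
BLOB_MB = 3.5          # ~3.5 MB per blob (from the report)


def link_mb_per_s(bits_per_s: float = LINK_BITS_PER_S) -> float:
    """Line rate in decimal megabytes per second, ignoring protocol overhead."""
    return bits_per_s / 8 / 1e6


def max_blobs_per_s(blob_mb: float = BLOB_MB) -> float:
    """Upper bound on blobs per second that can cross the link."""
    return link_mb_per_s() / blob_mb


def transfer_hours(num_blobs: int, blob_mb: float = BLOB_MB) -> float:
    """Link time needed to ship num_blobs once, at full line rate."""
    return num_blobs * blob_mb / link_mb_per_s() / 3600


def backlog_growth_mb_per_s(offered_blobs_per_s: float, blob_mb: float = BLOB_MB) -> float:
    """MB/s of offered writes that cannot cross the link and must back up."""
    excess = offered_blobs_per_s - max_blobs_per_s(blob_mb)
    return max(0.0, excess) * blob_mb


if __name__ == "__main__":
    print(f"link: {link_mb_per_s():.0f} MB/s, max {max_blobs_per_s():.1f} blobs/s")
    print(f"2M blobs: {transfer_hours(2_000_000):.1f} h of link time")  # hypothetical count
    print(f"backlog at 100 blobs/s: {backlog_growth_mb_per_s(100):.0f} MB/s")  # hypothetical rate
```

At line rate the link carries 125 MB/s, i.e. at most about 35 blobs of 3.5 MB per second, so 2 million blobs need roughly 15.6 hours of link time. If the loader offers 100 blobs/s, about 225 MB/s of writes cannot cross the link and has to back up somewhere (queued mutations, timeouts and, in turn, hints).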

This message was sent by Atlassian JIRA
