cassandra-commits mailing list archives

From "Sergey Maznichenko (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload
Date Fri, 03 Apr 2015 14:28:53 GMT


Sergey Maznichenko commented on CASSANDRA-9092:

Consistency level ONE. Clients use the DataStax Java driver.
We are writing only to DC1.

In the logs of the nodes that don't fail, we see errors and warnings during the load:

INFO  [SharedPool-Worker-5] 2015-03-31 15:48:52,534 - Unexpected exception
during request; channel = [id: 0x48b3ad12, / :> /10.XX.XX.10:9042] Error while read(...): Connection reset by peer
        at Method) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
        at ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
        at ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$
        at io.netty.util.concurrent.DefaultThreadFactory$
        at Source) [na:1.7.0_71]

ERROR [Thrift:15] 2015-03-31 11:54:35,163 - Error occurred
during processing of message.
java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation
timed out - received only 2 responses.
        at org.apache.cassandra.auth.Auth.selectUser( ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.auth.Auth.isExistingUser( ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.service.ClientState.login( ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.thrift.CassandraServer.login( ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(
        at org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(
        at org.apache.thrift.ProcessFunction.process( ~[libthrift-0.9.1.jar:0.9.1]
        at org.apache.thrift.TBaseProcessor.process( ~[libthrift-0.9.1.jar:0.9.1]
        at org.apache.cassandra.thrift.CustomTThreadPoolServer$
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [na:1.7.0_71]
        at java.util.concurrent.ThreadPoolExecutor$ Source) [na:1.7.0_71]
        at Source) [na:1.7.0_71]
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received
only 2 responses.
        at org.apache.cassandra.service.ReadCallback.get( ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.service.AbstractReadExecutor.get(
        at org.apache.cassandra.service.StorageProxy.fetchRows( ~[apache-cassandra-2.1.2.jar:2.1.2]
        at ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.cql3.statements.SelectStatement.execute(
        at org.apache.cassandra.cql3.statements.SelectStatement.execute(
        at org.apache.cassandra.auth.Auth.selectUser( ~[apache-cassandra-2.1.2.jar:2.1.2]
        ... 11 common frames omitted

I've changed the schema definition.
It's a periodic workload, so I will disable hinted handoff temporarily. I also disabled compaction
for filespace.filestorage because it takes a long time and yields <1% efficiency.
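For reference, both of these changes can be made at runtime with nodetool (a sketch, not from the original report; assumes nodetool is on the PATH of the affected node and that the heavy table is filespace.filestorage as above):

```shell
# Temporarily stop storing and delivering hints on this node
nodetool disablehandoff

# Stop automatic compaction for the heavy table only
nodetool disableautocompaction filespace filestorage

# Later, once the bulk load has finished:
nodetool enablehandoff
nodetool enableautocompaction filespace filestorage
```

These take effect immediately but do not persist across a restart; the cassandra.yaml settings apply again after the node comes back up.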

My hint settings are now:
hinted_handoff_enabled: 'true'
max_hints_delivery_threads: 4
max_hint_window_in_ms: 10800000
hinted_handoff_throttle_in_kb: 10240
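To make the temporary change survive restarts as well, the same switch lives in cassandra.yaml (a sketch of the relevant fragment; only hinted_handoff_enabled differs from the values above):

```yaml
hinted_handoff_enabled: 'false'    # stop writing hints during the bulk load
max_hints_delivery_threads: 4
max_hint_window_in_ms: 10800000    # 3 hours
hinted_handoff_throttle_in_kb: 10240
```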

I suppose Cassandra should do some kind of partial compaction when system.hints is big, or
purge old hints before compacting. Do you have an idea about the necessary changes for 2.1.5?
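On the nodes already drowning in hints, it may be simpler to drop the backlog outright rather than wait for a huge system.hints compaction (a sketch, not from the original report; assumes losing the undelivered hints is acceptable because a repair will follow):

```shell
# Discard all hints stored on this node
nodetool truncatehints

# Then repair to restore consistency for the writes those hints covered
nodetool repair filespace
```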

> Nodes in DC2 die during and after huge write workload
> -----------------------------------------------------
>                 Key: CASSANDRA-9092
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: CentOS 6.2 64-bit, Cassandra 2.1.2, 
> java version "1.7.0_71"
> Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
>            Reporter: Sergey Maznichenko
>            Assignee: Sam Tunnicliffe
>             Fix For: 2.1.5
>         Attachments: cassandra_crash1.txt
> Hello,
> We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
> Node is VM 8 CPU, 32GB RAM
> During a significant workload (loading several million blobs of ~3.5MB each), 1 node in DC2
stops, and after some time the next 2 nodes in DC2 also stop.
> Now, 2 of the nodes in DC2 do not work and stop 5-10 minutes after start. I see many
files in the system.hints table, and the error appears 2-3 minutes after system.hints auto
compaction starts.
> "Stops" means: ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456
- Exception in thread Thread[CompactionExecutor:1,1,main]
> java.lang.OutOfMemoryError: Java heap space
> ERROR [HintedHandoff:1] 2015-04-01 23:33:44,456 - Exception
in thread Thread[HintedHandoff:1,1,main]
> java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError:
Java heap space
> Full errors listing attached in cassandra_crash1.txt
> The problem exists only in DC2. We have 1GbE between DC1 and DC2.

This message was sent by Atlassian JIRA
