cassandra-commits mailing list archives

From "Sergey Maznichenko (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload
Date Thu, 02 Apr 2015 08:53:53 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392398#comment-14392398 ]

Sergey Maznichenko commented on CASSANDRA-9092:
-----------------------------------------------

The Java heap is sized automatically in cassandra-env.sh. I tried setting MAX_HEAP_SIZE="8G" and HEAP_NEWSIZE="800M", but it didn't help.
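For reference, the overrides go into conf/cassandra-env.sh like this (a sketch; the stock script computes both values automatically when they are left unset, and they must be set together or the automatic sizing takes over):

```shell
# In conf/cassandra-env.sh -- override automatic heap sizing.
# MAX_HEAP_SIZE and HEAP_NEWSIZE must both be set, or neither.
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"
```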

nodetool disableautocompaction didn't help: compactions resume after the node restarts.
nodetool truncatehints didn't help either: it printed a message like 'cannot stop running hint compaction'.

One of the nodes had ~24,000 files in system/hints-...; I stopped that node and deleted them, which helped, and the node has now been running for about 10 hours. Another node has 18,154 files in system/hints-... (~1.1 TB) and shows the same problem; I am leaving it up for experiments.
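The manual cleanup I did on the first node can be sketched as a small script (hypothetical path; assumes the Cassandra process is already stopped and that discarding the hints, i.e. the queued undelivered writes, is acceptable):

```python
import glob
import os

# Hypothetical location of the system.hints SSTable files on CentOS;
# adjust to the data_file_directories setting in cassandra.yaml.
HINTS_GLOB = "/var/lib/cassandra/data/system/hints-*/*"

def clear_hints(pattern=HINTS_GLOB, dry_run=True):
    """Count (and optionally delete) accumulated hint SSTable files.

    Only run this while the Cassandra process is stopped; deleting
    hints throws away writes that were queued for down replicas.
    """
    files = glob.glob(pattern)
    total_bytes = sum(os.path.getsize(f) for f in files)
    if not dry_run:
        for f in files:
            os.remove(f)
    return len(files), total_bytes
```

Running it with dry_run=True first shows how many files and bytes would be removed before anything is touched.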

Workload: 20-40 processes on the application servers, each one loading files into blobs in one big table; each file is about 3.5 MB, and the key is a UUID.

CREATE KEYSPACE filespace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '1', 'DC2': '1'} AND durable_writes = true;

CREATE TABLE filespace.filestorage (
    key text,
    filename text,
    chunk int,  -- referenced by the primary key below but missing from the original listing; type assumed
    value blob,
    PRIMARY KEY (key, chunk)
) WITH COMPACT STORAGE
    AND CLUSTERING ORDER BY (chunk ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

nodetool status filespace
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load     Tokens  Owns (effective)  Host ID                               Rack
UN  10.X.X.12   4.82 TB  256     28.0%             25cefe6a-a9b1-4b30-839d-46ed5f4736cc  RAC1
UN  10.X.X.13   3.98 TB  256     22.9%             ef439686-1e8f-4b31-9c42-f49ff7a8b537  RAC1
UN  10.X.X.10   4.52 TB  256     26.1%             a11f52a6-1bff-4b47-bfa9-628a55a058dc  RAC1
UN  10.X.X.11   4.01 TB  256     23.1%             0f454fa7-5cdf-45b3-bf2d-729ab7bd9e52  RAC1
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load     Tokens  Owns (effective)  Host ID                               Rack
UN  10.X.X.137  4.64 TB  256     22.6%             e184cc42-7cd9-4e2e-bd0d-55a6a62f69dd  RAC1
UN  10.X.X.136  1.25 TB  256     27.2%             c8360341-83e0-4778-b2d4-3966f083151b  RAC1
DN  10.X.X.139  4.81 TB  256     25.8%             1f434cfe-6952-4d41-8fc5-780a18e64963  RAC1
UN  10.X.X.138  3.69 TB  256     24.4%             b7467041-05d9-409f-a59a-438d0a29f6a7  RAC1

I need a workaround to prevent this situation with hints.

Now we use the default values:
hinted_handoff_enabled: 'true'
max_hints_delivery_threads: 2
max_hint_window_in_ms: 10800000
hinted_handoff_throttle_in_kb: 1024

Should I disable hints, or increase the number of delivery threads and the throttle?

For example:
hinted_handoff_enabled: 'true'
max_hints_delivery_threads: 20
max_hint_window_in_ms: 108000000
hinted_handoff_throttle_in_kb: 10240
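As a rough sanity check on those numbers (my own back-of-envelope arithmetic, treating the throttle as the effective delivery rate per node, which simplifies how Cassandra divides hinted_handoff_throttle_in_kb across delivery targets):

```python
# How long would ~1.1 TB of accumulated hints take to replay
# at a given hinted-handoff throttle (in KB/s)?
def replay_days(backlog_tb, throttle_kb_per_s):
    backlog_kb = backlog_tb * 1024**3  # TB -> KB, binary units
    seconds = backlog_kb / throttle_kb_per_s
    return seconds / 86400.0

default_days = replay_days(1.1, 1024)     # default throttle: ~13 days
proposed_days = replay_days(1.1, 10240)   # 10x throttle: ~1.3 days
```

Even at ten times the default throttle, a terabyte-scale backlog takes over a day to drain, so preventing the buildup (or clearing the hints manually) matters more than tuning the replay rate after the fact.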


> Nodes in DC2 die during and after huge write workload
> -----------------------------------------------------
>
>                 Key: CASSANDRA-9092
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9092
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: CentOS 6.2 64-bit, Cassandra 2.1.2, 
> java version "1.7.0_71"
> Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
>            Reporter: Sergey Maznichenko
>             Fix For: 2.1.5
>
>         Attachments: cassandra_crash1.txt
>
>
> Hello,
> We have Cassandra 2.1.2 with 8 nodes: 4 in DC1 and 4 in DC2.
> Each node is a VM with 8 CPUs and 32 GB RAM.
> During a significant workload (loading several million blobs, ~3.5 MB each), 1 node in DC2 stops, and after some time the next 2 nodes in DC2 also stop.
> Now, 2 of the nodes in DC2 do not work and stop 5-10 minutes after start. I see many files in the system.hints table, and the error appears 2-3 minutes after the system.hints auto compaction starts.
> The problem exists only in DC2. We have 1 GbE between DC1 and DC2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
