cassandra-commits mailing list archives

From "Peter Kovgan (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-10937) OOM on multiple nodes on write load (v. 3.0.0), problem also present on DSE-4.8.3, but there it survives more time
Date Mon, 25 Jan 2016 05:49:39 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114782#comment-15114782
] 

Peter Kovgan edited comment on CASSANDRA-10937 at 1/25/16 5:49 AM:
-------------------------------------------------------------------

Jack,
This (row count) is from the latest tests, where we abandoned the "complex" multi-node tests
and now test only one node (no ring, no replication), just to get a feel for where that node's
maximum load lies.

I have no data on row counts from the multi-node tests.

Today is the 5th day of the low-load test (5 MB/sec), and it is still working.
The disk I/O stats show no increase in %iowait (unlike in the OOM tests), so I'm fairly sure
the cause was poor I/O under heavy load.
We plan to increase the load to find the node's maximum.
Then we will attach the other nodes, set RF=2, and figure out the maximum for that configuration.

The problem is "too high a supply rate combined with gradually degrading I/O".
If there were some method to alert about that situation early, it would be great (for future
generations of testers).
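[Editorial note: a minimal sketch of the kind of early alert wished for above, assuming a Linux host where %iowait can be derived from two samples of the aggregate "cpu" line in /proc/stat. The 20% threshold is an illustrative, made-up value, not a recommendation.]

```python
# Early-warning check for gradually degrading disk I/O, based on the
# %iowait figure derivable from /proc/stat on Linux.
# The 20% threshold is an illustrative value, not a recommendation.

IOWAIT_ALERT_THRESHOLD = 20.0  # percent; hypothetical cut-off


def read_cpu_ticks(path="/proc/stat"):
    """Return the aggregate cpu tick counters as a list of ints.

    The first line of /proc/stat looks like:
    cpu  user nice system idle iowait irq softirq ...
    """
    with open(path) as f:
        fields = f.readline().split()
    return [int(x) for x in fields[1:]]


def iowait_percent(before, after):
    """%iowait over the interval between two tick samples."""
    delta = [b - a for a, b in zip(before, after)]
    total = sum(delta)
    if total == 0:
        return 0.0
    return 100.0 * delta[4] / total  # index 4 is the iowait field


def should_alert(before, after, threshold=IOWAIT_ALERT_THRESHOLD):
    """True if %iowait over the sampling interval exceeded the threshold."""
    return iowait_percent(before, after) > threshold
```

In a real deployment this would be polled on an interval and feed a monitoring system rather than a one-off check, but it captures the signal the tests above showed: %iowait climbing under sustained write load.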





was (Author: tierhetze):
Jack,
This (row count) is from the latest tests, where we abandoned the "complex" multi-node tests
and now test only one node (no ring, no replication), just to get a feel for where that node's
maximum load lies.

I have no data on row counts from the multi-node tests.

Today is the 5th day of the low-load test (5 MB/sec), and it is still working.
The disk I/O stats show no increase in %iowait (unlike in the OOM tests), so I'm fairly sure
the cause was poor I/O under heavy load.
We plan to increase the load to find the node's maximum.
Then we will attach the other nodes, set RF=2, and figure out the maximum for that configuration.

The problem is "too high a supply rate combined with gradually degrading I/O".
If there were some method to alert about that situation early, it would be great (for future
generations of testers).
I wonder, maybe I should join the open-source project and start contributing :)
Just kidding.



> OOM on multiple nodes on write load (v. 3.0.0), problem also present on DSE-4.8.3, but there it survives more time
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10937
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10937
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra : 3.0.0
> Installed from the standalone archive, not via any OS-specific installer.
> Java:
> Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
> OS :
> Linux version 2.6.32-431.el6.x86_64 (mockbuild@x86-023.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Sun Nov 10 22:19:54 EST 2013
> We have:
> 8 guests (Linux OS as above) on 2 (VMware-managed) physical hosts. Each physical host keeps 4 guests.
> Physical host parameters(shared by all 4 guests):
> Model: HP ProLiant DL380 Gen9
> Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
> 46 logical processors.
> Hyperthreading - enabled
> Each guest assigned to have:
> 1 disk 300 Gb for seq. log (NOT SSD)
> 1 disk 4T for data (NOT SSD)
> 11 CPU cores
> Disks are local, not shared.
> Memory on each host -  24 Gb total.
> 8 (or 6, tested both) Gb - cassandra heap
> (lshw and cpuinfo attached in file test2.rar)
>            Reporter: Peter Kovgan
>            Priority: Critical
>         Attachments: cassandra-to-jack-krupansky.docx, gc-stat.txt, more-logs.rar, some-heap-stats.rar, test2.rar, test3.rar, test4.rar, test5.rar, test_2.1.rar, test_2.1_logs_older.rar, test_2.1_restart_attempt_log.rar
>
>
> 8 cassandra nodes.
> Load test started with 4 clients (different, non-identical machines), each running 1000 threads.
> Each thread was assigned in round-robin fashion to run one of 4 different inserts.
> Consistency->ONE.
> I attach the full CQL schema of tables and the query of insert.
> Replication factor - 2:
> create keyspace OBLREPOSITORY_NY with replication = {'class':'NetworkTopologyStrategy','NY':2};
> Initial throughput is:
> 215,000 inserts/sec
> or
> 54 MB/sec, considering a single insert is a bit larger than 256 bytes.
> Data:
> all fields (5-6) are short strings, except one, which is a BLOB of 256 bytes.
> After about 2-3 hours of work, I was forced to increase the timeout from 2000 to 5000 ms, because some requests failed due to the short timeout.
> Later on (after approx. 12 hours of work) OOM happens on multiple nodes.
> (all failed nodes' logs attached)
> I also attach the Java load client and instructions on how to set it up and use it (test2.rar).
> Update:
> Later on, the test was repeated with a lighter load (100,000 messages/sec) and a more relaxed CPU (25% idle), with only 2 test clients, but the test failed anyway.
> Update:
> DSE-4.8.3 also failed with OOM (3 nodes out of 8), but there it survived 48 hours, not 10-12.
> Attachments:
> test2.rar - contains most of the material
> more-logs.rar - contains additional node logs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
