Date: Tue, 26 Nov 2013 09:46:35 +0000 (UTC)
From: "Jason Harvey (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Updated] (CASSANDRA-6405) When making heavy use of counters, neighbor nodes occasionally enter spiral of constant memory consumption

     [ https://issues.apache.org/jira/browse/CASSANDRA-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Harvey updated CASSANDRA-6405:
------------------------------------

    Reproduced In: 1.2.11, 1.1.7, 1.0.12  (was: 1.0.12, 1.1.7, 1.2.11)
      Description:

We're randomly running into an interesting issue on our ring. When making use of counters, we'll occasionally have 3 nodes (always neighbors) suddenly start filling up memory, CMSing, filling up again, and repeating. This pattern goes on for 5-20 minutes. Nearly all requests to the nodes time out during this period. Restarting one, two, or all three of the nodes does not resolve the spiral; after a restart the three nodes immediately start hogging memory again and CMSing constantly. When the issue resolves itself, all 3 nodes immediately get better. Sometimes it reoccurs in bursts, where the ring will be thrashing for 20 minutes, fine for 5, thrashing for 20, and repeat that cycle a few times.

Cassandra produces no unusual logs during this period, other than records of the constant dropped read requests and the constant CMS runs. I have analyzed the log files prior to multiple distinct instances of this issue and found no preceding events associated with it. I have also verified that our apps are not performing an unusual number or type of requests during this time.

This behaviour occurred on 1.0.12, 1.1.7, and now on 1.2.11.

I managed to get an hprof dump of the issue happening in the wild. There is something notable in the class instance counts. Here are the top counts for this one node:

{code}
5967846 instances of class org.apache.cassandra.db.CounterColumn
1247525 instances of class com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue
1247310 instances of class org.apache.cassandra.cache.KeyCacheKey
{code}

Is it normal for CounterColumn to have that number of instances?
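To put those counts in context against the 8G heap, here is a rough back-of-the-envelope estimate in plain Java. The class name and the per-instance byte figures below are assumptions made purely for illustration (actual footprints depend on the JVM, pointer compression, and the buffers each object references), so the result is an order-of-magnitude sketch, not a measurement taken from the hprof:

{code}
// Back-of-the-envelope heap estimate from the histogram above.
// The per-instance byte costs are ASSUMPTIONS (they vary with JVM version,
// compressed oops, and the exact field layout of these classes); treat the
// output as an order-of-magnitude sketch only.
public class HistogramEstimate {
    public static void main(String[] args) {
        long counterColumns = 5_967_846L;   // org.apache.cassandra.db.CounterColumn
        long weightedValues = 1_247_525L;   // ConcurrentLinkedHashMap$WeightedValue
        long keyCacheKeys   = 1_247_310L;   // org.apache.cassandra.cache.KeyCacheKey

        // Assumed average retained bytes per instance (hypothetical figures).
        long bytesPerCounterColumn = 200;
        long bytesPerWeightedValue = 32;
        long bytesPerKeyCacheKey   = 64;

        long totalBytes = counterColumns * bytesPerCounterColumn
                        + weightedValues * bytesPerWeightedValue
                        + keyCacheKeys   * bytesPerKeyCacheKey;

        System.out.printf("~%.2f GiB retained by these three classes alone%n",
                totalBytes / (1024.0 * 1024.0 * 1024.0));
    }
}
{code}

Under these assumed sizes the three classes alone would account for on the order of a gigabyte of live data on an 8G heap, which at least makes the constant CMS activity described above plausible.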
The data model for how we use counters is as follows: between 50 and 20,000 counter columns per key. We currently have around 3 million keys total, but this issue also reproduced when we only had a few thousand keys total. The average column count is around 1k, and the 90th percentile is around 18k. New columns are added regularly, and columns are incremented regularly. No column or key deletions occur. We probably have 1-5k "hot" keys at any given time, spread across the entire ring. The R:W ratio is typically around 50:1. This is the only CF we're using counters on at this time. CF details are as follows (a sketch of the composite cell-name layout follows the block below):

{code}
ColumnFamily: CommentTree
  Key Validation Class: org.apache.cassandra.db.marshal.AsciiType
  Default column value validator: org.apache.cassandra.db.marshal.CounterColumnType
  Cells sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.LongType)
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 0.01
  DC Local Read repair chance: 0.0
  Populate IO Cache on flush: false
  Replicate on write: true
  Caching: KEYS_ONLY
  Bloom Filter FP chance: default
  Built indexes: []
  Compaction Strategy: org.apache.cassandra.db.compaction.LeveledCompactionStrategy
  Compaction Strategy Options:
    sstable_size_in_mb: 160

Column Family: CommentTree
  SSTable count: 30
  SSTables in each level: [1, 10, 19, 0, 0, 0, 0, 0, 0]
  Space used (live): 4656930594
  Space used (total): 4677221791
  SSTable Compression Ratio: 0.0
  Number of Keys (estimate): 679680
  Memtable Columns Count: 8289
  Memtable Data Size: 2639908
  Memtable Switch Count: 5769
  Read Count: 185479324
  Read Latency: 1.786 ms.
  Write Count: 5377562
  Write Latency: 0.026 ms.
  Pending Tasks: 0
  Bloom Filter False Positives: 2914204
  Bloom Filter False Ratio: 0.56403
  Bloom Filter Space Used: 523952
  Compacted row minimum size: 30
  Compacted row maximum size: 4866323
  Compacted row mean size: 7742
  Average live cells per slice (last five minutes): 39.0
  Average tombstones per slice (last five minutes): 0.0
{code}

Please let me know if I can provide any further information. I can provide the hprof if desired; however, it is 3GB, so I'll need to provide it outside of JIRA.
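As noted above, each CommentTree cell name is a CompositeType of three LongTypes. The sketch below illustrates how one such composite name would be laid out, assuming the standard CompositeType encoding of a 2-byte length, the component value, and a trailing end-of-component byte per component; the class and method here are hypothetical helpers written only to show the layout, not Cassandra APIs:

{code}
import java.nio.ByteBuffer;

// Illustrative only: encodes one composite cell name of three longs,
// assuming the CompositeType layout of
//   <2-byte length><component bytes><1-byte end-of-component>
// per component, with LongType stored as an 8-byte big-endian value.
public class CompositeNameSketch {
    static ByteBuffer encode(long a, long b, long c) {
        ByteBuffer bb = ByteBuffer.allocate(3 * (2 + 8 + 1));
        for (long v : new long[] {a, b, c}) {
            bb.putShort((short) 8);  // component length: LongType is 8 bytes
            bb.putLong(v);           // big-endian long value
            bb.put((byte) 0);        // end-of-component byte
        }
        bb.flip();
        return bb;
    }

    public static void main(String[] args) {
        System.out.println("encoded name size = " + encode(1L, 2L, 3L).remaining() + " bytes"); // 33
    }
}
{code}

Under this layout a key holding the reported 18k-20k cells carries several hundred kilobytes of cell names alone, before counter values and per-cell overhead are counted, which fits the multi-megabyte compacted row maximum shown in the stats above.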
> When making heavy use of counters, neighbor nodes occasionally enter spiral of constant memory consumption
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-6405
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6405
>             Project: Cassandra
>          Issue Type: Bug
>        Environment: RF of 3, 15 nodes.
>                     Sun Java 7.
>                     Xmx of 8G.
>                     No row cache.
>           Reporter: Jason Harvey
>
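Given the environment above (Sun Java 7, an 8G heap) and the constant CMS runs described in the report, one way to quantify that collector activity is to poll the JVM's garbage-collector MXBeans. The standalone sketch below uses only the standard java.lang.management API; in practice the same beans would be read from Cassandra's JVM remotely over JMX (for example with jconsole) rather than in-process:

{code}
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Minimal sketch: print cumulative collection counts and times for every
// collector the JVM exposes. On a CMS-configured heap, back-to-back old
// generation collections show up as a rapidly climbing count and time.
public class GcPoll {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%-25s count=%d timeMs=%d%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(10_000L);
        }
    }
}
{code}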
--
This message was sent by Atlassian JIRA
(v6.1#6144)