Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 7779AD0AC for ; Sat, 15 Sep 2012 15:23:54 +0000 (UTC) Received: (qmail 85779 invoked by uid 500); 15 Sep 2012 15:23:51 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 85706 invoked by uid 500); 15 Sep 2012 15:23:50 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 85697 invoked by uid 99); 15 Sep 2012 15:23:50 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 15 Sep 2012 15:23:50 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=FSL_RCVD_USER,HTML_MESSAGE,NORMAL_HTTP_TO_IP,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of skrolle@gmail.com designates 209.85.160.172 as permitted sender) Received: from [209.85.160.172] (HELO mail-gh0-f172.google.com) (209.85.160.172) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 15 Sep 2012 15:23:42 +0000 Received: by ghbg10 with SMTP id g10so965531ghb.31 for ; Sat, 15 Sep 2012 08:23:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=YcUddIoCTKF5MOi8+33chgecCXRFe4UbH80hOd99UQk=; b=trhJayV/WN/g4kSkGamyKMsYPYt5V06c5qsnLNMeMfXeV2f9Ee21uXMXpgIfECSvcv LXaped+5yekC14xks8mUQcbws9d7iFQctHsM+VBpf4Ay2swpfLB1IVBUCTn/SiZ7z3ld fcJCgf+VhXXCraVMcHGQrpLVl6znZTeJiBupk6lViPLnfyfYaY1oRH6cnPmd9DgwAPrX 8icYatCi+jxJwVcqmPLpDnsjD4SFS+x4PykBX5qgv9T2LQIuwmxghCCjpH4HvN5dg4O8 QEsg8php+arg3aNI4ZX2xLUQxqFqonUKD2QO1cjoW5qcIbEa1KouDrEK5NhCm46QeYNZ Eqlg== MIME-Version: 1.0 Received: by 10.236.117.97 with SMTP id i61mr7662344yhh.73.1347722601638; Sat, 15 Sep 2012 08:23:21 -0700 (PDT) Received: by 10.100.125.2 with HTTP; Sat, 15 Sep 2012 08:23:21 -0700 (PDT) In-Reply-To: References: Date: Sat, 15 Sep 2012 17:23:21 +0200 Message-ID: Subject: Re: Suggestions for tuning and monitoring. From: =?ISO-8859-1?Q?Henrik_Schr=F6der?= To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=20cf3011df95ed29c904c9bf1e23 --20cf3011df95ed29c904c9bf1e23 Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: quoted-printable I started digging in the logfiles, and here's the full log of a crash. We have 4GB of heap, and if you watch either OpsCenter or through a JMX console, used heap size is always below 3GB, and it swings between 1.5 to 3GB ever 5 minutes or so. So why would it suddenly run out of heap space? INFO 14:08:50,078 Completed flushing /data-ssd/cassandra/data/system/HintsColumnFamily-hd-10-Data.db (5328100 bytes) INFO 14:08:50,079 4103 MUTATION messages dropped in last 5000ms INFO 14:08:50,080 Pool Name Active Pending Blocked INFO 14:08:50,086 Writing Memtable-A@1734568216(117729674/147162092 serialized/live bytes, 12097 ops) INFO 14:08:50,089 ReadStage 32 363 0 INFO 14:08:50,089 RequestResponseStage 0 0 0 INFO 14:08:50,089 ReadRepairStage 1 1 0 INFO 14:08:50,089 MutationStage 32 5672 0 INFO 14:08:50,090 ReplicateOnWriteStage 0 0 0 INFO 14:08:50,090 GossipStage 0 0 0 INFO 14:08:50,090 AntiEntropyStage 0 0 0 INFO 14:08:50,090 MigrationStage 0 0 0 INFO 14:08:50,090 StreamStage 0 0 0 INFO 14:08:50,091 MemtablePostFlusher 1 6 0 INFO 14:08:50,091 FlushWriter 1 5 0 INFO 14:08:50,091 MiscStage 0 0 0 INFO 14:08:50,091 InternalResponseStage 0 0 0 INFO 14:08:50,094 Compacting [SSTableReader(path=3D'/data-ssd/cassandra/data/system/HintsColumnFamily-hd= -9-Data.db'), SSTableReader(path=3D'/data-ssd/cassandra/data/system/HintsColumnFamily-hd-= 10-Data.db'), SSTableReader(path=3D'/data-ssd/cassandra/data/system/HintsColumnFamily-hd-= 8-Data.db'), SSTableReader(path=3D'/data-ssd/cassandra/data/system/HintsColumnFamily-hd-= 7-Data.db')] INFO 14:08:50,098 HintedHandoff 0 0 0 INFO 14:08:50,098 CompactionManager 0 2 INFO 14:08:50,099 MessagingService n/a 0,1 INFO 14:08:50,099 ColumnFamily Memtable ops,data Row cache size/cap Key cache size/cap INFO 14:08:50,099 system.NodeIdInfo 0,0 0/0 0/0 INFO 14:08:50,099 system.IndexInfo 0,0 0/0 0/0 INFO 14:08:50,099 system.LocationInfo 0,0 0/0 0/0 INFO 14:08:50,100 system.Versions 0,0 0/0 0/0 INFO 14:08:50,107 system.Migrations 0,0 0/0 0/0 INFO 14:08:50,107 system.HintsColumnFamily 0,0 0/0 0/0 INFO 14:08:50,107 system.Schema 0,0 0/0 0/0 INFO 14:08:50,108 K.A 920,12311092 0/0 0/0 INFO 14:08:50,127 K.B 111,1005040 0/0 6000/6000 INFO 14:08:50,127 K.C 13,23283 0/0 211817/211817 INFO 14:08:50,127 OpsCenter.rollups300 0,0 0/0 4/4 INFO 14:08:50,128 OpsCenter.rollups60 1876,1452889 0/0 30/30 INFO 14:08:50,128 OpsCenter.rollups7200 0,0 0/0 30/30 INFO 14:08:50,137 Enqueuing flush of Memtable-rollups60@33961585(43148/145= 2889 serialized/live bytes, 1876 ops) INFO 14:08:50,160 OpsCenter.rollups86400 0,0 0/0 0/0 INFO 14:08:50,169 OpsCenter.events 0,0 0/0 13/13 INFO 14:08:50,174 OpsCenter.events_timeline 0,0 0/0 0/0 INFO 14:08:50,175 OpsCenter.settings 0,0 0/0 0/0 INFO 14:08:50,175 OpsCenter.pdps 470,103983 0/0 0/0 INFO 14:09:09,860 GC for ConcurrentMarkSweep: 2560 ms for 6 collections, 3669729184 used; max is 4118806528 WARN 14:09:09,860 Heap is 0.8909690608317886 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically WARN 14:09:09,860 Flushing CFS(Keyspace=3D'K', ColumnFamily=3D'A') to reli= eve memory pressure java.lang.OutOfMemoryError: Java heap space Dumping heap to /var/lib/cassandra/java_1346605112.hprof ... Heap dump file created [5553296481 bytes in 67.950 secs] INFO 14:11:54,810 InetAddress /10.100.1.116 is now dead. INFO 14:11:54,813 InetAddress /10.100.1.117 is now dead. INFO 14:11:54,814 InetAddress /10.100.1.118 is now dead. ERROR 14:11:54,830 Fatal exception in thread Thread[CompactionExecutor:565,1,main] java.lang.OutOfMemoryError: Java heap space at org.apache.cassandra.io.util.FastByteArrayOutputStream.expand(FastByteArray= OutputStream.java:104) at org.apache.cassandra.io.util.FastByteArrayOutputStream.write(FastByteArrayO= utputStream.java:220) at java.io.DataOutputStream.write(Unknown Source) at org.apache.cassandra.utils.ByteBufferUtil.write(ByteBufferUtil.java:323) at org.apache.cassandra.utils.ByteBufferUtil.writeWithLength(ByteBufferUtil.ja= va:316) at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:63= ) at org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:36= 1) at org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:33= 4) at org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFa= milySerializer.java:89) at org.apache.cassandra.db.compaction.PrecompactedRow.write(PrecompactedRow.ja= va:136) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:160= ) at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.ja= va:158) at org.apache.cassandra.db.compaction.CompactionManager$1.call(CompactionManag= er.java:135) at org.apache.cassandra.db.compaction.CompactionManager$1.call(CompactionManag= er.java:115) at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) INFO 14:11:54,839 InetAddress /10.100.1.116 is now UP INFO 14:11:54,849 Stop listening to thrift clients INFO 14:11:54,850 InetAddress /10.100.1.117 is now UP INFO 14:11:54,850 InetAddress /10.100.1.118 is now UP INFO 14:11:54,862 Waiting for messaging service to quiesce INFO 14:12:06,909 MessagingService shutting down server thread. ERROR 14:12:06,912 Fatal exception in thread Thread[Thread-117,5,main] java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExec= ution(DebuggableThreadPoolExecutor.java:60) at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source) at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:515= ) at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpCo= nnection.java:157) at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.ja= va:115) ERROR 14:12:06,913 Fatal exception in thread Thread[Thread-116,5,main] java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExec= ution(DebuggableThreadPoolExecutor.java:60) at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source) at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:515= ) at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpCo= nnection.java:157) at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.ja= va:115) ERROR 14:12:19,466 Fatal exception in thread Thread[Thread-118,5,main] java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExec= ution(DebuggableThreadPoolExecutor.java:60) at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source) at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:515= ) at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpCo= nnection.java:157) at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.ja= va:115) INFO 14:12:19,466 Started hinted handoff for token: 1 with IP: / 10.100.1.116 INFO 14:13:33,343 Completed flushing /data-ssd/cassandra/data/K/A-hd-279743-Data.db (43508009 bytes) INFO 14:13:33,346 flushing high-traffic column family CFS(Keyspace=3D'OpsCenter', ColumnFamily=3D'rollups7200') (estimated 0 byte= s) INFO 14:13:33,350 flushing high-traffic column family CFS(Keyspace=3D'OpsCenter', ColumnFamily=3D'rollups86400') (estimated 0 byt= es) INFO 14:13:33,351 Writing Memtable-B@797498287(674046/53923680 serialized/live bytes, 1388162 ops) INFO 14:13:33,349 Compacting [SSTableReader(path=3D'/data-ssd/cassandra/data/K/A-hd-279741-Data.db'), SSTableReader(path=3D'/data-ssd/cassandra/data/K/A-hd-279235-Data.db'), SSTableReader(path=3D'/data-ssd/cassandra/data/K/A-hd-279743-Data.db'), SSTableReader(path=3D'/data-ssd/cassandra/data/K/A-hd-279506-Data.db')] INFO 14:13:33,351 flushing high-traffic column family CFS(Keyspace=3D'OpsCenter', ColumnFamily=3D'events') (estimated 0 bytes) INFO 14:13:33,350 Enqueuing flush of Memtable-A@884602824(10126517/1265814= 6 serialized/live bytes, 948 ops) INFO 14:13:33,369 flushing high-traffic column family CFS(Keyspace=3D'OpsCenter', ColumnFamily=3D'events_timeline') (estimated 0 bytes) INFO 14:13:33,369 flushing high-traffic column family CFS(Keyspace=3D'OpsCenter', ColumnFamily=3D'settings') (estimated 0 bytes) INFO 14:13:33,369 flushing high-traffic column family CFS(Keyspace=3D'OpsCenter', ColumnFamily=3D'pdps') (estimated 103983 bytes) INFO 14:14:50,122 Completed flushing /data-ssd/cassandra/data/K/B-hd-14662-Data.db (5340050 bytes) INFO 14:14:50,123 26225 MUTATION messages dropped in last 5000ms INFO 14:14:50,123 Writing Memtable-C@841365631(436486/8776427 serialized/live bytes, 8752 ops) INFO 14:14:50,123 9 READ_REPAIR messages dropped in last 5000ms INFO 14:14:50,131 1519 READ messages dropped in last 5000ms INFO 14:14:50,146 Enqueuing flush of Memtable-pdps@1698224970(39228/103983 serialized/live bytes, 470 ops) INFO 14:14:50,167 Pool Name Active Pending Blocked INFO 14:14:50,169 ReadStage 20 20 0 INFO 14:14:50,169 RequestResponseStage 0 0 0 INFO 14:14:50,170 ReadRepairStage 1 1 0 INFO 14:14:50,170 ReplicateOnWriteStage 0 0 0 INFO 14:14:50,171 GossipStage 0 0 0 INFO 14:14:50,171 AntiEntropyStage 0 0 0 INFO 14:14:50,171 MigrationStage 0 0 0 INFO 14:14:50,172 StreamStage 0 0 0 INFO 14:14:50,172 MemtablePostFlusher 1 5 0 INFO 14:14:50,172 FlushWriter 1 5 1 INFO 14:14:50,173 MiscStage 0 0 0 INFO 14:14:50,173 InternalResponseStage 0 0 0 INFO 14:14:50,174 HintedHandoff 1 3 0 INFO 14:14:50,174 CompactionManager 1 4 INFO 14:14:50,174 MessagingService n/a 0,0 INFO 14:14:50,174 ColumnFamily Memtable ops,data Row cache size/cap Key cache size/cap INFO 14:14:50,194 system.NodeIdInfo 0,0 0/0 0/0 INFO 14:14:50,194 system.IndexInfo 0,0 0/0 0/0 INFO 14:14:50,194 system.LocationInfo 0,0 0/0 0/0 INFO 14:14:50,195 system.Versions 0,0 0/0 0/0 INFO 14:14:50,195 system.Migrations 0,0 0/0 0/0 INFO 14:14:50,195 system.HintsColumnFamily 13,964160 0/0 0/0 INFO 14:14:50,195 system.Schema 0,0 0/0 0/0 INFO 14:14:50,195 K.A 0,0 0/0 0/0 INFO 14:14:50,195 K.B 115,1037040 0/0 6000/6000 INFO 14:14:50,196 K.C 13,23283 0/0 211817/211817 INFO 14:14:50,196 OpsCenter.rollups300 0,0 0/0 4/4 INFO 14:14:50,196 OpsCenter.rollups60 0,0 0/0 30/30 INFO 14:14:50,196 OpsCenter.rollups7200 0,0 0/0 30/30 INFO 14:14:50,196 OpsCenter.rollups86400 0,0 0/0 0/0 INFO 14:14:50,196 OpsCenter.events 0,0 0/0 13/13 INFO 14:14:50,197 OpsCenter.events_timeline 0,0 0/0 0/0 INFO 14:14:50,197 OpsCenter.settings 0,0 0/0 0/0 INFO 14:14:50,197 OpsCenter.pdps 0,0 0/0 0/0 Service exit with a return value of 100 On Tue, Sep 11, 2012 at 3:57 AM, aaron morton wrot= e: > It's impossible to start new connections, or impossible to send requests, > or it just doesn't return anything when you've sent a request. > > If it's totally frozen it sounds like GC. How long does it freeze for? > > Despite that, we occasionally get OOM exceptions, and nodes crashing, > maybe a few times per month. > > Do you have an error stack ? > > We can't find anything in the cassandra logs indicating that something's = up > > Is it logging dropped messages or high TP pending ? Are the freezes > associated with compaction or repair running? > > and occasionally we do bulk deletion of supercolumns in a row. > > mmm, are you sending a batch mutation with lots-o-deletes ? Each row > mutation (insert or delete) in the batch becomes a thread pool tasks. If > you send 1,000 rows in a batch you will temporarily prevent other request= s > from being served. > > The config options we are unsure about are things like commit log sizes, = =85. > > I would try to find some indication of what's going on before tweaking. > Have you checked iostat ? > > Hope that helps. > > ----------------- > Aaron Morton > Freelance Developer > @aaronmorton > http://www.thelastpickle.com > > On 11/09/2012, at 2:05 AM, Henrik Schr=F6der wrote: > > Hi all, > > We're running a small Cassandra cluster (v1.0.10) serving data to our web > application, and as our traffic grows, we're starting to see some weird > issues. The biggest of these is that sometimes, a single node becomes > unresponsive. It's impossible to start new connections, or impossible to > send requests, or it just doesn't return anything when you've sent a > request. Our client library is set to retry on another server when this > happens, and what we see then is that the request is usually served > instantly. So it's not the case that some requests are very difficult, it= 's > that sometimes a node is just "busy", and we have no idea why or what it'= s > doing. > > We're using MRTG and Monit to monitor the servers, and in MRTG the averag= e > CPU usage is around 5%, on our quad-core Xeon servers with SSDs. But > occassionally through Monit we can see that the 1-min load average goes > above 4, and this usually corresponds to the above issues. Is this common= ? > Does this happen to everyone else? And why the spikiness in load? We can'= t > find anything in the cassandra logs indicating that something's up (such = as > a slow GC or compaction), and there's no corresponding traffic spike in t= he > application either. Should we just add more nodes if any single one gets > CPU spikes? > > Another explanation could also be that we've configured it wrong. We're > running pretty much default config. Each node has 16G of RAM, 4GB of heap= , > no row-cache and an sizeable key-cache. Despite that, we occasionally get > OOM exceptions, and nodes crashing, maybe a few times per month. Should w= e > increase heap size? Or move to 1.1 and enable off-heap caching? > > We have quite a lot of traffic to the cluster. A single keyspace with two > column families, RF=3D3, compression is enabled, and we're running the > default size-tiered compaction. > Column family A has 60GB of actual data, each row has a single column, an= d > that column holds binary data that varies in size up to 500kB. When we > update a row, we write a new value to this single column, effectively > replacing that entire row. We do ~1000 updates/s, totalling ~10MB/s in > writes. > Column family B also has 60GB of actual data, but each row has a variable > (~100-10000) number of supercolumns, and each supercolumn has a fixed > number of columns with a fixed amount of data, totalling ~1kB. The > operations we are doing on this column family is that we add supercolumns > to rows with a rate of ~200/s, and occasionally we do bulk deletion of > supercolumns in a row. > > The config options we are unsure about are things like commit log sizes, > memtable flushing thresholds, commit log syncing, compaction throughput, > etc. Are we at a point with our data size and write load that the default= s > aren't good enough anymore? Should we stick with size-tiered compaction, > even though our application is update-heavy? > > > Many thanks, > /Henrik > > > --20cf3011df95ed29c904c9bf1e23 Content-Type: text/html; charset=windows-1252 Content-Transfer-Encoding: quoted-printable I started digging in the logfiles, and here's the full log of a crash. = We have 4GB of heap, and if you watch either OpsCenter or through a JMX con= sole, used heap size is always below 3GB, and it swings between 1.5 to 3GB = ever 5 minutes or so. So why would it suddenly run out of heap space?


=A0INFO 14:08:50,078 Completed flushing /data-ssd/cassandra/data/sy= stem/HintsColumnFamily-hd-10-Data.db (5328100 bytes)
=A0INFO 14:08:50,07= 9 4103 MUTATION messages dropped in last 5000ms
=A0INFO 14:08:50,080 Poo= l Name=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 Active=A0= =A0 Pending=A0=A0 Blocked
=A0INFO 14:08:50,086 Writing Memtable-A@1734568216(117729674/147162092 seri= alized/live bytes, 12097 ops)
=A0INFO 14:08:50,089 ReadStage=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 32=A0=A0=A0=A0=A0= =A0 363=A0=A0=A0=A0=A0=A0=A0=A0 0
=A0INFO 14:08:50,089 RequestResponseSt= age=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0= =A0=A0=A0=A0=A0=A0=A0 0
=A0INFO 14:08:50,089 ReadRepairStage=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0 1=A0=A0=A0=A0=A0=A0=A0=A0 1=A0=A0=A0=A0=A0=A0=A0=A0 0
= =A0INFO 14:08:50,089 MutationStage=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0 32=A0=A0=A0=A0=A0 5672=A0=A0=A0=A0=A0=A0=A0=A0 0
=A0I= NFO 14:08:50,090 ReplicateOnWriteStage=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 = 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0
=A0INFO 14:08:50,090 GossipStage=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0= =A0 0
=A0INFO 14:08:50,090 AntiEntropyStage=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0= =A0 0
=A0INFO 14:08:50,090 MigrationStage=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0= =A0=A0 0
=A0INFO 14:08:50,090 StreamStage=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0= =A0 0
=A0INFO 14:08:50,091 MemtablePostFlusher=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0 1=A0=A0=A0=A0=A0=A0=A0=A0 6=A0=A0=A0=A0=A0=A0=A0=A0 0=A0INFO 14:08:50,091 FlushWriter=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0 1=A0=A0=A0=A0=A0=A0=A0=A0 5=A0=A0=A0=A0=A0=A0=A0= =A0 0
=A0INFO 14:08:50,091 MiscStage=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0= =A0=A0 0
=A0INFO 14:08:50,091 InternalResponseStage=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0
= =A0INFO 14:08:50,094 Compacting [SSTableReader(path=3D'/data-ssd/cassan= dra/data/system/HintsColumnFamily-hd-9-Data.db'), SSTableReader(path=3D= '/data-ssd/cassandra/data/system/HintsColumnFamily-hd-10-Data.db'),= SSTableReader(path=3D'/data-ssd/cassandra/data/system/HintsColumnFamil= y-hd-8-Data.db'), SSTableReader(path=3D'/data-ssd/cassandra/data/sy= stem/HintsColumnFamily-hd-7-Data.db')]
=A0INFO 14:08:50,098 HintedHandoff=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=
=A0INFO 14:08:50,098 CompactionManager=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 2
=A0INFO 14:08:50,099 Messagi= ngService=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 n/a=A0=A0=A0=A0=A0= =A0 0,1
=A0INFO 14:08:50,099 ColumnFamily=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0 Memtable ops,data=A0 Row cache size/cap=A0 Key cache size/c= ap
=A0INFO 14:08:50,099 system.NodeIdInfo=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0
= =A0INFO 14:08:50,099 system.IndexInfo=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0<= br>=A0INFO 14:08:50,099 system.LocationInfo=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0
=A0INFO 14:08:50,100 system.Versions=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0<= br>=A0INFO 14:08:50,107 system.Migrations=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0<= br>=A0INFO 14:08:50,107 system.HintsColumnFamily=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0
=A0INFO 14:08:50,107 system.Schema=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0 0/0
=A0INFO 14:08:50,108 K.A=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 920,12311092=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0 0/0
=A0INFO 14:08:50,127 K.B=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 111,1005040= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0 6000/6000
=A0INFO 14:08:50,127 K.C=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 13,23283=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0 211817/211817
=A0IN= FO 14:08:50,127 OpsCenter.rollups300=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 4/4
=A0INFO 14:0= 8:50,128 OpsCenter.rollups60=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 1876,14= 52889=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0 30/30
=A0INFO 14:08:50,128 OpsCenter.rollups7200=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 30/30
=A0INFO 14:08= :50,137 Enqueuing flush of Memtable-rollups60@33961585(43148/1452889 serial= ized/live bytes, 1876 ops)
=A0INFO 14:08:50,160 OpsCenter.rollups86400=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0
=A0INFO 1= 4:08:50,169 OpsCenter.events=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 13/13
=A0INFO 14= :08:50,174 OpsCenter.events_timeline=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0
=A0INFO 14:08:50,175 OpsCenter.settings=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0
=A0= INFO 14:08:50,175 OpsCenter.pdps=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0 470,103983=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0
=A0INFO 14:0= 9:09,860 GC for ConcurrentMarkSweep: 2560 ms for 6 collections, 3669729184 = used; max is 4118806528
=A0WARN 14:09:09,860 Heap is 0.8909690608317886 full.=A0 You may need to re= duce memtable and/or cache sizes.=A0 Cassandra will now flush up to the two= largest memtables to free up memory.=A0 Adjust flush_largest_memtables_at = threshold in cassandra.yaml if you don't want Cassandra to do this auto= matically
=A0WARN 14:09:09,860 Flushing CFS(Keyspace=3D'K', ColumnFamily=3D&#= 39;A') to relieve memory pressure
java.lang.OutOfMemoryError: Java h= eap space
Dumping heap to /var/lib/cassandra/java_1346605112.hprof ... Heap dump file created [5553296481 bytes in 67.950 secs]
=A0INFO 14:11:5= 4,810 InetAddress /10.100.1.116 is now = dead.
=A0INFO 14:11:54,813 InetAddress /= 10.100.1.117 is now dead.
=A0INFO 14:11:54,814 InetAddress /10.100.1.= 118 is now dead.
ERROR 14:11:54,830 Fatal exception in thread Thread= [CompactionExecutor:565,1,main]
java.lang.OutOfMemoryError: Java heap sp= ace
=A0=A0=A0=A0=A0=A0=A0 at org.apache.cassandra.io.util.FastByteArrayOutputSt= ream.expand(FastByteArrayOutputStream.java:104)
=A0=A0=A0=A0=A0=A0=A0 at= org.apache.cassandra.io.util.FastByteArrayOutputStream.write(FastByteArray= OutputStream.java:220)
=A0=A0=A0=A0=A0=A0=A0 at java.io.DataOutputStream.write(Unknown Source)
= =A0=A0=A0=A0=A0=A0=A0 at org.apache.cassandra.utils.ByteBufferUtil.write(By= teBufferUtil.java:323)
=A0=A0=A0=A0=A0=A0=A0 at org.apache.cassandra.uti= ls.ByteBufferUtil.writeWithLength(ByteBufferUtil.java:316)
=A0=A0=A0=A0=A0=A0=A0 at org.apache.cassandra.db.ColumnSerializer.serialize= (ColumnSerializer.java:63)
=A0=A0=A0=A0=A0=A0=A0 at org.apache.cassandra= .db.SuperColumnSerializer.serialize(SuperColumn.java:361)
=A0=A0=A0=A0= =A0=A0=A0 at org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperC= olumn.java:334)
=A0=A0=A0=A0=A0=A0=A0 at org.apache.cassandra.db.ColumnFamilySerializer.ser= ializeForSSTable(ColumnFamilySerializer.java:89)
=A0=A0=A0=A0=A0=A0=A0 a= t org.apache.cassandra.db.compaction.PrecompactedRow.write(PrecompactedRow.= java:136)
=A0=A0=A0=A0=A0=A0=A0 at org.apache.cassandra.io.sstable.SSTab= leWriter.append(SSTableWriter.java:160)
=A0=A0=A0=A0=A0=A0=A0 at org.apache.cassandra.db.compaction.CompactionTask.= execute(CompactionTask.java:158)
=A0=A0=A0=A0=A0=A0=A0 at org.apache.cas= sandra.db.compaction.CompactionManager$1.call(CompactionManager.java:135)=A0=A0=A0=A0=A0=A0=A0 at org.apache.cassandra.db.compaction.CompactionMan= ager$1.call(CompactionManager.java:115)
=A0=A0=A0=A0=A0=A0=A0 at java.util.concurrent.FutureTask$Sync.innerRun(Unkn= own Source)
=A0=A0=A0=A0=A0=A0=A0 at java.util.concurrent.FutureTask.run= (Unknown Source)
=A0=A0=A0=A0=A0=A0=A0 at java.util.concurrent.ThreadPoo= lExecutor$Worker.runTask(Unknown Source)
=A0=A0=A0=A0=A0=A0=A0 at java.util.concurrent.ThreadPoolExecutor$Worker.run= (Unknown Source)
=A0=A0=A0=A0=A0=A0=A0 at java.lang.Thread.run(Unknown S= ource)
=A0INFO 14:11:54,839 InetAddress /10.100.1.116 is now UP
=A0INFO 14:11:54,849 Stop listening to thrift clients
=A0INFO 14:11:54,8= 50 InetAddress /10.100.1.117 is now UP<= br>=A0INFO 14:11:54,850 InetAddress /10.100= .1.118 is now UP
=A0INFO 14:11:54,862 Waiting for messaging service to quiesce
=A0INFO 14= :12:06,909 MessagingService shutting down server thread.
ERROR 14:12:06,= 912 Fatal exception in thread Thread[Thread-117,5,main]
java.util.concur= rent.RejectedExecutionException: ThreadPoolExecutor has shut down
=A0=A0=A0=A0=A0=A0=A0 at org.apache.cassandra.concurrent.DebuggableThreadPo= olExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
=A0= =A0=A0=A0=A0=A0=A0 at java.util.concurrent.ThreadPoolExecutor.reject(Unknow= n Source)
=A0=A0=A0=A0=A0=A0=A0 at java.util.concurrent.ThreadPoolExecut= or.execute(Unknown Source)
=A0=A0=A0=A0=A0=A0=A0 at org.apache.cassandra.net.MessagingService.receive(= MessagingService.java:515)
=A0=A0=A0=A0=A0=A0=A0 at org.apache.cassandra= .net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:157)=A0=A0=A0=A0=A0=A0=A0 at org.apache.cassandra.net.IncomingTcpConnection.r= un(IncomingTcpConnection.java:115)
ERROR 14:12:06,913 Fatal exception in thread Thread[Thread-116,5,main]
j= ava.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut= down
=A0=A0=A0=A0=A0=A0=A0 at org.apache.cassandra.concurrent.Debuggabl= eThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:6= 0)
=A0=A0=A0=A0=A0=A0=A0 at java.util.concurrent.ThreadPoolExecutor.reject(Unk= nown Source)
=A0=A0=A0=A0=A0=A0=A0 at java.util.concurrent.ThreadPoolExe= cutor.execute(Unknown Source)
=A0=A0=A0=A0=A0=A0=A0 at org.apache.cassan= dra.net.MessagingService.receive(MessagingService.java:515)
=A0=A0=A0=A0=A0=A0=A0 at org.apache.cassandra.net.IncomingTcpConnection.rec= eiveMessage(IncomingTcpConnection.java:157)
=A0=A0=A0=A0=A0=A0=A0 at org= .apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:= 115)
ERROR 14:12:19,466 Fatal exception in thread Thread[Thread-118,5,ma= in]
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shu= t down
=A0=A0=A0=A0=A0=A0=A0 at org.apache.cassandra.concurrent.Debuggab= leThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:= 60)
=A0=A0=A0=A0=A0=A0=A0 at java.util.concurrent.ThreadPoolExecutor.rej= ect(Unknown Source)
=A0=A0=A0=A0=A0=A0=A0 at java.util.concurrent.ThreadPoolExecutor.execute(Un= known Source)
=A0=A0=A0=A0=A0=A0=A0 at org.apache.cassandra.net.Messagin= gService.receive(MessagingService.java:515)
=A0=A0=A0=A0=A0=A0=A0 at org= .apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConne= ction.java:157)
=A0=A0=A0=A0=A0=A0=A0 at org.apache.cassandra.net.IncomingTcpConnection.run= (IncomingTcpConnection.java:115)
=A0INFO 14:12:19,466 Started hinted han= doff for token: 1 with IP: /10.100.1.116
=A0INFO 14:13:33,343 Completed flushing /data-ssd/cassandra/data/K/A-h= d-279743-Data.db (43508009 bytes)
=A0INFO 14:13:33,346 flushing high-traffic column family CFS(Keyspace=3D= 9;OpsCenter', ColumnFamily=3D'rollups7200') (estimated 0 bytes)=
=A0INFO 14:13:33,350 flushing high-traffic column family CFS(Keyspace= =3D'OpsCenter', ColumnFamily=3D'rollups86400') (estimated 0= bytes)
=A0INFO 14:13:33,351 Writing Memtable-B@797498287(674046/53923680 serialize= d/live bytes, 1388162 ops)
=A0INFO 14:13:33,349 Compacting [SSTableReade= r(path=3D'/data-ssd/cassandra/data/K/A-hd-279741-Data.db'), SSTable= Reader(path=3D'/data-ssd/cassandra/data/K/A-hd-279235-Data.db'), SS= TableReader(path=3D'/data-ssd/cassandra/data/K/A-hd-279743-Data.db'= ), SSTableReader(path=3D'/data-ssd/cassandra/data/K/A-hd-279506-Data.db= ')]
=A0INFO 14:13:33,351 flushing high-traffic column family CFS(Keyspace=3D= 9;OpsCenter', ColumnFamily=3D'events') (estimated 0 bytes)
= =A0INFO 14:13:33,350 Enqueuing flush of Memtable-A@884602824(10126517/12658= 146 serialized/live bytes, 948 ops)
=A0INFO 14:13:33,369 flushing high-traffic column family CFS(Keyspace=3D= 9;OpsCenter', ColumnFamily=3D'events_timeline') (estimated 0 by= tes)
=A0INFO 14:13:33,369 flushing high-traffic column family CFS(Keyspa= ce=3D'OpsCenter', ColumnFamily=3D'settings') (estimated 0 b= ytes)
=A0INFO 14:13:33,369 flushing high-traffic column family CFS(Keyspace=3D= 9;OpsCenter', ColumnFamily=3D'pdps') (estimated 103983 bytes)=A0INFO 14:14:50,122 Completed flushing /data-ssd/cassandra/data/K/B-hd-1= 4662-Data.db (5340050 bytes)
=A0INFO 14:14:50,123 26225 MUTATION messages dropped in last 5000ms
=A0I= NFO 14:14:50,123 Writing Memtable-C@841365631(436486/8776427 serialized/liv= e bytes, 8752 ops)
=A0INFO 14:14:50,123 9 READ_REPAIR messages dropped i= n last 5000ms
=A0INFO 14:14:50,131 1519 READ messages dropped in last 5000ms
=A0INFO 1= 4:14:50,146 Enqueuing flush of Memtable-pdps@1698224970(39228/103983 serial= ized/live bytes, 470 ops)
=A0INFO 14:14:50,167 Pool Name=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 Active=A0=A0 Pending=A0=A0 Block= ed
=A0INFO 14:14:50,169 ReadStage=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0 20=A0=A0=A0=A0=A0=A0=A0 20=A0=A0=A0=A0=A0=A0=A0=A0= 0
=A0INFO 14:14:50,169 RequestResponseStage=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0
=A0IN= FO 14:14:50,170 ReadRepairStage=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0 1=A0=A0=A0=A0=A0=A0=A0=A0 1=A0=A0=A0=A0=A0=A0=A0=A0 0
=A0INFO 14:14:50,170 ReplicateOnWriteStage=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0
=A0INFO 14:14:= 50,171 GossipStage=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0
=A0INFO = 14:14:50,171 AntiEntropyStage=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0
=A0INFO 14:14:50,171 MigrationStage=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0INFO 14:14:50,172 StreamStage=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0= =A0 0
=A0INFO 14:14:50,172 MemtablePostFlusher=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0 1=A0=A0=A0=A0=A0=A0=A0=A0 5=A0=A0=A0=A0=A0=A0=A0=A0 0 =A0INFO 14:14:50,172 FlushWriter=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0 1=A0=A0=A0=A0=A0=A0=A0=A0 5=A0=A0=A0=A0=A0=A0=A0= =A0 1
=A0INFO 14:14:50,173 MiscStage=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0= =A0=A0=A0=A0=A0 0
=A0INFO 14:14:50,173 InternalResponseStage=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0= =A0 0
=A0INFO 14:14:50,174 HintedHandoff=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0 1=A0=A0=A0=A0=A0=A0=A0=A0 3=A0=A0=A0=A0=A0=A0=A0=A0 0=
=A0INFO 14:14:50,174 CompactionManager=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0 1=A0=A0=A0=A0=A0=A0=A0=A0 4
=A0INFO 14:14:50,174 Messagi= ngService=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 n/a=A0=A0=A0=A0=A0= =A0 0,0
=A0INFO 14:14:50,174 ColumnFamily=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0 Memtable ops,data=A0 Row cache size/cap=A0 Key cache size/c= ap
=A0INFO 14:14:50,194 system.NodeIdInfo=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0
= =A0INFO 14:14:50,194 system.IndexInfo=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0<= br>=A0INFO 14:14:50,194 system.LocationInfo=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0
=A0INFO 14:14:50,195 system.Versions=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0<= br>=A0INFO 14:14:50,195 system.Migrations=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0<= br>=A0INFO 14:14:50,195 system.HintsColumnFamily=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0 13,964160=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0
=A0INFO 14:14:50,195 system.Schema=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0 0/0
=A0INFO 14:14:50,195 K.A=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0
=A0INFO 14:14:50,195 K.B=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0 115,1037040=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0 6000/6000
=A0INFO 14:14:50,196 K.C=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 13,23283=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0=A0=A0=A0=A0=A0=A0 211817/211817
=A0IN= FO 14:14:50,196 OpsCenter.rollups300=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 4/4
=A0INFO 14:1= 4:50,196 OpsCenter.rollups60=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0= /0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 30/30
=A0INFO 14:14:50,196 OpsCenter.rollups7200=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 30/30
=A0INFO 14:14= :50,196 OpsCenter.rollups86400=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0
=A0INFO 14:14:50,196 OpsC= enter.events=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 13/13
=A0INFO 14:14:50,197 OpsCenter.events_timeline=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0= /0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0
=A0INFO 14:14:50,= 197 OpsCenter.settings=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0
=A0INFO 14:14:50,19= 7 OpsCenter.pdps=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0 0,0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0 0/0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0/0
Service exit with a return value of 100



On Tue, Sep 11, 2012 at 3:57 AM, aaron morton <= aaron@thelastp= ickle.com> wrote:
It's impossible to start new connec= tions, or impossible to send requests, or it just doesn't return anythi= ng when you've sent a request.
If it's totally frozen it sounds like GC. How long does it f= reeze for?

Despit= e that, we occasionally get OOM exceptions, and nodes crashing, maybe a few= times per month.=A0
Do you have an error stack ?

We can't find anything in the cassandra logs ind= icating that something's up
Is it logging dropped mes= sages or high TP pending ? Are the freezes associated with compaction or re= pair running?

=A0and occa= sionally we do bulk deletion of supercolumns in a row.
mm= m, are you sending a batch mutation with lots-o-deletes ? Each row mutation= (insert or delete) in the batch becomes a thread pool tasks. If you send 1= ,000 rows in a batch you will temporarily prevent other requests from being= served.

The config options we are uns= ure about are things like commit log sizes,=A0=85.
I would try = to find some indication of what's going on before tweaking. Have you ch= ecked iostat ?

Hope that helps.=A0

<= div style=3D"word-wrap:break-word">
-----------------
Aaron Morton
Freelance Deve= loper
@aaronmorton

On 11/09/2012, at 2:05 AM, Henrik Schr=F6der <skrolle@gmail.com> wrote:=

Hi all,

We're running a smal= l Cassandra cluster (v1.0.10) serving data to our web application, and as o= ur traffic grows, we're starting to see some weird issues. The biggest = of these is that sometimes, a single node becomes unresponsive. It's im= possible to start new connections, or impossible to send requests, or it ju= st doesn't return anything when you've sent a request. Our client l= ibrary is set to retry on another server when this happens, and what we see= then is that the request is usually served instantly. So it's not the = case that some requests are very difficult, it's that sometimes a node = is just "busy", and we have no idea why or what it's doing.
We're using MRTG and Monit to monitor the servers, and in MRTG the = average CPU usage is around 5%, on our quad-core Xeon servers with SSDs. Bu= t occassionally through Monit we can see that the 1-min load average goes a= bove 4, and this usually corresponds to the above issues. Is this common? D= oes this happen to everyone else? And why the spikiness in load? We can'= ;t find anything in the cassandra logs indicating that something's up (= such as a slow GC or compaction), and there's no corresponding traffic = spike in the application either. Should we just add more nodes if any singl= e one gets CPU spikes?

Another explanation could also be that we've configured it wrong. W= e're running pretty much default config. Each node has 16G of RAM, 4GB = of heap, no row-cache and an sizeable key-cache. Despite that, we occasiona= lly get OOM exceptions, and nodes crashing, maybe a few times per month. Sh= ould we increase heap size? Or move to 1.1 and enable off-heap caching?

We have quite a lot of traffic to the cluster. A single keyspace with t= wo column families, RF=3D3, compression is enabled, and we're running t= he default size-tiered compaction.
Column family A has 60GB of actual da= ta, each row has a single column, and that column holds binary data that va= ries in size up to 500kB. When we update a row, we write a new value to thi= s single column, effectively replacing that entire row. We do ~1000 updates= /s, totalling ~10MB/s in writes.
Column family B also has 60GB of actual data, but each row has a variable (= ~100-10000) number of supercolumns, and each supercolumn has a fixed number= of columns with a fixed amount of data, totalling ~1kB. The operations we = are doing on this column family is that we add supercolumns to rows with a = rate of ~200/s, and occasionally we do bulk deletion of supercolumns in a r= ow.

The config options we are unsure about are things like commit log sizes= , memtable flushing thresholds, commit log syncing, compaction throughput, = etc. Are we at a point with our data size and write load that the defaults = aren't good enough anymore? Should we stick with size-tiered compaction= , even though our application is update-heavy?


Many thanks,
/Henrik

--20cf3011df95ed29c904c9bf1e23--