Subject: Re: Cassandra compaction stuck? Should I disable?
From: "PenguinWhispererThe ." <th3penguinwhisperer@gmail.com>
Date: Wed, 2 Dec 2015 11:21:07 +0100
To: Sebastian Estevez
Cc: Robert Coli, user@cassandra.apache.org

So it seems I found the problem.

The node opening a stream is waiting for the other node to respond, but that node never responds due to a broken pipe, which makes Cassandra wait forever.
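A rough way to see this from the outside, sketched here with placeholder host names (point nodetool at the nodes involved in the stream):

# Active stream sessions and their progress; counters that never move between
# runs are a good hint that a stream is hanging.
nodetool -h node1.example.com netstats

# Check the peer named in that session as well; if it no longer lists the
# session, its end of the connection has most likely died (broken pipe).
nodetool -h node2.example.com netstats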
It's basically this issue: https://issues.apache.org/jira/browse/CASSANDRA-8472
And this is the workaround/fix: https://issues.apache.org/jira/browse/CASSANDRA-8611

So:
- update Cassandra to >= 2.0.11
- add the option streaming_socket_timeout_in_ms = 10000
- do a rolling restart of Cassandra
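A minimal sketch of that workaround, assuming a package install with the config at /etc/cassandra/cassandra.yaml and an init service named "cassandra" (both are assumptions; adjust for your layout), applied one node at a time:

# /etc/cassandra/cassandra.yaml -- give streaming sockets a timeout so a dead
# peer can no longer hang a stream forever:
#   streaming_socket_timeout_in_ms: 10000

nodetool drain                     # flush memtables and stop accepting traffic
sudo service cassandra restart     # pick up the new setting
nodetool status                    # wait until all nodes report UN before moving on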
What's weird is that the IOException: Broken pipe is never shown in my logs (not on any node), and my logging is set to INFO in the log4j config.
I have this config in log4j-server.properties:

# output messages into a rolling log file as well as stdout
log4j.rootLogger=INFO,stdout,R

# stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%5p %d{HH:mm:ss,SSS} %m%n

# rolling log file
log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.maxFileSize=20MB
log4j.appender.R.maxBackupIndex=50
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%5p [%t] %d{ISO8601} %F (line %L) %m%n
# Edit the next line to point to your logs directory
log4j.appender.R.File=/var/log/cassandra/system.log

# Application logging options
#log4j.logger.org.apache.cassandra=DEBUG
#log4j.logger.org.apache.cassandra.db=DEBUG
#log4j.logger.org.apache.cassandra.service.StorageProxy=DEBUG

# Adding this to avoid thrift logging disconnect errors.
log4j.logger.org.apache.thrift.server.TNonblockingServer=ERROR

Too bad nobody else could point me to those. Hopefully this saves someone else from wasting a lot of time.

2015-11-11 15:42 GMT+01:00 Sebastian Estevez <sebastian.estevez@datastax.com>:

> Use 'nodetool compactionhistory'
>
> All the best,
>
> Sebastián
>
> On Nov 11, 2015 3:23 AM, "PenguinWhispererThe ." <th3penguinwhisperer@gmail.com> wrote:
>
>> Does compactionstats show only stats for completed compactions (100%)?
>> It might be that the compaction is running constantly, over and over again.
>> In that case I need to know what I can do to stop this constant
>> compaction so I can start a nodetool repair.
>>
>> Note that there is a lot of traffic on this columnfamily, so I'm not sure
>> if temporarily disabling compaction is an option. The repair will probably
>> take long as well.
>>
>> Sebastian and Rob: do you have any more ideas about the things I put in
>> this thread? Any help is appreciated!
>>
>> 2015-11-10 20:03 GMT+01:00 PenguinWhispererThe . <th3penguinwhisperer@gmail.com>:
>>
>>> Hi Sebastian,
>>>
>>> Thanks for your response.
>>>
>>> No swap is used. No offense, I just don't see a reason why having swap
>>> would be the issue here. I put swappiness at 1. I also have JNA installed.
>>> That should prevent Java from being swapped out as well, AFAIK.
>>>
>>> 2015-11-10 19:50 GMT+01:00 Sebastian Estevez <sebastian.estevez@datastax.com>:
>>>
>>>> Turn off swap.
>>>>
>>>> http://docs.datastax.com/en/cassandra/2.1/cassandra/install/installRecommendSettings.html?scroll=reference_ds_sxl_gf3_2k__disable-swap
>>>>
>>>> All the best,
>>>>
>>>> Sebastián Estévez
>>>> Solutions Architect | 954 905 8615 | sebastian.estevez@datastax.com
>>>>
>>>> DataStax is the fastest, most scalable distributed database technology,
>>>> delivering Apache Cassandra to the world's most innovative enterprises.
>>>> DataStax is built to be agile, always-on, and predictably scalable to any
>>>> size. With more than 500 customers in 45 countries, DataStax is the
>>>> database technology and transactional backbone of choice for the world's
>>>> most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>>>>
>>>> On Tue, Nov 10, 2015 at 1:48 PM, PenguinWhispererThe . <th3penguinwhisperer@gmail.com> wrote:
>>>>
>>>>> I also have the following memory usage:
>>>>> [root@US-BILLINGDSX4 cassandra]# free -m
>>>>>              total       used       free     shared    buffers     cached
>>>>> Mem:         12024       9455       2569          0        110       2163
>>>>> -/+ buffers/cache:       7180       4844
>>>>> Swap:         2047          0       2047
>>>>>
>>>>> Still a lot free and a lot of free buffers/cache.
>>>>>
>>>>> 2015-11-10 19:45 GMT+01:00 PenguinWhispererThe . <th3penguinwhisperer@gmail.com>:
>>>>>
>>>>>> Still stuck with this. However, I enabled GC logging, which shows the
>>>>>> following:
>>>>>>
>>>>>> [root@myhost cassandra]# tail -f gc-1447180680.log
>>>>>> 2015-11-10T18:41:45.516+0000: 225.428: [GC 2721842K->2066508K(6209536K), 0.0199040 secs]
>>>>>> 2015-11-10T18:41:45.977+0000: 225.889: [GC 2721868K->2066511K(6209536K), 0.0221910 secs]
>>>>>> 2015-11-10T18:41:46.437+0000: 226.349: [GC 2721871K->2066524K(6209536K), 0.0222140 secs]
>>>>>> 2015-11-10T18:41:46.897+0000: 226.809: [GC 2721884K->2066539K(6209536K), 0.0224140 secs]
>>>>>> 2015-11-10T18:41:47.359+0000: 227.271: [GC 2721899K->2066538K(6209536K), 0.0302520 secs]
>>>>>> 2015-11-10T18:41:47.821+0000: 227.733: [GC 2721898K->2066557K(6209536K), 0.0280530 secs]
>>>>>> 2015-11-10T18:41:48.293+0000: 228.205: [GC 2721917K->2066571K(6209536K), 0.0218000 secs]
>>>>>> 2015-11-10T18:41:48.790+0000: 228.702: [GC 2721931K->2066780K(6209536K), 0.0292470 secs]
>>>>>> 2015-11-10T18:41:49.290+0000: 229.202: [GC 2722140K->2066843K(6209536K), 0.0288740 secs]
>>>>>> 2015-11-10T18:41:49.756+0000: 229.668: [GC 2722203K->2066818K(6209536K), 0.0283380 secs]
>>>>>> 2015-11-10T18:41:50.249+0000: 230.161: [GC 2722178K->2067158K(6209536K), 0.0218690 secs]
>>>>>> 2015-11-10T18:41:50.713+0000: 230.625: [GC 2722518K->2067236K(6209536K), 0.0278810 secs]
>>>>>>
>>>>>> This is a VM with 12GB of RAM. I raised HEAP_SIZE to 6GB and
>>>>>> HEAP_NEWSIZE to 800MB.
>>>>>>
>>>>>> Still the same result.
>>>>>>
>>>>>> This looks very similar to the following issue:
>>>>>> http://mail-archives.apache.org/mod_mbox/cassandra-user/201411.mbox/%3CCAJ=3xgRLsvpnZe0uXEYjG94rKhfXeU+jBR=Q3A-_C3rsdD5kug@mail.gmail.com%3E
>>>>>>
>>>>>> Is the only possibility to upgrade memory? I mean, I can't believe
>>>>>> it's just loading all its data into memory. That would mean having to
>>>>>> keep scaling up the node to keep it working?
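As a rough way to tell whether a node like this is GC-bound rather than busy compacting, assuming a JDK (jstat/jstack) is installed and that the daemon shows up in the process list as CassandraDaemon (both assumptions):

CASSANDRA_PID=$(pgrep -f CassandraDaemon | head -n 1)

# Heap occupancy and GC time once per second; steadily climbing FGC/FGCT or an
# old generation pinned near 100% points at GC pressure rather than compaction.
jstat -gcutil "$CASSANDRA_PID" 1000

# Confirm the heap sizes the JVM actually started with (-Xms/-Xmx/-Xmn).
ps -p "$CASSANDRA_PID" -o args= | tr ' ' '\n' | grep -E '^-Xm[sxn]'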
>>>>>> 2015-11-10 9:36 GMT+01:00 PenguinWhispererThe . <th3penguinwhisperer@gmail.com>:
>>>>>>
>>>>>>> Correction...
>>>>>>> I was grepping for Segmentation in the strace and it happens a lot.
>>>>>>>
>>>>>>> Do I need to run a scrub?
>>>>>>>
>>>>>>> 2015-11-10 9:30 GMT+01:00 PenguinWhispererThe . <th3penguinwhisperer@gmail.com>:
>>>>>>>
>>>>>>>> Hi Rob,
>>>>>>>>
>>>>>>>> Thanks for your reply.
>>>>>>>>
>>>>>>>> 2015-11-09 23:17 GMT+01:00 Robert Coli <rcoli@eventbrite.com>:
>>>>>>>>
>>>>>>>>> On Mon, Nov 9, 2015 at 1:29 PM, PenguinWhispererThe . <th3penguinwhisperer@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> In Opscenter I see that one of the nodes is orange. It seems like it's
>>>>>>>>>> working on compaction. I used nodetool compactionstats, and whenever I did
>>>>>>>>>> this the Completed count and percentage stayed the same (even with hours
>>>>>>>>>> in between).
>>>>>>>>>>
>>>>>>>>> Are you the same person from IRC, or a second report today of
>>>>>>>>> compaction hanging in this way?
>>>>>>>>>
>>>>>>>> Same person ;) I just didn't have enough to work with from the chat
>>>>>>>> there. I want to understand the issue more and see what I can tune or fix.
>>>>>>>> I want to do a nodetool repair before upgrading to 2.1.11, but the
>>>>>>>> compaction is blocking it.
>>>>>>>>
>>>>>>>>> What version of Cassandra?
>>>>>>>>
>>>>>>>> 2.0.9
>>>>>>>>
>>>>>>>>>> I currently don't see CPU load from Cassandra on that node, so it
>>>>>>>>>> seems stuck (somewhere mid 60%). Also, some other nodes have compaction on
>>>>>>>>>> the same columnfamily. I don't see any progress.
>>>>>>>>>>
>>>>>>>>>> WARN [RMI TCP Connection(554)-192.168.0.68] 2015-11-09 17:18:13,677 ColumnFamilyStore.java (line 2101) Unable to cancel in-progress compactions for usage_record_ptd. Probably there is an unusually large row in progress somewhere. It is also possible that buggy code left some sstables compacting after it was done with them
>>>>>>>>>>
>>>>>>>>>> - How can I confirm that nothing is happening?
>>>>>>>>>>
>>>>>>>>> Find the thread that is doing compaction and strace it. Generally
>>>>>>>>> it is one of the threads with a lower thread priority.
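A hedged sketch of that approach; the thread ID 61404 is simply the one reported further down in this thread, and the pgrep pattern is an assumption:

CASSANDRA_PID=$(pgrep -f CassandraDaemon | head -n 1)

# Per-thread view of the Cassandra process; CompactionExecutor threads usually
# run at a lower priority (higher nice value) and stand out when busy.
top -H -p "$CASSANDRA_PID"

# Map the busy native thread to its Java stack: jstack reports the thread ID
# as a hexadecimal "nid".
jstack "$CASSANDRA_PID" | grep -A 20 "$(printf 'nid=0x%x' 61404)"

# Attach strace to just that thread to see which syscalls it is making.
strace -p 61404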
>>>>>>>> I have 141 threads. Not sure if that's normal.
>>>>>>>>
>>>>>>>> This seems to be the one:
>>>>>>>> 61404 cassandr  24   4 8948m 4.3g 820m R 90.2 36.8 292:54.47 java
>>>>>>>>
>>>>>>>> In the strace I basically see this part repeating (with the occasional
>>>>>>>> "resource temporarily unavailable"):
>>>>>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>>>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>>>>>>>> getpriority(PRIO_PROCESS, 61404)        = 16
>>>>>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>>>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>>>>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494045, NULL) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>>>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>>>>>>>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494047, NULL) = 0
>>>>>>>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>>>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>>>>>>>> getpriority(PRIO_PROCESS, 61404)        = 16
>>>>>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>>>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>>>>>>>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494049, NULL) = 0
>>>>>>>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>>>>> getpriority(PRIO_PROCESS, 61404)        = 16
>>>>>>>>
>>>>>>>> But wait!
>>>>>>>> I also see this:
>>>>>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>>>>>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494055, NULL) = 0
>>>>>>>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>>>>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>>>>>>>>
>>>>>>>> This doesn't seem to happen that often, though.
>>>>>>>>
>>>>>>>>> Compaction often appears hung when decompressing a very large row,
>>>>>>>>> but usually not for "hours".
>>>>>>>>>
>>>>>>>>>> - Is it recommended to disable compaction above a certain data
>>>>>>>>>> size? (I believe 25GB on each node.)
>>>>>>>>>>
>>>>>>>>> It is almost never recommended to disable compaction.
>>>>>>>>>
>>>>>>>>>> - Can I stop this compaction? nodetool stop compaction
>>>>>>>>>> doesn't seem to work.
>>>>>>>>>>
>>>>>>>>> Killing the JVM ("the dungeon collapses!") would certainly stop
>>>>>>>>> it, but it'd likely just start again when you restart the node.
>>>>>>>>>
>>>>>>>>>> - Is stopping the compaction dangerous?
>>>>>>>>>>
>>>>>>>>> Not if you're on a version that properly cleans up partial
>>>>>>>>> compactions, which is most of them.
>>>>>>>>>
>>>>>>>>>> - Is killing the Cassandra process dangerous while compacting
>>>>>>>>>> (I did nodetool drain on one node)?
>>>>>>>>>>
>>>>>>>>> No. But nodetool drain probably couldn't actually stop the
>>>>>>>>> in-progress compaction either, FWIW.
>>>>>>>>>
>>>>>>>>>> This is the output of nodetool compactionstats grepped for the
>>>>>>>>>> keyspace that seems stuck.
>>>>>>>>>>
>>>>>>>>> Do you have gigantic rows in that keyspace? What does cfstats say
>>>>>>>>> about the largest row compaction has seen / do you have log messages
>>>>>>>>> about compacting large rows?
>>>>>>>>>
>>>>>>>> I don't know about the gigantic rows. How can I check?
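A hedged way to check for oversized rows/partitions, using the table named in this thread; cfstats and cfhistograms are the 2.0/2.1-era command names, and the log path is an assumption:

# Largest partition that compaction has seen for the suspect table.
nodetool cfstats billing.usage_record_ptd | grep -i 'Compacted partition maximum'

# Partition size distribution for the same table.
nodetool cfhistograms billing usage_record_ptd

# Rows too big to compact in memory are logged explicitly; on 2.0.x the
# threshold is in_memory_compaction_limit_in_kb in cassandra.yaml.
grep 'Compacting large row' /var/log/cassandra/system.log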
>>>>>>>> I've checked the logs and found this:
>>>>>>>> INFO [CompactionExecutor:67] 2015-11-10 02:34:19,077 CompactionController.java (line 192) Compacting large row billing/usage_record_ptd:177727:2015-10-14 00\:00Z (243992466 bytes) incrementally
>>>>>>>> So this is from 6 hours ago.
>>>>>>>>
>>>>>>>> I also see a lot of messages like this:
>>>>>>>> INFO [OptionalTasks:1] 2015-11-10 06:36:06,395 MeteredFlusher.java (line 58) flushing high-traffic column family CFS(Keyspace='mykeyspace', ColumnFamily='mycolumnfamily') (estimated 100317609 bytes)
>>>>>>>> And (although it's unrelated, might this impact compaction performance?):
>>>>>>>> WARN [Native-Transport-Requests:10514] 2015-11-10 06:33:34,172 BatchStatement.java (line 223) Batch of prepared statements for [billing.usage_record_ptd] is of size 13834, exceeding specified threshold of 5120 by 8714.
>>>>>>>>
>>>>>>>> It's like the compaction is only doing one sstable at a time and is
>>>>>>>> doing nothing for a long time in between.
>>>>>>>>
>>>>>>>> cfstats for this keyspace and columnfamily gives the following:
>>>>>>>>                 Table: mycolumnfamily
>>>>>>>>                 SSTable count: 26
>>>>>>>>                 Space used (live), bytes: 319858991
>>>>>>>>                 Space used (total), bytes: 319860267
>>>>>>>>                 SSTable Compression Ratio: 0.24265700071674673
>>>>>>>>                 Number of keys (estimate): 6656
>>>>>>>>                 Memtable cell count: 22710
>>>>>>>>                 Memtable data size, bytes: 3310654
>>>>>>>>                 Memtable switch count: 31
>>>>>>>>                 Local read count: 0
>>>>>>>>                 Local read latency: 0.000 ms
>>>>>>>>                 Local write count: 997667
>>>>>>>>                 Local write latency: 0.000 ms
>>>>>>>>                 Pending tasks: 0
>>>>>>>>                 Bloom filter false positives: 0
>>>>>>>>                 Bloom filter false ratio: 0.00000
>>>>>>>>                 Bloom filter space used, bytes: 12760
>>>>>>>>                 Compacted partition minimum bytes: 1332
>>>>>>>>                 Compacted partition maximum bytes: 43388628
>>>>>>>>                 Compacted partition mean bytes: 234682
>>>>>>>>                 Average live cells per slice (last five minutes): 0.0
>>>>>>>>                 Average tombstones per slice (last five minutes): 0.0
>>>>>>>>
>>>>>>>>>> I also frequently see lines like this in system.log:
>>>>>>>>>>
>>>>>>>>>> WARN [Native-Transport-Requests:11935] 2015-11-09 20:10:41,886 BatchStatement.java (line 223) Batch of prepared statements for [billing.usage_record_by_billing_period, billing.metric] is of size 53086, exceeding specified threshold of 5120 by 47966.
>>>>>>>>>>
>>>>>>>>> Unrelated.
>>>>>>>>>
>>>>>>>>> =Rob
>>>>>>>>>
>>>>>>>> Can I upgrade to 2.1.11 without doing a nodetool repair, given that
>>>>>>>> the compaction is stuck?
>>>>>>>> Another thing to mention is that nodetool repair hasn't run yet.
>>>>>>>> Cassandra got installed, but nobody bothered to schedule the repair.
>>>>>>>>
>>>>>>>> Thanks for looking into this!