incubator-cassandra-user mailing list archives

From Nate McCall <n...@thelastpickle.com>
Subject Re: Nodes get stuck
Date Wed, 21 Aug 2013 13:07:37 GMT
Hit send before I saw your update. Yeah, turn gc grace way down. You can
turn your level size up a lot as well.
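For example, something like this (a sketch only, not tested here; it assumes
the users keyspace and global_user table from the cfstats output below, and
the 160 MB target size is just an illustrative value):

  ALTER TABLE users.global_user
    WITH gc_grace_seconds = 3600
    AND compaction = {'class': 'LeveledCompactionStrategy',
                      'sstable_size_in_mb': 160};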
 On Aug 21, 2013 7:55 AM, "Keith Wright" <kwright@nanigans.com> wrote:

> So the stack appears to be related to walking tombstones for a fetch.  Can
> you please give me your take on whether this is a plausible explanation:
>
>    - Given our data model, we can experience wide rows.  We protect
>    against these by randomly reading a portion of the row on write and, if
>    the size is beyond a certain threshold, deleting data.
>    - This has worked VERY well for some time now; however, perhaps we hit a
>    row that we deleted from and that now has many tombstones.  The row is
>    being requested frequently, so Cassandra is working very hard to process
>    through all of its tombstones (currently a number of nodes equal to the
>    RF are at high load, which again suggests this).
>
> Question is what to do about it?  This is an LCS table with gc grace
> seconds at 86400.  I assume my only options are to force a major compaction
> via nodetool compact or to run upgradesstables?  How can I validate that
> this is the cause?  How can I prevent it going forward?  Set the gc grace
> seconds to a much lower value for that table?
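>
> (For reference, a sketch of the commands in question, using the users
> keyspace and global_user table from the cfstats output below:
>
>   nodetool compact users global_user           # force a major compaction
>   nodetool upgradesstables users global_user   # rewrite SSTables in place
>   nodetool compactionstats                     # watch progress
> )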
>
> Thanks all!
>
> From: Keith Wright <kwright@nanigans.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Wednesday, August 21, 2013 8:31 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Nodes get stuck
>
> Thank you for responding.  I did a quick look, and my mutation stage
> threads are currently in TIMED_WAITING (as expected, since tpstats shows no
> active or pending tasks); however, most of my read stage threads are
> RUNNABLE with the stack traces below.  I haven't dug into them yet but
> thought I would put them out there to see if anyone had any ideas, since we
> are currently in a production-down state.
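>
> (One quick way to tally thread states from a dump, as a sketch; threads.txt
> is a hypothetical file holding jstack output:
>
>   grep -A 2 '"ReadStage' threads.txt | grep 'Thread.State' | sort | uniq -c
> )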
>
> Thanks all!
>
> Most have the first stack:
>
> java.nio.HeapByteBuffer.duplicate(HeapByteBuffer.java:107)
> org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:69)
> org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:31)
> java.util.TimSort.countRunAndMakeAscending(TimSort.java:329)
> java.util.TimSort.sort(TimSort.java:203)
> java.util.TimSort.sort(TimSort.java:173)
> java.util.Arrays.sort(Arrays.java:659)
> java.util.Collections.sort(Collections.java:217)
> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:255)
> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:281)
> org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
> org.apache.cassandra.utils.IntervalTree.build(IntervalTree.java:81)
> org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:175)
> org.apache.cassandra.db.AbstractThreadUnsafeSortedColumns.delete(AbstractThreadUnsafeSortedColumns.java:40)
> org.apache.cassandra.db.AbstractColumnContainer.delete(AbstractColumnContainer.java:51)
> org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:224)
> org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:182)
> org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
> org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:143)
> org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:122)
> org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:96)
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
> org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:157)
> org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
> org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
> org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:293)
> org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1357)
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1126)
> org.apache.cassandra.db.Table.getRow(Table.java:347)
> org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
> org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:44)
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:722)
>
> Name: ReadStage:1719
> State: RUNNABLE
> Total blocked: 1,005  Total waited: 913
>
> Stack trace:
>
> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:252)
> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
> org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
> org.apache.cassandra.utils.IntervalTree.build(IntervalTree.java:81)
> org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:175)
> org.apache.cassandra.db.AbstractThreadUnsafeSortedColumns.delete(AbstractThreadUnsafeSortedColumns.java:40)
> org.apache.cassandra.db.AbstractColumnContainer.delete(AbstractColumnContainer.java:51)
> org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:224)
> org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:182)
> org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
> org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:143)
> org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:122)
> org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:96)
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
> org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:157)
> org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
> org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
> org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:293)
> org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1357)
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1126)
> org.apache.cassandra.db.Table.getRow(Table.java:347)
> org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
> org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:44)
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:722)
>
> Name: ReadStage:1722
> State: RUNNABLE
> Total blocked: 1,001  Total waited: 897
>
> Stack trace:
> org.apache.cassandra.db.marshal.Int32Type.compare(Int32Type.java:58)
> org.apache.cassandra.db.marshal.Int32Type.compare(Int32Type.java:26)
> org.apache.cassandra.db.marshal.AbstractType.compareCollectionMembers(AbstractType.java:229)
> org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:81)
> org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:31)
> java.util.TimSort.binarySort(TimSort.java:265)
> java.util.TimSort.sort(TimSort.java:208)
> java.util.TimSort.sort(TimSort.java:173)
> java.util.Arrays.sort(Arrays.java:659)
> java.util.Collections.sort(Collections.java:217)
> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:255)
> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:281)
> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:281)
> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
> org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
> org.apache.cassandra.utils.IntervalTree.build(IntervalTree.java:81)
> org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:175)
> org.apache.cassandra.db.AbstractThreadUnsafeSortedColumns.delete(AbstractThreadUnsafeSortedColumns.java:40)
> org.apache.cassandra.db.AbstractColumnContainer.delete(AbstractColumnContainer.java:51)
> org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:224)
> org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:182)
> org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
> org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:143)
> org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:122)
> org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:96)
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
> org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:157)
> org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
> org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
> org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:293)
> org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1357)
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1126)
> org.apache.cassandra.db.Table.getRow(Table.java:347)
> org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
> org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:44)
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:722)
>
>
> From: Sylvain Lebresne <sylvain@datastax.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Wednesday, August 21, 2013 6:21 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Nodes get stuck
>
> A thread dump on one of the machines that has suspiciously high CPU might
> help figure out what it is that is taking all that CPU.
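>
> (A minimal way to grab one, as a sketch; the pgrep pattern matches the
> CassandraDaemon main class, and the output path is arbitrary:
>
>   pid=$(pgrep -f CassandraDaemon)
>   jstack $pid > /tmp/cassandra-threads.txt
>   kill -3 $pid   # fallback; dumps threads to Cassandra's own stdout/log
> )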
>
>
> On Wed, Aug 21, 2013 at 8:57 AM, Keith Wright <kwright@nanigans.com> wrote:
>
>> Some last-minute info on this to hopefully enlighten.  We are doing ~200
>> reads and writes per second across our 7-node SSD cluster right now
>> (usually we can do closer to 20K reads at least) and seeing CPU load as
>> follows for the nodes (with some ParNew times to give an idea of GC):
>>
>> 001 – 1200%   (Par New at 120 ms / sec)
>> 002 – 6% (Par New at 0)
>> 003 – 600% (Par New at 45 ms / sec)
>> 004 – 900%
>> 005 – 500%
>> 006 – 10%
>> 007 – 130%
>>
>> There are no compactions running on 001; however, I did see a broken pipe
>> error in the logs there (see below).  Netstats for 001 shows nothing
>> pending.  It appears that all of the load/latency is related to one column
>> family.  You can see the cfstats & cfhistograms output below; note that we
>> are using LCS.  I have raised the odd cfhistograms behavior on this thread
>> before and am not sure what's going on there.  We are in a production-down
>> situation right now, so any help would be much appreciated!!!
>>
>> Column Family: global_user
>> SSTable count: 7546
>> SSTables in each level: [2, 10, 106/100, 453, 6975, 0, 0]
>> Space used (live): 83848742562
>> Space used (total): 83848742562
>> Number of Keys (estimate): 549792896
>> Memtable Columns Count: 526746
>> Memtable Data Size: 117408252
>> Memtable Switch Count: 0
>> Read Count: 11673
>> Read Latency: 1950.062 ms.
>> Write Count: 118588
>> Write Latency: 0.080 ms.
>> Pending Tasks: 0
>> Bloom Filter False Positives: 4322
>> Bloom Filter False Ratio: 0.84066
>> Bloom Filter Space Used: 383507440
>> Compacted row minimum size: 73
>> Compacted row maximum size: 2816159
>> Compacted row mean size: 324
>>
>> [kwright@lxpcas001 ~]$ nodetool cfhistograms users global_user
>> users/global_user histograms
>> Offset      SSTables  Write Latency  Read Latency    Row Size  Column Count
>> 1               8866              0             0           0          3420
>> 2               1001              0             0           0      99218975
>> 3               1249              0             0           0     319713048
>> 4               1074              0             0           0      25073893
>> 5                132              0             0           0      15359199
>> 6                  0              0             0           0      27794925
>> 7                  0             12             0           0       7954974
>> 8                  0             23             0           0       7733934
>> 10                 0            184             0           0      13276275
>> 12                 0            567             0           0       9077508
>> 14                 0           1098             0           0       5879292
>> 17                 0           2722             0           0       5693471
>> 20                 0           4379             0           0       3204131
>> 24                 0           8928             0           0       2614995
>> 29                 0          13525             0           0       1824584
>> 35                 0          16759             0           0       1265911
>> 42                 0          17048             0           0        868075
>> 50                 0          14162             5           0        596417
>> 60                 0          11806            15           0        467747
>> 72                 0           8569           108           0        354276
>> 86                 0           7042           276         227        269987
>> 103                0           5936           372        2972        218931
>> 124                0           4538           577         157        181360
>> 149                0           2981          1076     7388090        144298
>> 179                0           1929          1529    90535838        116628
>> 215                0           1081          1450   182701876         93378
>> 258                0            499          1125   141393480         74052
>> 310                0            124           756    18883224         58617
>> 372                0             31           460    24599272         45453
>> 446                0             25           247    23516772         34310
>> 535                0             10           146    13987584         26168
>> 642                0             20           194    12091458         19965
>> 770                0              8           196     9269197         14649
>> 924                0              9           340     8082898         11015
>> 1109               0              9           225     4762865          8058
>> 1331               0              9           154     3330110          5866
>> 1597               0              8           144     2367615          4275
>> 1916               0              1           188     1633608          3087
>> 2299               0              4           216     1139820          2196
>> 2759               0              5           201      819019          1456
>> 3311               0              4           194      600522          1135
>> 3973               0              6           181      454566           786
>> 4768               0             13           136      353886           587
>> 5722               0              6           152      280630           400
>> 6866               0              5            80      225545           254
>> 8239               0              6           112      183285           138
>> 9887               0              0            68      149820           109
>> 11864              0              5            99      121722            66
>> 14237              0             57            86       98352            50
>> 17084              0             18            99       79085            35
>> 20501              0              1            93       62423            11
>> 24601              0              0            61       49471             9
>> 29521              0              0            69       37395             5
>> 35425              0              4            56       28611             6
>> 42510              0              0            57       21876             1
>> 51012              0              9            60       16105             0
>> 61214              0              0            52       11996             0
>> 73457              0              0            50        8791             0
>> 88148              0              0            38        6430             0
>> 105778             0              0            25        4660             0
>> 126934             0              0            15        3308             0
>> 152321             0              0             2        2364             0
>> 182785             0              0             0        1631             0
>> 219342             0              0             0        1156             0
>> 263210             0              0             0         887             0
>> 315852             0              0             0         618             0
>> 379022             0              0             0         427             0
>> 454826             0              0             0         272             0
>> 545791             0              0             0         168             0
>> 654949             0              0             0         115             0
>> 785939             0              0             0          61             0
>> 943127             0              0             0          58             0
>> 1131752            0              0             0          34             0
>> 1358102            0              0             0          19             0
>> 1629722            0              0             0           9             0
>> 1955666            0              0             0           4             0
>> 2346799            0              0             0           5             0
>> 2816159            0              0             0           2             0
>> 3379391            0              0             0           0             0
>> 4055269            0              0             0           0             0
>> 4866323            0              0             0           0             0
>> 5839588            0              0             0           0             0
>> 7007506            0              0             0           0             0
>> 8409007            0              0             0           0             0
>> 10090808           0              0             0           0             0
>> 12108970           0              0             0           0             0
>> 14530764           0              0             0           0             0
>> 17436917           0              0             0           0             0
>> 20924300           0              0             0           0             0
>> 25109160           0              0             0           0             0
>>
>> ERROR [WRITE-/10.8.44.98] 2013-08-21 06:50:25,450 OutboundTcpConnection.java (line 197) error writing to /10.8.44.98
>> java.lang.RuntimeException: java.io.IOException: Broken pipe
>> at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:59)
>> at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:30)
>> at org.apache.cassandra.db.ColumnFamilySerializer.serialize(ColumnFamilySerializer.java:73)
>> at org.apache.cassandra.db.Row$RowSerializer.serialize(Row.java:62)
>> at org.apache.cassandra.db.ReadResponseSerializer.serialize(ReadResponse.java:78)
>> at org.apache.cassandra.db.ReadResponseSerializer.serialize(ReadResponse.java:69)
>> at org.apache.cassandra.net.MessageOut.serialize(MessageOut.java:131)
>> at org.apache.cassandra.net.OutboundTcpConnection.write(OutboundTcpConnection.java:221)
>> at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:186)
>> at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:144)
>> Caused by: java.io.IOException: Broken pipe
>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
>> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
>> at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
>> at java.nio.channels.Channels.writeFully(Channels.java:98)
>> at java.nio.channels.Channels.access$000(Channels.java:61)
>> at java.nio.channels.Channels$1.write(Channels.java:174)
>> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
>> at org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:297)
>> at org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:244)
>> at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:99)
>> at java.io.DataOutputStream.write(DataOutputStream.java:107)
>> at org.apache.cassandra.utils.ByteBufferUtil.write(ByteBufferUtil.java:328)
>> at org.apache.cassandra.utils.ByteBufferUtil.writeWithLength(ByteBufferUtil.java:315)
>> at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:55)
>> ... 9 more
>>
>>
>> From: Keith Wright <kwright@nanigans.com>
>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Date: Wednesday, August 21, 2013 2:35 AM
>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Subject: Re: Nodes get stuck
>>
>> Still looking for help!  We have stopped almost ALL traffic to the
>> cluster, and still some nodes are showing almost 1000% CPU for Cassandra
>> with no iostat activity.  We were running cleanup on one of the nodes that
>> was not showing load spikes; however, now when I attempt to stop cleanup
>> there via nodetool stop, the java task for stopping cleanup itself is at
>> 1500% and has not returned after 2 minutes.  This is VERY odd behavior.
>> Any ideas?  Hardware failure?  Network?  We are not seeing anything there
>> but wanted to get ideas.
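>>
>> (For reference, a sketch of the stop invocation; nodetool stop takes a
>> compaction type argument:
>>
>>   nodetool -h localhost stop CLEANUP
>> )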
>>
>> Thanks
>>
>> From: Keith Wright <kwright@nanigans.com>
>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Date: Tuesday, August 20, 2013 8:32 PM
>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Subject: Nodes get stuck
>>
>> Hi all,
>>
>>     We are using C* 1.2.4 with vnodes and SSDs.  We have seen behavior
>> recently where 3 of our nodes get locked up under high load in what
>> appears to be a GC spiral while the rest of the cluster (7 nodes total)
>> appears fine.  When I run tpstats, I see the following (assuming tpstats
>> returns at all), and top shows Cassandra pegged at 2000%.  Obviously we
>> have a large number of blocked reads.  In the past I could explain this
>> due to unexpectedly wide rows, however we have handled that.  When the
>> cluster starts to melt down like this, it's hard to get visibility into
>> what's going on and what triggered the issue, as everything starts to pile
>> on.  OpsCenter becomes unusable, and because the affected nodes are under
>> GC pressure, getting any data via nodetool or JMX is also difficult.  What
>> do people do to handle these situations?  We are going to start graphing
>> reads/writes/sec/CF to Ganglia in the hopes that it helps.
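>>
>> (Until that graphing is in place, a crude shell poll, as a sketch, using
>> the global_user column family from the cfstats output earlier in this
>> thread; the grep window size is a guess:
>>
>>   while sleep 10; do
>>     nodetool cfstats | grep -A 12 'Column Family: global_user' \
>>       | grep -E 'Read Count|Write Count'
>>   done
>> )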
>>
>> Thanks
>>
>> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
>> ReadStage                       256       381     1245117434         0                 0
>> RequestResponseStage              0         0     1161495947         0                 0
>> MutationStage                     8         8      481721887         0                 0
>> ReadRepairStage                   0         0       85770600         0                 0
>> ReplicateOnWriteStage             0         0       21896804         0                 0
>> GossipStage                       0         0        1546196         0                 0
>> AntiEntropyStage                  0         0           5009         0                 0
>> MigrationStage                    0         0           1082         0                 0
>> MemtablePostFlusher               0         0          10178         0                 0
>> FlushWriter                       0         0           6081         0              2075
>> MiscStage                         0         0             57         0                 0
>> commitlog_archiver                0         0              0         0                 0
>> AntiEntropySessions               0         0              0         0                 0
>> InternalResponseStage             0         0              6         0                 0
>> HintedHandoff                     1         1            246         0                 0
>>
>> Message type           Dropped
>> RANGE_SLICE                482
>> READ_REPAIR                  0
>> BINARY                       0
>> READ                    515762
>> MUTATION                    39
>> _TRACE                       0
>> REQUEST_RESPONSE            29
>>
>>
>
