incubator-cassandra-user mailing list archives

From Keith Wright <kwri...@nanigans.com>
Subject Re: Nodes get stuck
Date Wed, 21 Aug 2013 13:48:47 GMT
Are many people running 1.2.8?  Any issues?  Just nervous about running on the latest.
Would prefer to be a couple of versions behind, as new bugs do tend to pop up.

Thanks all

From: Nate McCall <nate@thelastpickle.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, August 21, 2013 9:33 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Nodes get stuck

We use 128m and gc_grace of 300 on a CF with highly transient data (holds values for client
locking via wait-chain algorithm implementation in hector).
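
If it helps, roughly how that gets applied from cqlsh (a sketch; the keyspace/table names are placeholders, and 300/128 are the values that suit our transient data, not a general recommendation):

    cqlsh <<'CQL'
    ALTER TABLE my_keyspace.my_cf
      WITH gc_grace_seconds = 300
      AND compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 128};
    CQL

The schema change propagates to the whole cluster, but the new sstable size only applies to sstables written after the change.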

Upgrades within the same minor version should be painless and doable with no downtime.


On Wed, Aug 21, 2013 at 8:28 AM, Keith Wright <kwright@nanigans.com> wrote:
We have our LCS sstable size at 64 MB and gc_grace_seconds at 86400.  May I ask what values you use?
I saw that in 2.0 the default LCS sstable size is being raised to 160 MB.

Does anyone see any risk in upgrading from 1.2.4 to 1.2.8?  The upgrade steps do not appear to
mention any required actions, and it sounds like a rolling upgrade should be safe.
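
The per-node procedure I have in mind is roughly the following (a sketch assuming a package install; service commands will differ by setup, and I'd wait for the node to show up normally in the ring before moving to the next one):

    nodetool drain                 # flush memtables and stop accepting requests
    sudo service cassandra stop
    # install the 1.2.8 package / swap in the new tarball here
    sudo service cassandra start
    nodetool version               # confirm the node reports 1.2.8 before moving on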

From: Nate McCall <nate@thelastpickle.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, August 21, 2013 9:07 AM
To: Cassandra Users <user@cassandra.apache.org>

Subject: Re: Nodes get stuck


Hit send before I saw your update. Yes, turn gc_grace way down. You can turn your level size
up a lot as well.

On Aug 21, 2013 7:55 AM, "Keith Wright" <kwright@nanigans.com> wrote:
So the stack appears to be related to walking tombstones for a fetch.  Can you please give
me your take on whether this is a plausible explanation:

 *   Given our data model, we can experience wide rows.  We protect against these by randomly
reading a portion of the row on write, and if the size is beyond a certain threshold, we delete data.
 *   This has worked VERY well for some time now, however perhaps we hit a row that we deleted
and that now has many tombstones.  The row is being requested frequently, so Cassandra is working very
hard to process through all of its tombstones (currently the RF number of nodes are at high load,
which again suggests this).

Question is what to do about it?  This is an LCS table with gc_grace_seconds at 86400.  I
assume my only options are to force a major compaction via nodetool compact or to run nodetool
upgradesstables?  How can I validate this is the cause?  How can I prevent it going forward?  Set
gc_grace_seconds to a much lower value for that table?
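
For concreteness, the commands I have in mind are roughly these (keyspace/CF are ours; I'm not sure how a major compaction behaves for LCS on 1.2, and the sstable2json key may need to be hex, so treat this as a sketch):

    nodetool compact users global_user               # force a major compaction on just this CF
    nodetool upgradesstables users global_user       # or rewrite the sstables in place
    sstable2json <path-to-a-global_user-Data.db> -k <row key>   # spot-check a hot key for tombstones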

Thanks all!

From: Keith Wright <kwright@nanigans.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, August 21, 2013 8:31 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Nodes get stuck

Thank you for responding.  I did a quick look and my mutation stage threads are currently
in TIMED_WAITING (as expected, since tpstats shows no active or pending tasks); however, most of my
read stage threads are RUNNABLE with the stack traces below.  I haven't dug into them yet
but thought I would put them out there to see if anyone had any ideas, since we are currently
in a production down state.

Thanks all!

Most have the first stack:

java.nio.HeapByteBuffer.duplicate(HeapByteBuffer.java:107)
org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:69)
org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:31)
java.util.TimSort.countRunAndMakeAscending(TimSort.java:329)
java.util.TimSort.sort(TimSort.java:203)
java.util.TimSort.sort(TimSort.java:173)
java.util.Arrays.sort(Arrays.java:659)
java.util.Collections.sort(Collections.java:217)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:255)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:281)
org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
org.apache.cassandra.utils.IntervalTree.build(IntervalTree.java:81)
org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:175)
org.apache.cassandra.db.AbstractThreadUnsafeSortedColumns.delete(AbstractThreadUnsafeSortedColumns.java:40)
org.apache.cassandra.db.AbstractColumnContainer.delete(AbstractColumnContainer.java:51)
org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:224)
org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:182)
org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:143)
org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:122)
org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:96)
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:157)
org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:293)
org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1357)
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1126)
org.apache.cassandra.db.Table.getRow(Table.java:347)
org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:44)
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:722)

Name: ReadStage:1719
State: RUNNABLE
Total blocked: 1,005  Total waited: 913

Stack trace:
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:252)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
org.apache.cassandra.utils.IntervalTree.build(IntervalTree.java:81)
org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:175)
org.apache.cassandra.db.AbstractThreadUnsafeSortedColumns.delete(AbstractThreadUnsafeSortedColumns.java:40)
org.apache.cassandra.db.AbstractColumnContainer.delete(AbstractColumnContainer.java:51)
org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:224)
org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:182)
org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:143)
org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:122)
org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:96)
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:157)
org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:293)
org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1357)
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1126)
org.apache.cassandra.db.Table.getRow(Table.java:347)
org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:44)
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:722)

Name: ReadStage:1722
State: RUNNABLE
Total blocked: 1,001  Total waited: 897

Stack trace:
org.apache.cassandra.db.marshal.Int32Type.compare(Int32Type.java:58)
org.apache.cassandra.db.marshal.Int32Type.compare(Int32Type.java:26)
org.apache.cassandra.db.marshal.AbstractType.compareCollectionMembers(AbstractType.java:229)
org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:81)
org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:31)
java.util.TimSort.binarySort(TimSort.java:265)
java.util.TimSort.sort(TimSort.java:208)
java.util.TimSort.sort(TimSort.java:173)
java.util.Arrays.sort(Arrays.java:659)
java.util.Collections.sort(Collections.java:217)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:255)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:281)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:281)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
org.apache.cassandra.utils.IntervalTree.build(IntervalTree.java:81)
org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:175)
org.apache.cassandra.db.AbstractThreadUnsafeSortedColumns.delete(AbstractThreadUnsafeSortedColumns.java:40)
org.apache.cassandra.db.AbstractColumnContainer.delete(AbstractColumnContainer.java:51)
org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:224)
org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:182)
org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:143)
org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:122)
org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:96)
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:157)
org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:293)
org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1357)
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1126)
org.apache.cassandra.db.Table.getRow(Table.java:347)
org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:44)
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:722)


From: Sylvain Lebresne <sylvain@datastax.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, August 21, 2013 6:21 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Nodes get stuck

A thread dump on one of the machines with suspiciously high CPU might help figure out
what it is that is taking all that CPU.
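
Something along these lines on the hot node, repeated a few times ~10 seconds apart, is usually enough (assumes the JDK tools are installed and the process name matches CassandraDaemon):

    PID=$(pgrep -f CassandraDaemon)
    for i in 1 2 3; do jstack -l "$PID" > /tmp/cassandra-threads.$i.txt; sleep 10; done
    grep -c 'java.lang.Thread.State: RUNNABLE' /tmp/cassandra-threads.*.txt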


On Wed, Aug 21, 2013 at 8:57 AM, Keith Wright <kwright@nanigans.com> wrote:
Some last-minute info on this to hopefully enlighten.  We are doing ~200 reads and writes
across our 7-node SSD cluster right now (we can usually do closer to 20K reads at least) and are
seeing CPU load as follows for the nodes (with some ParNew times to give an idea of GC; see the
jstat sketch after the list):

001 – 1200%   (Par New at 120 ms / sec)
002 – 6% (Par New at 0)
003 – 600% (Par New at 45 ms / sec)
004 – 900%
005 – 500%
006 – 10%
007 – 130%
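
For anyone wanting to watch the same thing on their nodes, GC activity can be sampled with something like this (assumes the JDK's jstat is on the box; the pid lookup is just one way to find the Cassandra process):

    jstat -gcutil $(pgrep -f CassandraDaemon) 1s 30   # heap occupancy and GC time, 1-second samples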

There are no compactions running on 001 however I did see a broken pipe error in the logs
there (see below).  Netstats for 001 shows nothing pending.  It appears that all of the load/latency
is related to one column family.  You can see cfstats & cfhistograms output below and
note that we are using LCS.  I have brought the odd cfhistograms behavior to the thread before
and am not sure what's going on there.  We are in a production down situation right now so
any help would be much appreciated!!!

Column Family: global_user
SSTable count: 7546
SSTables in each level: [2, 10, 106/100, 453, 6975, 0, 0]
Space used (live): 83848742562
Space used (total): 83848742562
Number of Keys (estimate): 549792896
Memtable Columns Count: 526746
Memtable Data Size: 117408252
Memtable Switch Count: 0
Read Count: 11673
Read Latency: 1950.062 ms.
Write Count: 118588
Write Latency: 0.080 ms.
Pending Tasks: 0
Bloom Filter False Positives: 4322
Bloom Filter False Ratio: 0.84066
Bloom Filter Space Used: 383507440
Compacted row minimum size: 73
Compacted row maximum size: 2816159
Compacted row mean size: 324

[kwright@lxpcas001 ~]$ nodetool cfhistograms users global_user
users/global_user histograms
Offset      SSTables     Write Latency      Read Latency          Row Size      Column Count
1               8866                 0                 0                 0              3420
2               1001                 0                 0                 0          99218975
3               1249                 0                 0                 0         319713048
4               1074                 0                 0                 0          25073893
5                132                 0                 0                 0          15359199
6                  0                 0                 0                 0          27794925
7                  0                12                 0                 0           7954974
8                  0                23                 0                 0           7733934
10                 0               184                 0                 0          13276275
12                 0               567                 0                 0           9077508
14                 0              1098                 0                 0           5879292
17                 0              2722                 0                 0           5693471
20                 0              4379                 0                 0           3204131
24                 0              8928                 0                 0           2614995
29                 0             13525                 0                 0           1824584
35                 0             16759                 0                 0           1265911
42                 0             17048                 0                 0            868075
50                 0             14162                 5                 0            596417
60                 0             11806                15                 0            467747
72                 0              8569               108                 0            354276
86                 0              7042               276               227            269987
103                0              5936               372              2972            218931
124                0              4538               577               157            181360
149                0              2981              1076           7388090            144298
179                0              1929              1529          90535838            116628
215                0              1081              1450         182701876             93378
258                0               499              1125         141393480             74052
310                0               124               756          18883224             58617
372                0                31               460          24599272             45453
446                0                25               247          23516772             34310
535                0                10               146          13987584             26168
642                0                20               194          12091458             19965
770                0                 8               196           9269197             14649
924                0                 9               340           8082898             11015
1109               0                 9               225           4762865              8058
1331               0                 9               154           3330110              5866
1597               0                 8               144           2367615              4275
1916               0                 1               188           1633608              3087
2299               0                 4               216           1139820              2196
2759               0                 5               201            819019              1456
3311               0                 4               194            600522              1135
3973               0                 6               181            454566               786
4768               0                13               136            353886               587
5722               0                 6               152            280630               400
6866               0                 5                80            225545               254
8239               0                 6               112            183285               138
9887               0                 0                68            149820               109
11864              0                 5                99            121722                66
14237              0                57                86             98352                50
17084              0                18                99             79085                35
20501              0                 1                93             62423                11
24601              0                 0                61             49471                 9
29521              0                 0                69             37395                 5
35425              0                 4                56             28611                 6
42510              0                 0                57             21876                 1
51012              0                 9                60             16105                 0
61214              0                 0                52             11996                 0
73457              0                 0                50              8791                 0
88148              0                 0                38              6430                 0
105778             0                 0                25              4660                 0
126934             0                 0                15              3308                 0
152321             0                 0                 2              2364                 0
182785             0                 0                 0              1631                 0
219342             0                 0                 0              1156                 0
263210             0                 0                 0               887                 0
315852             0                 0                 0               618                 0
379022             0                 0                 0               427                 0
454826             0                 0                 0               272                 0
545791             0                 0                 0               168                 0
654949             0                 0                 0               115                 0
785939             0                 0                 0                61                 0
943127             0                 0                 0                58                 0
1131752            0                 0                 0                34                 0
1358102            0                 0                 0                19                 0
1629722            0                 0                 0                 9                 0
1955666            0                 0                 0                 4                 0
2346799            0                 0                 0                 5                 0
2816159            0                 0                 0                 2                 0
3379391            0                 0                 0                 0                 0
4055269            0                 0                 0                 0                 0
4866323            0                 0                 0                 0                 0
5839588            0                 0                 0                 0                 0
7007506            0                 0                 0                 0                 0
8409007            0                 0                 0                 0                 0
10090808           0                 0                 0                 0                 0
12108970           0                 0                 0                 0                 0
14530764           0                 0                 0                 0                 0
17436917           0                 0                 0                 0                 0
20924300           0                 0                 0                 0                 0
25109160           0                 0                 0                 0                 0

ERROR [WRITE-/10.8.44.98] 2013-08-21 06:50:25,450 OutboundTcpConnection.java (line 197) error writing to /10.8.44.98
java.lang.RuntimeException: java.io.IOException: Broken pipe
at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:59)
at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:30)
at org.apache.cassandra.db.ColumnFamilySerializer.serialize(ColumnFamilySerializer.java:73)
at org.apache.cassandra.db.Row$RowSerializer.serialize(Row.java:62)
at org.apache.cassandra.db.ReadResponseSerializer.serialize(ReadResponse.java:78)
at org.apache.cassandra.db.ReadResponseSerializer.serialize(ReadResponse.java:69)
at org.apache.cassandra.net.MessageOut.serialize(MessageOut.java:131)
at org.apache.cassandra.net.OutboundTcpConnection.write(OutboundTcpConnection.java:221)
at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:186)
at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:144)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
at java.nio.channels.Channels.writeFully(Channels.java:98)
at java.nio.channels.Channels.access$000(Channels.java:61)
at java.nio.channels.Channels$1.write(Channels.java:174)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:297)
at org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:244)
at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:99)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.cassandra.utils.ByteBufferUtil.write(ByteBufferUtil.java:328)
at org.apache.cassandra.utils.ByteBufferUtil.writeWithLength(ByteBufferUtil.java:315)
at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:55)
... 9 more


From: Keith Wright <kwright@nanigans.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, August 21, 2013 2:35 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Nodes get stuck

Still looking for help!  We have stopped almost ALL traffic to the cluster, and still some
nodes are showing almost 1000% CPU for cassandra with no iostat activity.  We were running
cleanup on one of the nodes that was not showing load spikes; however, now when I attempt to
stop cleanup there via nodetool stop cleanup, the java task for stopping cleanup itself is
at 1500% and has not returned after 2 minutes.  This is VERY odd behavior.  Any ideas?  Hardware
failure?  Network?  We are not seeing anything there but wanted to get ideas.
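
For reference, this is what I'm running to check on and abort it (I believe the compaction type argument to nodetool stop is the upper-case name, but correct me if the syntax differs on 1.2.4):

    nodetool compactionstats    # shows any cleanup/compaction still in progress
    nodetool stop CLEANUP       # ask the node to abort the running cleanup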

Thanks

From: Keith Wright <kwright@nanigans.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, August 20, 2013 8:32 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Nodes get stuck

Hi all,

    We are using C* 1.2.4 with vnodes and SSDs.  We have seen behavior recently where 3 of
our nodes get locked up under high load in what appears to be a GC spiral, while the rest of the
cluster (7 nodes total) appears fine.  When I run tpstats, I see the following (assuming
tpstats returns at all), and top shows cassandra pegged at 2000%.  Obviously we have a large
number of pending reads.  In the past I could explain this due to unexpectedly wide rows; however,
we have handled that.  When the cluster starts to melt down like this, it's hard to get visibility
into what's going on and what triggered the issue, as everything starts to pile on.  OpsCenter
becomes unusable, and because the affected nodes are under GC pressure, getting any data via nodetool
or JMX is also difficult.  What do people do to handle these situations?  We are going to
start graphing reads/writes/sec per CF to Ganglia in the hopes that it helps.
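
The rough idea for the Ganglia piece, in case anyone has a better approach (a sketch: the CF name is ours, the gmetric flags are from memory, and these are cumulative counters so we would wrap this to push deltas):

    #!/bin/sh
    # pull global_user's cumulative read/write counts out of cfstats and ship them to Ganglia
    STATS=$(nodetool cfstats | awk '/Column Family: global_user/{f=1} f{print} f && /Write Latency/{exit}')
    READS=$(echo "$STATS" | awk '/Read Count/ {print $3}')
    WRITES=$(echo "$STATS" | awk '/Write Count/ {print $3}')
    gmetric --name global_user_reads --value "$READS" --type double --units count
    gmetric --name global_user_writes --value "$WRITES" --type double --units count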

Thanks

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                       256       381     1245117434         0                 0
RequestResponseStage              0         0     1161495947         0                 0
MutationStage                     8         8      481721887         0                 0
ReadRepairStage                   0         0       85770600         0                 0
ReplicateOnWriteStage             0         0       21896804         0                 0
GossipStage                       0         0        1546196         0                 0
AntiEntropyStage                  0         0           5009         0                 0
MigrationStage                    0         0           1082         0                 0
MemtablePostFlusher               0         0          10178         0                 0
FlushWriter                       0         0           6081         0              2075
MiscStage                         0         0             57         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropySessions               0         0              0         0                 0
InternalResponseStage             0         0              6         0                 0
HintedHandoff                     1         1            246         0                 0

Message type           Dropped
RANGE_SLICE                482
READ_REPAIR                  0
BINARY                       0
READ                    515762
MUTATION                    39
_TRACE                       0
REQUEST_RESPONSE            29



