incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Keith Wright <kwri...@nanigans.com>
Subject Re: Nodes get stuck
Date Wed, 21 Aug 2013 12:31:27 GMT
Thank you for responding.  I did a quick look and my mutation stage threads are currently in
TIMED_WAITING (as expected since tpstats shows no active or pending) however most of my read
stage threads are Runnable with the stack traces below.  I haven't dug into them yet but thought
I would put them out there to see if anyone had any ideas since we are currently in a production
down state.

Thanks all!

Most have the first stack:

java.nio.HeapByteBuffer.duplicate(HeapByteBuffer.java:107)
org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:69)
org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:31)
java.util.TimSort.countRunAndMakeAscending(TimSort.java:329)
java.util.TimSort.sort(TimSort.java:203)
java.util.TimSort.sort(TimSort.java:173)
java.util.Arrays.sort(Arrays.java:659)
java.util.Collections.sort(Collections.java:217)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:255)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:281)
org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
org.apache.cassandra.utils.IntervalTree.build(IntervalTree.java:81)
org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:175)
org.apache.cassandra.db.AbstractThreadUnsafeSortedColumns.delete(AbstractThreadUnsafeSortedColumns.java:40)
org.apache.cassandra.db.AbstractColumnContainer.delete(AbstractColumnContainer.java:51)
org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:224)
org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:182)
org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:143)
org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:122)
org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:96)
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:157)
org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:293)
org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1357)
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1126)
org.apache.cassandra.db.Table.getRow(Table.java:347)
org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:44)
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:722)

Name: ReadStage:1719
State: RUNNABLE
Total blocked: 1,005  Total waited: 913

Stack trace:
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:252)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
org.apache.cassandra.utils.IntervalTree.build(IntervalTree.java:81)
org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:175)
org.apache.cassandra.db.AbstractThreadUnsafeSortedColumns.delete(AbstractThreadUnsafeSortedColumns.java:40)
org.apache.cassandra.db.AbstractColumnContainer.delete(AbstractColumnContainer.java:51)
org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:224)
org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:182)
org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:143)
org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:122)
org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:96)
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:157)
org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:293)
org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1357)
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1126)
org.apache.cassandra.db.Table.getRow(Table.java:347)
org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:44)
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:722)

Name: ReadStage:1722
State: RUNNABLE
Total blocked: 1,001  Total waited: 897

Stack trace:
org.apache.cassandra.db.marshal.Int32Type.compare(Int32Type.java:58)
org.apache.cassandra.db.marshal.Int32Type.compare(Int32Type.java:26)
org.apache.cassandra.db.marshal.AbstractType.compareCollectionMembers(AbstractType.java:229)
org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:81)
org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:31)
java.util.TimSort.binarySort(TimSort.java:265)
java.util.TimSort.sort(TimSort.java:208)
java.util.TimSort.sort(TimSort.java:173)
java.util.Arrays.sort(Arrays.java:659)
java.util.Collections.sort(Collections.java:217)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:255)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:281)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:281)
org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
org.apache.cassandra.utils.IntervalTree.build(IntervalTree.java:81)
org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:175)
org.apache.cassandra.db.AbstractThreadUnsafeSortedColumns.delete(AbstractThreadUnsafeSortedColumns.java:40)
org.apache.cassandra.db.AbstractColumnContainer.delete(AbstractColumnContainer.java:51)
org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:224)
org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:182)
org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:143)
org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:122)
org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:96)
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:157)
org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:293)
org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1357)
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1126)
org.apache.cassandra.db.Table.getRow(Table.java:347)
org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:44)
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:722)


From: Sylvain Lebresne <sylvain@datastax.com<mailto:sylvain@datastax.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Date: Wednesday, August 21, 2013 6:21 AM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Subject: Re: Nodes get stuck

A thread dump on one of the machine that has a suspiciously high CPU might help figuring out
what it is that is taking all that CPU.


On Wed, Aug 21, 2013 at 8:57 AM, Keith Wright <kwright@nanigans.com<mailto:kwright@nanigans.com>>
wrote:
Some last minute info on this to hopefully enlighten.  We are doing ~200 reads and writes
across our 7 node SSD cluster right now (usually can do closer to 20K reads at least) and
seeing CPU load as follows for the nodes (with some par new to give an idea of GC):

001 – 1200%   (Par New at 120 ms / sec)
002 – 6% (Par New at 0)
003 – 600% (Par New at 45 ms / sec)
004 – 900%
005 – 500%
006 – 10%
007 – 130%

There are no compactions running on 001 however I did see a broken pipe error in the logs
there (see below).  Netstats for 001 shows nothing pending.  It appears that all of the load/latency
is related to one column family.  You can see cfstats & cfhistograms output below and
note that we are using LCS.  I have brought the odd cfhistograms behavior to the thread before
and am not sure what's going on there.  We are in a production down situation right now so
any help would be much appreciated!!!

Column Family: global_user
SSTable count: 7546
SSTables in each level: [2, 10, 106/100, 453, 6975, 0, 0]
Space used (live): 83848742562
Space used (total): 83848742562
Number of Keys (estimate): 549792896
Memtable Columns Count: 526746
Memtable Data Size: 117408252
Memtable Switch Count: 0
Read Count: 11673
Read Latency: 1950.062 ms.
Write Count: 118588
Write Latency: 0.080 ms.
Pending Tasks: 0
Bloom Filter False Positives: 4322
Bloom Filter False Ratio: 0.84066
Bloom Filter Space Used: 383507440
Compacted row minimum size: 73
Compacted row maximum size: 2816159
Compacted row mean size: 324

[kwright@lxpcas001 ~]$ nodetool cfhistograms users global_user
users/global_user histograms
Offset      SSTables     Write Latency      Read Latency          Row Size      Column Count
1               8866                 0                 0                 0              3420
2               1001                 0                 0                 0          99218975
3               1249                 0                 0                 0         319713048
4               1074                 0                 0                 0          25073893
5                132                 0                 0                 0          15359199
6                  0                 0                 0                 0          27794925
7                  0                12                 0                 0           7954974
8                  0                23                 0                 0           7733934
10                 0               184                 0                 0          13276275
12                 0               567                 0                 0           9077508
14                 0              1098                 0                 0           5879292
17                 0              2722                 0                 0           5693471
20                 0              4379                 0                 0           3204131
24                 0              8928                 0                 0           2614995
29                 0             13525                 0                 0           1824584
35                 0             16759                 0                 0           1265911
42                 0             17048                 0                 0            868075
50                 0             14162                 5                 0            596417
60                 0             11806                15                 0            467747
72                 0              8569               108                 0            354276
86                 0              7042               276               227            269987
103                0              5936               372              2972            218931
124                0              4538               577               157            181360
149                0              2981              1076           7388090            144298
179                0              1929              1529          90535838            116628
215                0              1081              1450         182701876             93378
258                0               499              1125         141393480             74052
310                0               124               756          18883224             58617
372                0                31               460          24599272             45453
446                0                25               247          23516772             34310
535                0                10               146          13987584             26168
642                0                20               194          12091458             19965
770                0                 8               196           9269197             14649
924                0                 9               340           8082898             11015
1109               0                 9               225           4762865              8058
1331               0                 9               154           3330110              5866
1597               0                 8               144           2367615              4275
1916               0                 1               188           1633608              3087
2299               0                 4               216           1139820              2196
2759               0                 5               201            819019              1456
3311               0                 4               194            600522              1135
3973               0                 6               181            454566               786
4768               0                13               136            353886               587
5722               0                 6               152            280630               400
6866               0                 5                80            225545               254
8239               0                 6               112            183285               138
9887               0                 0                68            149820               109
11864              0                 5                99            121722               
66
14237              0                57                86             98352               
50
17084              0                18                99             79085               
35
20501              0                 1                93             62423               
11
24601              0                 0                61             49471               
 9
29521              0                 0                69             37395               
 5
35425              0                 4                56             28611               
 6
42510              0                 0                57             21876               
 1
51012              0                 9                60             16105               
 0
61214              0                 0                52             11996               
 0
73457              0                 0                50              8791               
 0
88148              0                 0                38              6430               
 0
105778             0                 0                25              4660               
 0
126934             0                 0                15              3308               
 0
152321             0                 0                 2              2364               
 0
182785             0                 0                 0              1631               
 0
219342             0                 0                 0              1156               
 0
263210             0                 0                 0               887               
 0
315852             0                 0                 0               618               
 0
379022             0                 0                 0               427               
 0
454826             0                 0                 0               272               
 0
545791             0                 0                 0               168               
 0
654949             0                 0                 0               115               
 0
785939             0                 0                 0                61               
 0
943127             0                 0                 0                58               
 0
1131752            0                 0                 0                34               
 0
1358102            0                 0                 0                19               
 0
1629722            0                 0                 0                 9               
 0
1955666            0                 0                 0                 4               
 0
2346799            0                 0                 0                 5               
 0
2816159            0                 0                 0                 2               
 0
3379391            0                 0                 0                 0               
 0
4055269            0                 0                 0                 0               
 0
4866323            0                 0                 0                 0               
 0
5839588            0                 0                 0                 0               
 0
7007506            0                 0                 0                 0               
 0
8409007            0                 0                 0                 0               
 0
10090808           0                 0                 0                 0               
 0
12108970           0                 0                 0                 0               
 0
14530764           0                 0                 0                 0               
 0
17436917           0                 0                 0                 0               
 0
20924300           0                 0                 0                 0               
 0
25109160           0                 0                 0                 0               
 0

ERROR [WRITE-/10.8.44.98<http://10.8.44.98>] 2013-08-21 06:50:25,450 OutboundTcpConnection.java
(line 197) error writing to /10.8.44.98<http://10.8.44.98>
java.lang.RuntimeException: java.io.IOException: Broken pipe
at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:59)
at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:30)
at org.apache.cassandra.db.ColumnFamilySerializer.serialize(ColumnFamilySerializer.java:73)
at org.apache.cassandra.db.Row$RowSerializer.serialize(Row.java:62)
at org.apache.cassandra.db.ReadResponseSerializer.serialize(ReadResponse.java:78)
at org.apache.cassandra.db.ReadResponseSerializer.serialize(ReadResponse.java:69)
at org.apache.cassandra.net.MessageOut.serialize(MessageOut.java:131)
at org.apache.cassandra.net.OutboundTcpConnection.write(OutboundTcpConnection.java:221)
at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:186)
at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:144)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
at java.nio.channels.Channels.writeFully(Channels.java:98)
at java.nio.channels.Channels.access$000(Channels.java:61)
at java.nio.channels.Channels$1.write(Channels.java:174)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:297)
at org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:244)
at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:99)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.cassandra.utils.ByteBufferUtil.write(ByteBufferUtil.java:328)
at org.apache.cassandra.utils.ByteBufferUtil.writeWithLength(ByteBufferUtil.java:315)
at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:55)
... 9 more


From: Keith Wright <kwright@nanigans.com<mailto:kwright@nanigans.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Date: Wednesday, August 21, 2013 2:35 AM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Subject: Re: Nodes get stuck

Still looking for help!  We have stopped almost ALL traffic to the cluster and still some
nodes are showing almost 1000% CPU for cassandra with no iostat activity.   We were running
cleanup on one of the nodes that was not showing load spikes however now when I attempt to
stop cleanup there via nodetool stop cleanup the java task for stopping cleanup itself is
at 1500% and has not returned after 2 minutes.  This is VERY odd behavior.  Any ideas?  Hardware
failure?  Network?  We are not seeing anything there but wanted to get ideas.

Thanks

From: Keith Wright <kwright@nanigans.com<mailto:kwright@nanigans.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Date: Tuesday, August 20, 2013 8:32 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Subject: Nodes get stuck

Hi all,

    We are using C* 1.2.4 with Vnodes and SSD.  We have seen behavior recently where 3 of
our nodes get locked up in high load in what appears to be a GC spiral while the rest of the
cluster (7 total nodes) appears fine.  When I run a tpstats, I see the following (assuming
tpstats returns at all) and top shows cassandra pegged at 2000%.  Obviously we have a large
number of blocked reads.  In the past I could explain this due to unexpectedly wide rows however
we have handled that.  When the cluster starts to meltdown like this its hard to get visibility
into what's going on and what triggered the issue as everything starts to pile on.  Opscenter
becomes unusable and because the effected nodes are in GC pressure, getting any data via nodetool
or JMX is also difficult.  What do people do to handle these situations?  We are going to
start graphing reads/writes/sec/CF to Ganglia in the hopes that it helps.

Thanks

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                       256       381     1245117434         0                 0
RequestResponseStage              0         0     1161495947         0                 0
MutationStage                     8         8      481721887         0                 0
ReadRepairStage                   0         0       85770600         0                 0
ReplicateOnWriteStage             0         0       21896804         0                 0
GossipStage                       0         0        1546196         0                 0
AntiEntropyStage                  0         0           5009         0                 0
MigrationStage                    0         0           1082         0                 0
MemtablePostFlusher               0         0          10178         0                 0
FlushWriter                       0         0           6081         0              2075
MiscStage                         0         0             57         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropySessions               0         0              0         0                 0
InternalResponseStage             0         0              6         0                 0
HintedHandoff                     1         1            246         0                 0

Message type           Dropped
RANGE_SLICE                482
READ_REPAIR                  0
BINARY                       0
READ                    515762
MUTATION                    39
_TRACE                       0
REQUEST_RESPONSE            29



Mime
View raw message