incubator-cassandra-user mailing list archives

From Nate McCall <n...@thelastpickle.com>
Subject Re: Nodes get stuck
Date Wed, 21 Aug 2013 13:33:18 GMT
We use a 128 MB sstable size and a gc_grace of 300 on a CF with highly
transient data (it holds values for client locking via a wait-chain algorithm
implementation in Hector).

Upgrades within the same minor version should be painless and doable with no
downtime.
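
For reference, on a CQL3 table those settings would be applied with something
along these lines (keyspace and table names here are made up; if the CF is a
Thrift-only one, the equivalent change would go through cassandra-cli instead):

cqlsh <host> <<'CQL'
USE my_ks;
ALTER TABLE locks
  WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': '128'}
  AND gc_grace_seconds = 300;
CQL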


On Wed, Aug 21, 2013 at 8:28 AM, Keith Wright <kwright@nanigans.com> wrote:

> We have our LCS sstable size at 64 MB and gc grace at 86400.  May I ask
> what values you use?  I saw that in 2.0 they are setting the LCS default
> sstable size to 160 MB.
>
> Does anyone see any risk in upgrading from 1.2.4 to 1.2.8?  The upgrade
> steps do not appear to mention any required actions, and it sounds like a
> rolling upgrade should be safe.
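>
> For what it's worth, the per-node sequence I would expect looks roughly like
> this (service and package commands depend entirely on how Cassandra is
> installed):
>
> nodetool -h localhost drain    # flush memtables; the node stops accepting writes
> sudo service cassandra stop
> # install the 1.2.8 package/tarball here with whatever mechanism you deploy with,
> # leaving cassandra.yaml as-is
> sudo service cassandra start
> nodetool -h localhost status   # confirm the node is back Up/Normal before moving on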
>
> From: Nate McCall <nate@thelastpickle.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Wednesday, August 21, 2013 9:07 AM
> To: Cassandra Users <user@cassandra.apache.org>
>
> Subject: Re: Nodes get stuck
>
> Hit send before I saw your update. Yeah, turn gc grace way down. You can
> turn your level size up a lot as well.
> On Aug 21, 2013 7:55 AM, "Keith Wright" <kwright@nanigans.com> wrote:
>
>> So the stack appears to be related to walking tombstones for a fetch.
>>  Can you please give me your take on whether this is a plausible explanation:
>>
>>    - Given our data model, we can experience wide rows.  We protect
>>    against these by randomly reading a portion of the row on write and, if
>>    the size is beyond a certain threshold, deleting data.
>>    - This has worked VERY well for some time now, however perhaps we hit a
>>    row that we deleted and that has many tombstones.  The row is being
>>    requested frequently, so Cassandra is working very hard to process all of
>>    its tombstones (currently RF-many nodes are at high load, which again
>>    suggests this).
>>
>> Question is what to do about it?  This is an LCS table with gc grace
>> seconds at 86400.  I assume my only options are to force a major compaction
>> via nodetool compact or to run upgradesstables?  How can I validate that
>> this is the cause?  How can I prevent it going forward?  Set the gc grace
>> seconds to a much lower value for that table?
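>>
>> For concreteness, the commands I have in mind are roughly these (the CF in
>> question is users.global_user per the cfhistograms output below; the
>> gc_grace value is only an illustration, and the ALTER assumes a CQL3 table):
>>
>> nodetool compact users global_user          # force a major compaction of just that CF
>> nodetool upgradesstables users global_user  # or rewrite its sstables in place
>> echo "USE users; ALTER TABLE global_user WITH gc_grace_seconds = 3600;" | cqlsh <host>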
>>
>> Thanks all!
>>
>> From: Keith Wright <kwright@nanigans.com>
>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Date: Wednesday, August 21, 2013 8:31 AM
>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Subject: Re: Nodes get stuck
>>
>> Thank you for responding.  I did a quick look and my mutation stage
>> threads are currently in TIMED_WAITING (as expected, since tpstats shows no
>> active or pending), however most of my read stage threads are RUNNABLE with
>> the stack traces below.  I haven't dug into them yet but thought I would
>> put them out there to see if anyone had any ideas, since we are currently in
>> a production-down state.
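>>
>> To get a quick sense of which frames dominate without reading each trace,
>> something like the following against a saved dump works (the file path is
>> just an example):
>>
>> grep 'org.apache.cassandra' /tmp/cassandra-threads.txt | sort | uniq -c | sort -rn | head -20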
>>
>> Thanks all!
>>
>> Most have the first stack:
>>
>> java.nio.HeapByteBuffer.duplicate(HeapByteBuffer.java:107)
>>
>> org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:69)
>>
>> org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:31)
>> java.util.TimSort.countRunAndMakeAscending(TimSort.java:329)
>> java.util.TimSort.sort(TimSort.java:203)
>> java.util.TimSort.sort(TimSort.java:173)
>> java.util.Arrays.sort(Arrays.java:659)
>> java.util.Collections.sort(Collections.java:217)
>>
>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:255)
>>
>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
>>
>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
>>
>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:281)
>> org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
>> org.apache.cassandra.utils.IntervalTree.build(IntervalTree.java:81)
>> org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:175)
>>
>> org.apache.cassandra.db.AbstractThreadUnsafeSortedColumns.delete(AbstractThreadUnsafeSortedColumns.java:40)
>>
>> org.apache.cassandra.db.AbstractColumnContainer.delete(AbstractColumnContainer.java:51)
>> org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:224)
>> org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:182)
>> org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
>>
>> org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:143)
>>
>> org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:122)
>>
>> org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:96)
>>
>> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>>
>> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>>
>> org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:157)
>>
>> org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
>>
>> org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
>>
>> org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:293)
>>
>> org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
>>
>> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1357)
>>
>> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
>>
>> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1126)
>> org.apache.cassandra.db.Table.getRow(Table.java:347)
>>
>> org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
>> org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:44)
>>
>> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> java.lang.Thread.run(Thread.java:722)
>>
>> Name: ReadStage:1719
>> State: RUNNABLE
>> Total blocked: 1,005  Total waited: 913
>>
>> Stack trace:
>>
>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:252)
>>
>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
>>
>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
>> org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
>> org.apache.cassandra.utils.IntervalTree.build(IntervalTree.java:81)
>> org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:175)
>>
>> org.apache.cassandra.db.AbstractThreadUnsafeSortedColumns.delete(AbstractThreadUnsafeSortedColumns.java:40)
>>
>> org.apache.cassandra.db.AbstractColumnContainer.delete(AbstractColumnContainer.java:51)
>> org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:224)
>> org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:182)
>> org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
>>
>> org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:143)
>>
>> org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:122)
>>
>> org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:96)
>>
>> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>>
>> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>>
>> org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:157)
>>
>> org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
>>
>> org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
>>
>> org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:293)
>>
>> org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
>>
>> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1357)
>>
>> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
>>
>> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1126)
>> org.apache.cassandra.db.Table.getRow(Table.java:347)
>>
>> org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
>> org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:44)
>>
>> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> java.lang.Thread.run(Thread.java:722)
>>
>> Name: ReadStage:1722
>> State: RUNNABLE
>> Total blocked: 1,001  Total waited: 897
>>
>> Stack trace:
>> org.apache.cassandra.db.marshal.Int32Type.compare(Int32Type.java:58)
>> org.apache.cassandra.db.marshal.Int32Type.compare(Int32Type.java:26)
>>
>> org.apache.cassandra.db.marshal.AbstractType.compareCollectionMembers(AbstractType.java:229)
>>
>> org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:81)
>>
>> org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:31)
>> java.util.TimSort.binarySort(TimSort.java:265)
>> java.util.TimSort.sort(TimSort.java:208)
>> java.util.TimSort.sort(TimSort.java:173)
>> java.util.Arrays.sort(Arrays.java:659)
>> java.util.Collections.sort(Collections.java:217)
>>
>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:255)
>>
>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
>>
>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:281)
>>
>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
>>
>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:281)
>>
>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
>> org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
>> org.apache.cassandra.utils.IntervalTree.build(IntervalTree.java:81)
>> org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:175)
>>
>> org.apache.cassandra.db.AbstractThreadUnsafeSortedColumns.delete(AbstractThreadUnsafeSortedColumns.java:40)
>>
>> org.apache.cassandra.db.AbstractColumnContainer.delete(AbstractColumnContainer.java:51)
>> org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:224)
>> org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:182)
>> org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
>>
>> org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:143)
>>
>> org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:122)
>>
>> org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:96)
>>
>> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>>
>> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>>
>> org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:157)
>>
>> org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
>>
>> org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
>>
>> org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:293)
>>
>> org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
>>
>> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1357)
>>
>> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
>>
>> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1126)
>> org.apache.cassandra.db.Table.getRow(Table.java:347)
>>
>> org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
>> org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:44)
>>
>> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> java.lang.Thread.run(Thread.java:722)
>>
>>
>> From: Sylvain Lebresne <sylvain@datastax.com>
>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Date: Wednesday, August 21, 2013 6:21 AM
>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Subject: Re: Nodes get stuck
>>
>> A thread dump on one of the machines that has suspiciously high CPU
>> might help figure out what it is that is taking all that CPU.
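>>
>> Something along these lines should do it; the pgrep match is just one way to
>> find the Cassandra pid, and it needs to run as the same user Cassandra runs as:
>>
>> jstack -l $(pgrep -f CassandraDaemon) > /tmp/cassandra-threads.txt
>> # or: kill -3 <pid>, which writes the dump to Cassandra's stdout log instead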
>>
>>
>> On Wed, Aug 21, 2013 at 8:57 AM, Keith Wright <kwright@nanigans.com> wrote:
>>
>>> Some last-minute info on this to hopefully enlighten.  We are doing ~200
>>> reads and writes across our 7-node SSD cluster right now (we can usually do
>>> at least 20K reads) and are seeing CPU load as follows for the nodes
>>> (with some ParNew numbers to give an idea of GC):
>>>
>>> 001 – 1200%   (Par New at 120 ms / sec)
>>> 002 – 6% (Par New at 0)
>>> 003 – 600% (Par New at 45 ms / sec)
>>> 004 – 900%
>>> 005 – 500%
>>> 006 – 10%
>>> 007 – 130%
>>>
>>> There are no compactions running on 001, however I did see a broken pipe
>>> error in the logs there (see below).  Netstats for 001 shows nothing
>>> pending.  It appears that all of the load/latency is related to one column
>>> family.  You can see cfstats & cfhistograms output below; note that we
>>> are using LCS.  I have brought the odd cfhistograms behavior to the thread
>>> before and am not sure what's going on there.  We are in a production-down
>>> situation right now, so any help would be much appreciated!!!
>>>
>>> Column Family: global_user
>>> SSTable count: 7546
>>> SSTables in each level: [2, 10, 106/100, 453, 6975, 0, 0]
>>> Space used (live): 83848742562
>>> Space used (total): 83848742562
>>> Number of Keys (estimate): 549792896
>>> Memtable Columns Count: 526746
>>> Memtable Data Size: 117408252
>>> Memtable Switch Count: 0
>>> Read Count: 11673
>>> Read Latency: 1950.062 ms.
>>> Write Count: 118588
>>> Write Latency: 0.080 ms.
>>> Pending Tasks: 0
>>> Bloom Filter False Positives: 4322
>>> Bloom Filter False Ratio: 0.84066
>>> Bloom Filter Space Used: 383507440
>>> Compacted row minimum size: 73
>>> Compacted row maximum size: 2816159
>>> Compacted row mean size: 324
>>>
>>> [kwright@lxpcas001 ~]$ nodetool cfhistograms users global_user
>>> users/global_user histograms
>>> Offset      SSTables  Write Latency  Read Latency    Row Size  Column Count
>>> 1               8866              0             0           0          3420
>>> 2               1001              0             0           0      99218975
>>> 3               1249              0             0           0     319713048
>>> 4               1074              0             0           0      25073893
>>> 5                132              0             0           0      15359199
>>> 6                  0              0             0           0      27794925
>>> 7                  0             12             0           0       7954974
>>> 8                  0             23             0           0       7733934
>>> 10                 0            184             0           0      13276275
>>> 12                 0            567             0           0       9077508
>>> 14                 0           1098             0           0       5879292
>>> 17                 0           2722             0           0       5693471
>>> 20                 0           4379             0           0       3204131
>>> 24                 0           8928             0           0       2614995
>>> 29                 0          13525             0           0       1824584
>>> 35                 0          16759             0           0       1265911
>>> 42                 0          17048             0           0        868075
>>> 50                 0          14162             5           0        596417
>>> 60                 0          11806            15           0        467747
>>> 72                 0           8569           108           0        354276
>>> 86                 0           7042           276         227        269987
>>> 103                0           5936           372        2972        218931
>>> 124                0           4538           577         157        181360
>>> 149                0           2981          1076     7388090        144298
>>> 179                0           1929          1529    90535838        116628
>>> 215                0           1081          1450   182701876         93378
>>> 258                0            499          1125   141393480         74052
>>> 310                0            124           756    18883224         58617
>>> 372                0             31           460    24599272         45453
>>> 446                0             25           247    23516772         34310
>>> 535                0             10           146    13987584         26168
>>> 642                0             20           194    12091458         19965
>>> 770                0              8           196     9269197         14649
>>> 924                0              9           340     8082898         11015
>>> 1109               0              9           225     4762865          8058
>>> 1331               0              9           154     3330110          5866
>>> 1597               0              8           144     2367615          4275
>>> 1916               0              1           188     1633608          3087
>>> 2299               0              4           216     1139820          2196
>>> 2759               0              5           201      819019          1456
>>> 3311               0              4           194      600522          1135
>>> 3973               0              6           181      454566           786
>>> 4768               0             13           136      353886           587
>>> 5722               0              6           152      280630           400
>>> 6866               0              5            80      225545           254
>>> 8239               0              6           112      183285           138
>>> 9887               0              0            68      149820           109
>>> 11864              0              5            99      121722            66
>>> 14237              0             57            86       98352            50
>>> 17084              0             18            99       79085            35
>>> 20501              0              1            93       62423            11
>>> 24601              0              0            61       49471             9
>>> 29521              0              0            69       37395             5
>>> 35425              0              4            56       28611             6
>>> 42510              0              0            57       21876             1
>>> 51012              0              9            60       16105             0
>>> 61214              0              0            52       11996             0
>>> 73457              0              0            50        8791             0
>>> 88148              0              0            38        6430             0
>>> 105778             0              0            25        4660             0
>>> 126934             0              0            15        3308             0
>>> 152321             0              0             2        2364             0
>>> 182785             0              0             0        1631             0
>>> 219342             0              0             0        1156             0
>>> 263210             0              0             0         887             0
>>> 315852             0              0             0         618             0
>>> 379022             0              0             0         427             0
>>> 454826             0              0             0         272             0
>>> 545791             0              0             0         168             0
>>> 654949             0              0             0         115             0
>>> 785939             0              0             0          61             0
>>> 943127             0              0             0          58             0
>>> 1131752            0              0             0          34             0
>>> 1358102            0              0             0          19             0
>>> 1629722            0              0             0           9             0
>>> 1955666            0              0             0           4             0
>>> 2346799            0              0             0           5             0
>>> 2816159            0              0             0           2             0
>>> 3379391            0              0             0           0             0
>>> 4055269            0              0             0           0             0
>>> 4866323            0              0             0           0             0
>>> 5839588            0              0             0           0             0
>>> 7007506            0              0             0           0             0
>>> 8409007            0              0             0           0             0
>>> 10090808           0              0             0           0             0
>>> 12108970           0              0             0           0             0
>>> 14530764           0              0             0           0             0
>>> 17436917           0              0             0           0             0
>>> 20924300           0              0             0           0             0
>>> 25109160           0              0             0           0             0
>>>
>>> ERROR [WRITE-/10.8.44.98] 2013-08-21 06:50:25,450
>>> OutboundTcpConnection.java (line 197) error writing to /10.8.44.98
>>> java.lang.RuntimeException: java.io.IOException: Broken pipe
>>> at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:59)
>>> at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:30)
>>> at org.apache.cassandra.db.ColumnFamilySerializer.serialize(ColumnFamilySerializer.java:73)
>>> at org.apache.cassandra.db.Row$RowSerializer.serialize(Row.java:62)
>>> at org.apache.cassandra.db.ReadResponseSerializer.serialize(ReadResponse.java:78)
>>> at org.apache.cassandra.db.ReadResponseSerializer.serialize(ReadResponse.java:69)
>>> at org.apache.cassandra.net.MessageOut.serialize(MessageOut.java:131)
>>> at org.apache.cassandra.net.OutboundTcpConnection.write(OutboundTcpConnection.java:221)
>>> at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:186)
>>> at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:144)
>>> Caused by: java.io.IOException: Broken pipe
>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>>> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>>> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>>> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
>>> at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
>>> at java.nio.channels.Channels.writeFully(Channels.java:98)
>>> at java.nio.channels.Channels.access$000(Channels.java:61)
>>> at java.nio.channels.Channels$1.write(Channels.java:174)
>>> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>>> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
>>> at org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:297)
>>> at org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:244)
>>> at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:99)
>>> at java.io.DataOutputStream.write(DataOutputStream.java:107)
>>> at org.apache.cassandra.utils.ByteBufferUtil.write(ByteBufferUtil.java:328)
>>> at org.apache.cassandra.utils.ByteBufferUtil.writeWithLength(ByteBufferUtil.java:315)
>>> at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:55)
>>> ... 9 more
>>>
>>>
>>> From: Keith Wright <kwright@nanigans.com>
>>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>> Date: Wednesday, August 21, 2013 2:35 AM
>>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>> Subject: Re: Nodes get stuck
>>>
>>> Still looking for help!  We have stopped almost ALL traffic to the
>>> cluster and still some nodes are showing almost 1000% CPU for cassandra
>>> with no iostat activity.  We were running cleanup on one of the nodes that
>>> was not showing load spikes; however, now when I attempt to stop cleanup
>>> there via nodetool stop cleanup, the java task for stopping cleanup itself
>>> is at 1500% and has not returned after 2 minutes.  This is VERY odd
>>> behavior.  Any ideas?  Hardware failure?  Network?  We are not seeing
>>> anything there but wanted to get ideas.
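>>>
>>> For anyone trying to reproduce, the commands involved are along these lines:
>>>
>>> nodetool -h localhost stop CLEANUP       # attempt to cancel the running cleanup
>>> nodetool -h localhost compactionstats    # check whether the cleanup is still listed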
>>>
>>> Thanks
>>>
>>> From: Keith Wright <kwright@nanigans.com>
>>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>> Date: Tuesday, August 20, 2013 8:32 PM
>>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>> Subject: Nodes get stuck
>>>
>>> Hi all,
>>>
>>>     We are using C* 1.2.4 with vnodes and SSDs.  We have seen behavior
>>> recently where 3 of our nodes get locked up under high load in what appears
>>> to be a GC spiral while the rest of the cluster (7 total nodes) appears fine.
>>>  When I run tpstats, I see the following (assuming tpstats returns at
>>> all) and top shows cassandra pegged at 2000%.  Obviously we have a large
>>> number of blocked reads.  In the past I could explain this due to
>>> unexpectedly wide rows, however we have handled that.  When the cluster
>>> starts to melt down like this, it's hard to get visibility into what's going
>>> on and what triggered the issue, as everything starts to pile on.  OpsCenter
>>> becomes unusable, and because the affected nodes are under GC pressure,
>>> getting any data via nodetool or JMX is also difficult.  What do people do
>>> to handle these situations?  We are going to start graphing
>>> reads/writes per second per CF to Ganglia in the hopes that it helps.
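>>>
>>> As a stopgap until those graphs exist, even a crude periodic snapshot of the
>>> per-CF counters would give us a trend line; the path and interval here are
>>> arbitrary:
>>>
>>> while true; do
>>>   date
>>>   nodetool -h localhost cfstats | egrep 'Column Family:|Read Count:|Read Latency:|Write Count:|Write Latency:'
>>>   sleep 60
>>> done >> /tmp/cfstats-trend.log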
>>>
>>> Thanks
>>>
>>> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
>>> ReadStage                       256       381     1245117434         0                 0
>>> RequestResponseStage              0         0     1161495947         0                 0
>>> MutationStage                     8         8      481721887         0                 0
>>> ReadRepairStage                   0         0       85770600         0                 0
>>> ReplicateOnWriteStage             0         0       21896804         0                 0
>>> GossipStage                       0         0        1546196         0                 0
>>> AntiEntropyStage                  0         0           5009         0                 0
>>> MigrationStage                    0         0           1082         0                 0
>>> MemtablePostFlusher               0         0          10178         0                 0
>>> FlushWriter                       0         0           6081         0              2075
>>> MiscStage                         0         0             57         0                 0
>>> commitlog_archiver                0         0              0         0                 0
>>> AntiEntropySessions               0         0              0         0                 0
>>> InternalResponseStage             0         0              6         0                 0
>>> HintedHandoff                     1         1            246         0                 0
>>>
>>> Message type           Dropped
>>> RANGE_SLICE                482
>>> READ_REPAIR                  0
>>> BINARY                       0
>>> READ                    515762
>>> MUTATION                    39
>>> _TRACE                       0
>>> REQUEST_RESPONSE            29
>>>
>>>
>>
