incubator-cassandra-user mailing list archives

From Sylvain Lebresne <sylv...@datastax.com>
Subject Re: Nodes get stuck
Date Wed, 21 Aug 2013 15:21:41 GMT
> In other words, is it expected that the same interval tree issue would
> occur during compactions?
>

Yep.


>
> Thanks
>
> From: Keith Wright <kwright@nanigans.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Wednesday, August 21, 2013 9:48 AM
>
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Nodes get stuck
>
> Are many people running 1.2.8?  Any issues?  Just nervous about running
> the latest; I would prefer to be a couple of versions behind, as new bugs
> do tend to pop up.
>
> Thanks all
>
> From: Nate McCall <nate@thelastpickle.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Wednesday, August 21, 2013 9:33 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Nodes get stuck
>
> We use 128m and gc_grace of 300 on a CF with highly transient data (it
> holds values for client locking via a wait-chain algorithm implementation
> in Hector).
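>
> For reference, on 1.2 both settings can be applied in one statement from
> cqlsh, along these lines (the keyspace/table name here is a placeholder,
> and a 300s gc_grace is only sane for data you don't rely on repair to
> protect):
>
>     ALTER TABLE locks.wait_chain
>       WITH compaction = {'class': 'LeveledCompactionStrategy',
>                          'sstable_size_in_mb': 128}
>       AND gc_grace_seconds = 300;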
>
> Upgrades within the same minor version should be painless and doable with
> no downtime.
>
>
> On Wed, Aug 21, 2013 at 8:28 AM, Keith Wright <kwright@nanigans.com> wrote:
>
>> We have our LCS sstable size at 64 MB and gc grace at 86400.  May I ask
>> what values you use?  I saw that in 2.0 the default LCS sstable size is
>> being raised to 160 MB.
>>
>> Does anyone see any risk in upgrading from 1.2.4 to 1.2.8?  The upgrade
>> steps do not appear to mention any required actions, and a rolling
>> upgrade should be safe.
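>>
>> For what it's worth, my understanding of the per-node sequence for a
>> rolling minor upgrade is roughly the following (package/service names
>> depend on the install method):
>>
>>     nodetool drain              # flush memtables, stop accepting traffic
>>     sudo service cassandra stop
>>     # install the 1.2.8 package or unpack the new tarball here
>>     sudo service cassandra start
>>     nodetool version            # confirm the node came back on 1.2.8
>>
>> moving on to the next node once the previous one is back up and gossiping.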
>>
>> From: Nate McCall <nate@thelastpickle.com>
>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Date: Wednesday, August 21, 2013 9:07 AM
>> To: Cassandra Users <user@cassandra.apache.org>
>>
>> Subject: Re: Nodes get stuck
>>
>> Hit send before I saw your update.  Yeah, turn gc grace way down.  You
>> can turn your level size up a lot as well.
>> On Aug 21, 2013 7:55 AM, "Keith Wright" <kwright@nanigans.com> wrote:
>>
>>> So the stack appears to be related to walking tombstones for a fetch.
>>>  Can you please give me your take on whether this is a plausible
>>> explanation:
>>>
>>>    - Given our data model, we can experience wide rows.  We protect
>>>    against these by randomly reading a portion of the row on write and,
>>>    if the size is beyond a certain threshold, deleting data.
>>>    - This has worked VERY well for some time now, but perhaps we hit a
>>>    row that we deleted and that has many tombstones.  The row is being
>>>    requested frequently, so Cassandra is working very hard to process
>>>    all of its tombstones (currently the RF-many replica nodes are at
>>>    high load, which again suggests this).
>>>
>>> The question is what to do about it.  This is an LCS table with gc grace
>>> seconds at 86400.  I assume my only options are to force a major compaction
>>> via nodetool compact or to run upgradesstables?  How can I validate that
>>> this is the cause?  How can I prevent it going forward?  Set gc grace
>>> seconds to a much lower value for that table?
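>>>
>>> Concretely, I assume that means one of the following (using the
>>> keyspace/table from elsewhere in this thread, users/global_user):
>>>
>>>     nodetool compact users global_user           # force a major compaction
>>>     nodetool upgradesstables users global_user   # rewrite sstables in place
>>>
>>> though my understanding is that a "major" compaction under LCS does not
>>> behave like the size-tiered one, so upgradesstables may be the safer bet.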
>>>
>>> Thanks all!
>>>
>>> From: Keith Wright <kwright@nanigans.com>
>>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>> Date: Wednesday, August 21, 2013 8:31 AM
>>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>> Subject: Re: Nodes get stuck
>>>
>>> Thank you for responding.  I took a quick look: my mutation stage
>>> threads are currently in TIMED_WAITING (as expected, since tpstats shows
>>> no active or pending tasks), but most of my read stage threads are
>>> RUNNABLE with the stack traces below.  I haven't dug into them yet, but
>>> thought I would put them out there to see if anyone has ideas, since we
>>> are currently in a production-down state.
>>>
>>> Thanks all!
>>>
>>> Most of the threads show the first stack:
>>>
>>> java.nio.HeapByteBuffer.duplicate(HeapByteBuffer.java:107)
>>> org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:69)
>>> org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:31)
>>> java.util.TimSort.countRunAndMakeAscending(TimSort.java:329)
>>> java.util.TimSort.sort(TimSort.java:203)
>>> java.util.TimSort.sort(TimSort.java:173)
>>> java.util.Arrays.sort(Arrays.java:659)
>>> java.util.Collections.sort(Collections.java:217)
>>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:255)
>>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
>>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
>>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:281)
>>> org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
>>> org.apache.cassandra.utils.IntervalTree.build(IntervalTree.java:81)
>>> org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:175)
>>> org.apache.cassandra.db.AbstractThreadUnsafeSortedColumns.delete(AbstractThreadUnsafeSortedColumns.java:40)
>>> org.apache.cassandra.db.AbstractColumnContainer.delete(AbstractColumnContainer.java:51)
>>> org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:224)
>>> org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:182)
>>> org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
>>> org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:143)
>>> org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:122)
>>> org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:96)
>>> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>>> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>>> org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:157)
>>> org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
>>> org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
>>> org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:293)
>>> org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
>>> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1357)
>>> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
>>> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1126)
>>> org.apache.cassandra.db.Table.getRow(Table.java:347)
>>> org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
>>> org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:44)
>>> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> java.lang.Thread.run(Thread.java:722)
>>>
>>> Name: ReadStage:1719
>>> State: RUNNABLE
>>> Total blocked: 1,005  Total waited: 913
>>>
>>> Stack trace:
>>>
>>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:252)
>>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
>>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
>>> org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
>>> org.apache.cassandra.utils.IntervalTree.build(IntervalTree.java:81)
>>> org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:175)
>>> org.apache.cassandra.db.AbstractThreadUnsafeSortedColumns.delete(AbstractThreadUnsafeSortedColumns.java:40)
>>> org.apache.cassandra.db.AbstractColumnContainer.delete(AbstractColumnContainer.java:51)
>>> org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:224)
>>> org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:182)
>>> org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
>>> org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:143)
>>> org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:122)
>>> org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:96)
>>> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>>> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>>> org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:157)
>>> org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
>>> org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
>>> org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:293)
>>> org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
>>> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1357)
>>> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
>>> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1126)
>>> org.apache.cassandra.db.Table.getRow(Table.java:347)
>>> org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
>>> org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:44)
>>> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> java.lang.Thread.run(Thread.java:722)
>>>
>>> Name: ReadStage:1722
>>> State: RUNNABLE
>>> Total blocked: 1,001  Total waited: 897
>>>
>>> Stack trace:
>>> org.apache.cassandra.db.marshal.Int32Type.compare(Int32Type.java:58)
>>> org.apache.cassandra.db.marshal.Int32Type.compare(Int32Type.java:26)
>>> org.apache.cassandra.db.marshal.AbstractType.compareCollectionMembers(AbstractType.java:229)
>>> org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:81)
>>> org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:31)
>>> java.util.TimSort.binarySort(TimSort.java:265)
>>> java.util.TimSort.sort(TimSort.java:208)
>>> java.util.TimSort.sort(TimSort.java:173)
>>> java.util.Arrays.sort(Arrays.java:659)
>>> java.util.Collections.sort(Collections.java:217)
>>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:255)
>>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
>>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:281)
>>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
>>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:281)
>>> org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
>>> org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
>>> org.apache.cassandra.utils.IntervalTree.build(IntervalTree.java:81)
>>> org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:175)
>>> org.apache.cassandra.db.AbstractThreadUnsafeSortedColumns.delete(AbstractThreadUnsafeSortedColumns.java:40)
>>> org.apache.cassandra.db.AbstractColumnContainer.delete(AbstractColumnContainer.java:51)
>>> org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:224)
>>> org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:182)
>>> org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
>>> org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:143)
>>> org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:122)
>>> org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:96)
>>> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>>> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>>> org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:157)
>>> org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
>>> org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
>>> org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:293)
>>> org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
>>> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1357)
>>> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
>>> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1126)
>>> org.apache.cassandra.db.Table.getRow(Table.java:347)
>>> org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
>>> org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:44)
>>> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> java.lang.Thread.run(Thread.java:722)
>>>
>>>
>>> From: Sylvain Lebresne <sylvain@datastax.com>
>>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>> Date: Wednesday, August 21, 2013 6:21 AM
>>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>> Subject: Re: Nodes get stuck
>>>
>>> A thread dump on one of the machines showing suspiciously high CPU
>>> might help figure out what is taking all that CPU.
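>>>
>>> One quick way to grab one, assuming jstack from the JDK running
>>> Cassandra is on the path:
>>>
>>>     jstack -l $(pgrep -f CassandraDaemon) > /tmp/cassandra-threads.txt
>>>
>>> (kill -3 <pid> also works and writes the dump to Cassandra's stdout log.)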
>>>
>>>
>>> On Wed, Aug 21, 2013 at 8:57 AM, Keith Wright <kwright@nanigans.com> wrote:
>>>
>>>> Some last-minute info on this to hopefully shed light.  We are doing
>>>> ~200 reads and writes across our 7-node SSD cluster right now (we can
>>>> usually do at least 20K reads) and seeing CPU load as follows for the
>>>> nodes (with some ParNew numbers to give an idea of GC):
>>>>
>>>> 001 – 1200%   (Par New at 120 ms / sec)
>>>> 002 – 6% (Par New at 0)
>>>> 003 – 600% (Par New at 45 ms / sec)
>>>> 004 – 900%
>>>> 005 – 500%
>>>> 006 – 10%
>>>> 007 – 130%
>>>>
>>>> There are no compactions running on 001; however, I did see a broken-pipe
>>>> error in the logs there (see below).  Netstats for 001 shows nothing
>>>> pending.  It appears that all of the load/latency is related to one column
>>>> family.  You can see the cfstats & cfhistograms output below; note that we
>>>> are using LCS.  I have brought the odd cfhistograms behavior to the thread
>>>> before and am not sure what's going on there.  We are in a production-down
>>>> situation right now, so any help would be much appreciated!!!
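>>>>
>>>> For anyone following along, those checks amount to running something
>>>> like the following on each node:
>>>>
>>>>     nodetool compactionstats   # pending/active compactions
>>>>     nodetool netstats          # streaming and pending commands/responses
>>>>     nodetool tpstats           # per-stage backlog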
>>>>
>>>> Column Family: global_user
>>>> SSTable count: 7546
>>>> SSTables in each level: [2, 10, 106/100, 453, 6975, 0, 0]
>>>> Space used (live): 83848742562
>>>> Space used (total): 83848742562
>>>> Number of Keys (estimate): 549792896
>>>> Memtable Columns Count: 526746
>>>> Memtable Data Size: 117408252
>>>> Memtable Switch Count: 0
>>>> Read Count: 11673
>>>> Read Latency: 1950.062 ms.
>>>> Write Count: 118588
>>>> Write Latency: 0.080 ms.
>>>> Pending Tasks: 0
>>>> Bloom Filter False Positives: 4322
>>>> Bloom Filter False Ratio: 0.84066
>>>> Bloom Filter Space Used: 383507440
>>>> Compacted row minimum size: 73
>>>> Compacted row maximum size: 2816159
>>>> Compacted row mean size: 324
>>>>
>>>> [kwright@lxpcas001 ~]$ nodetool cfhistograms users global_user
>>>> users/global_user histograms
>>>> Offset     SSTables  Write Latency  Read Latency   Row Size  Column Count
>>>> 1              8866              0             0          0          3420
>>>> 2              1001              0             0          0      99218975
>>>> 3              1249              0             0          0     319713048
>>>> 4              1074              0             0          0      25073893
>>>> 5               132              0             0          0      15359199
>>>> 6                 0              0             0          0      27794925
>>>> 7                 0             12             0          0       7954974
>>>> 8                 0             23             0          0       7733934
>>>> 10                0            184             0          0      13276275
>>>> 12                0            567             0          0       9077508
>>>> 14                0           1098             0          0       5879292
>>>> 17                0           2722             0          0       5693471
>>>> 20                0           4379             0          0       3204131
>>>> 24                0           8928             0          0       2614995
>>>> 29                0          13525             0          0       1824584
>>>> 35                0          16759             0          0       1265911
>>>> 42                0          17048             0          0        868075
>>>> 50                0          14162             5          0        596417
>>>> 60                0          11806            15          0        467747
>>>> 72                0           8569           108          0        354276
>>>> 86                0           7042           276        227        269987
>>>> 103               0           5936           372       2972        218931
>>>> 124               0           4538           577        157        181360
>>>> 149               0           2981          1076    7388090        144298
>>>> 179               0           1929          1529   90535838        116628
>>>> 215               0           1081          1450  182701876         93378
>>>> 258               0            499          1125  141393480         74052
>>>> 310               0            124           756   18883224         58617
>>>> 372               0             31           460   24599272         45453
>>>> 446               0             25           247   23516772         34310
>>>> 535               0             10           146   13987584         26168
>>>> 642               0             20           194   12091458         19965
>>>> 770               0              8           196    9269197         14649
>>>> 924               0              9           340    8082898         11015
>>>> 1109              0              9           225    4762865          8058
>>>> 1331              0              9           154    3330110          5866
>>>> 1597              0              8           144    2367615          4275
>>>> 1916              0              1           188    1633608          3087
>>>> 2299              0              4           216    1139820          2196
>>>> 2759              0              5           201     819019          1456
>>>> 3311              0              4           194     600522          1135
>>>> 3973              0              6           181     454566           786
>>>> 4768              0             13           136     353886           587
>>>> 5722              0              6           152     280630           400
>>>> 6866              0              5            80     225545           254
>>>> 8239              0              6           112     183285           138
>>>> 9887              0              0            68     149820           109
>>>> 11864             0              5            99     121722            66
>>>> 14237             0             57            86      98352            50
>>>> 17084             0             18            99      79085            35
>>>> 20501             0              1            93      62423            11
>>>> 24601             0              0            61      49471             9
>>>> 29521             0              0            69      37395             5
>>>> 35425             0              4            56      28611             6
>>>> 42510             0              0            57      21876             1
>>>> 51012             0              9            60      16105             0
>>>> 61214             0              0            52      11996             0
>>>> 73457             0              0            50       8791             0
>>>> 88148             0              0            38       6430             0
>>>> 105778            0              0            25       4660             0
>>>> 126934            0              0            15       3308             0
>>>> 152321            0              0             2       2364             0
>>>> 182785            0              0             0       1631             0
>>>> 219342            0              0             0       1156             0
>>>> 263210            0              0             0        887             0
>>>> 315852            0              0             0        618             0
>>>> 379022            0              0             0        427             0
>>>> 454826            0              0             0        272             0
>>>> 545791            0              0             0        168             0
>>>> 654949            0              0             0        115             0
>>>> 785939            0              0             0         61             0
>>>> 943127            0              0             0         58             0
>>>> 1131752           0              0             0         34             0
>>>> 1358102           0              0             0         19             0
>>>> 1629722           0              0             0          9             0
>>>> 1955666           0              0             0          4             0
>>>> 2346799           0              0             0          5             0
>>>> 2816159           0              0             0          2             0
>>>> 3379391           0              0             0          0             0
>>>> 4055269           0              0             0          0             0
>>>> 4866323           0              0             0          0             0
>>>> 5839588           0              0             0          0             0
>>>> 7007506           0              0             0          0             0
>>>> 8409007           0              0             0          0             0
>>>> 10090808          0              0             0          0             0
>>>> 12108970          0              0             0          0             0
>>>> 14530764          0              0             0          0             0
>>>> 17436917          0              0             0          0             0
>>>> 20924300          0              0             0          0             0
>>>> 25109160          0              0             0          0             0
>>>>
>>>> ERROR [WRITE-/10.8.44.98] 2013-08-21 06:50:25,450
>>>> OutboundTcpConnection.java (line 197) error writing to /10.8.44.98
>>>> java.lang.RuntimeException: java.io.IOException: Broken pipe
>>>> at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:59)
>>>> at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:30)
>>>> at org.apache.cassandra.db.ColumnFamilySerializer.serialize(ColumnFamilySerializer.java:73)
>>>> at org.apache.cassandra.db.Row$RowSerializer.serialize(Row.java:62)
>>>> at org.apache.cassandra.db.ReadResponseSerializer.serialize(ReadResponse.java:78)
>>>> at org.apache.cassandra.db.ReadResponseSerializer.serialize(ReadResponse.java:69)
>>>> at org.apache.cassandra.net.MessageOut.serialize(MessageOut.java:131)
>>>> at org.apache.cassandra.net.OutboundTcpConnection.write(OutboundTcpConnection.java:221)
>>>> at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:186)
>>>> at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:144)
>>>> Caused by: java.io.IOException: Broken pipe
>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>>>> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>>>> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>>>> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
>>>> at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
>>>> at java.nio.channels.Channels.writeFully(Channels.java:98)
>>>> at java.nio.channels.Channels.access$000(Channels.java:61)
>>>> at java.nio.channels.Channels$1.write(Channels.java:174)
>>>> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>>>> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
>>>> at org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:297)
>>>> at org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:244)
>>>> at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:99)
>>>> at java.io.DataOutputStream.write(DataOutputStream.java:107)
>>>> at org.apache.cassandra.utils.ByteBufferUtil.write(ByteBufferUtil.java:328)
>>>> at org.apache.cassandra.utils.ByteBufferUtil.writeWithLength(ByteBufferUtil.java:315)
>>>> at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:55)
>>>> ... 9 more
>>>>
>>>>
>>>> From: Keith Wright <kwright@nanigans.com>
>>>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>> Date: Wednesday, August 21, 2013 2:35 AM
>>>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>> Subject: Re: Nodes get stuck
>>>>
>>>> Still looking for help!  We have stopped almost ALL traffic to the
>>>> cluster, and some nodes are still showing almost 1000% CPU for cassandra
>>>> with no iostat activity.  We were running cleanup on one of the nodes that
>>>> was not showing load spikes; however, now when I attempt to stop cleanup
>>>> there via nodetool stop cleanup, the java task for stopping cleanup itself
>>>> is at 1500% and has not returned after 2 minutes.  This is VERY odd
>>>> behavior.  Any ideas?  Hardware failure?  Network?  We are not seeing
>>>> anything there, but wanted to get ideas.
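>>>>
>>>> (For reference, the stop invocation was the standard form, something
>>>> like:
>>>>
>>>>     nodetool -h localhost stop CLEANUP
>>>>
>>>> which normally returns almost immediately.)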
>>>>
>>>> Thanks
>>>>
>>>> From: Keith Wright <kwright@nanigans.com>
>>>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>> Date: Tuesday, August 20, 2013 8:32 PM
>>>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>> Subject: Nodes get stuck
>>>>
>>>> Hi all,
>>>>
>>>>     We are using C* 1.2.4 with vnodes and SSDs.  We have seen behavior
>>>> recently where 3 of our nodes get locked up under high load in what appears
>>>> to be a GC spiral while the rest of the cluster (7 nodes total) appears fine.
>>>>  When I run tpstats, I see the following (assuming tpstats returns at
>>>> all), and top shows cassandra pegged at 2000%.  Obviously we have a large
>>>> number of blocked reads.  In the past I could explain this by
>>>> unexpectedly wide rows, but we have handled that.  When the cluster
>>>> starts to melt down like this, it's hard to get visibility into what's going
>>>> on and what triggered the issue, as everything starts to pile on.  OpsCenter
>>>> becomes unusable, and because the affected nodes are under GC pressure, getting
>>>> any data via nodetool or JMX is also difficult.  What do people do to
>>>> handle these situations?  We are going to start graphing
>>>> reads/writes/sec per CF in Ganglia in the hope that it helps.
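>>>>
>>>> As a stopgap, the graphing may just be a cron script that scrapes
>>>> cfstats and pushes the cumulative counters into Ganglia with gmetric,
>>>> roughly like this (crude, and the metric names are made up; Ganglia can
>>>> derive rates from successive samples):
>>>>
>>>>     #!/bin/sh
>>>>     # pull the global_user block out of cfstats, then the two counters
>>>>     STATS=$(nodetool cfstats | awk '/Column Family: global_user/,/^$/')
>>>>     READS=$(echo "$STATS" | awk '/Read Count/ {print $3}')
>>>>     WRITES=$(echo "$STATS" | awk '/Write Count/ {print $3}')
>>>>     gmetric --name=global_user_reads  --value="$READS"  --type=double --units=ops
>>>>     gmetric --name=global_user_writes --value="$WRITES" --type=double --units=ops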
>>>>
>>>> Thanks
>>>>
>>>> Pool Name                    Active   Pending      Completed   Blocked   All time blocked
>>>> ReadStage                       256       381     1245117434         0                  0
>>>> RequestResponseStage              0         0     1161495947         0                  0
>>>> MutationStage                     8         8      481721887         0                  0
>>>> ReadRepairStage                   0         0       85770600         0                  0
>>>> ReplicateOnWriteStage             0         0       21896804         0                  0
>>>> GossipStage                       0         0        1546196         0                  0
>>>> AntiEntropyStage                  0         0           5009         0                  0
>>>> MigrationStage                    0         0           1082         0                  0
>>>> MemtablePostFlusher               0         0          10178         0                  0
>>>> FlushWriter                       0         0           6081         0               2075
>>>> MiscStage                         0         0             57         0                  0
>>>> commitlog_archiver                0         0              0         0                  0
>>>> AntiEntropySessions               0         0              0         0                  0
>>>> InternalResponseStage             0         0              6         0                  0
>>>> HintedHandoff                     1         1            246         0                  0
>>>>
>>>> Message type           Dropped
>>>> RANGE_SLICE                482
>>>> READ_REPAIR                  0
>>>> BINARY                       0
>>>> READ                    515762
>>>> MUTATION                    39
>>>> _TRACE                       0
>>>> REQUEST_RESPONSE            29
>>>>
>>>>
>>>
>
