cassandra-commits mailing list archives

From "Jonathan Ellis (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-8295) Cassandra runs OOM @ java.util.concurrent.ConcurrentSkipListMap$HeadIndex
Date Mon, 17 Nov 2014 23:18:34 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215370#comment-14215370 ]

Jonathan Ellis commented on CASSANDRA-8295:
-------------------------------------------

Unthrottling compaction will make it worse, not better.

Fundamentally, you can only throw data at the disks so fast.  If you exceed that rate,
Cassandra has to shed load ("MUTATION messages dropped") and will start sending back
timeout exceptions.  At that point you can either respect the load shedding and back off,
or add capacity to meet your desired ingest rate.  Right now you are doing neither and
suffering for it.
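
As an illustration of what "back off" can look like on the client side, here is a minimal
sketch in Python. It assumes nothing about any particular driver: WriteTimeout is a
hypothetical stand-in for whatever timeout error the client library raises, and send is
any callable that performs one write. The delay grows exponentially with jitter and is
capped, so a struggling node sees progressively less traffic instead of a fixed retry rate.

```python
import random
import time


class WriteTimeout(Exception):
    """Hypothetical stand-in for the client driver's write-timeout error."""


def write_with_backoff(send, max_attempts=6, base_delay=0.5, max_delay=60.0,
                       sleep=time.sleep, rng=random.random):
    """Call send() until it succeeds, backing off after each timeout.

    Returns the number of attempts used. Re-raises WriteTimeout if every
    attempt times out. All parameter names and defaults are illustrative.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return attempt
        except WriteTimeout:
            if attempt == max_attempts:
                raise
            # Exponential ceiling, capped at max_delay, with full jitter so
            # that many clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(ceiling * rng())
```

The sleep and rng parameters exist only so the delays can be observed or replaced in a
test; a real loader would also want to lower its overall ingest rate, not just delay the
failed request.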

> Cassandra runs OOM @ java.util.concurrent.ConcurrentSkipListMap$HeadIndex
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8295
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8295
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: DSE 4.5.3 Cassandra 2.0.11.82
>            Reporter: Jose Martinez Poblete
>         Attachments: alln01-ats-cas3.cassandra.yaml, output.tgz, system.tgz, system.tgz.1, system.tgz.2, system.tgz.3
>
>
> The customer runs a 3-node cluster.
> Their dataset is less than 1 TB, and during data load one of the nodes enters a GC death spiral:
> {noformat}
>  INFO [ScheduledTasks:1] 2014-11-07 23:31:08,094 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 3348 ms for 2 collections, 1658268944 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-07 23:40:58,486 GCInspector.java (line 116) GC for ParNew: 442 ms for 2 collections, 6079570032 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-07 23:40:58,487 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 7351 ms for 2 collections, 6084678280 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-07 23:41:01,836 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 603 ms for 1 collections, 7132546096 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-07 23:41:09,626 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 761 ms for 1 collections, 7286946984 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-07 23:41:15,265 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 703 ms for 1 collections, 7251213520 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-07 23:41:25,027 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 1205 ms for 1 collections, 6507586104 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-07 23:41:41,374 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 13835 ms for 3 collections, 6514187192 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-07 23:41:54,137 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 6834 ms for 2 collections, 6521656200 used; max is 8375238656
> ...
>  INFO [ScheduledTasks:1] 2014-11-08 12:13:11,086 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 43967 ms for 2 collections, 8368777672 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-08 12:14:14,151 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 63968 ms for 3 collections, 8369623824 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-08 12:14:55,643 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 41307 ms for 2 collections, 8370115376 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-08 12:20:06,197 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 309634 ms for 15 collections, 8374994928 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-08 13:07:33,617 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 2681100 ms for 143 collections, 8347631560 used; max is 8375238656
> {noformat} 
> Their application waits 1 minute before retrying when a timeout is returned.
> This is what we found in their heap dumps:
> {noformat}
> Class Name                                                                  | Shallow Heap | Retained Heap | Percentage
> -----------------------------------------------------------------------------------------------------------------------
> org.apache.cassandra.db.Memtable @ 0x773f52f80                              |           72 | 8,086,073,504 |     96.66%
> |- java.util.concurrent.ConcurrentSkipListMap @ 0x724508fe8                 |           48 | 8,086,073,320 |     96.66%
> |  |- java.util.concurrent.ConcurrentSkipListMap$HeadIndex @ 0x64f9219a0    |           32 | 8,086,073,256 |     96.66%
> |  |  |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x614b081a8      |           24 |    16,230,976 |      0.19%
> |  |  |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x7da171948      |           24 |     4,922,288 |      0.06%
> |  |  |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x7f4518a80      |           24 |     4,405,496 |      0.05%
> |  |  |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x611d69d10      |           24 |     3,737,672 |      0.04%
> |  |  |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x71cd2fae8      |           24 |     2,921,048 |      0.03%
> |  |  |- java.util.concurrent.ConcurrentSkipListMap$HeadIndex @ 0x728faed50 |           32 |     2,012,592 |      0.02%
> |  |  |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x6387eb950      |           24 |     1,641,696 |      0.02%
> |  |  |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x727f474f0      |           24 |     1,328,936 |      0.02%
> |  |  |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x70d7a02b0      |           24 |     1,050,624 |      0.01%
> |  |  |- byte[1048576] @ 0x7d87873d8  .........8.........CS.l`...attributes...slot..............attributes...runtime......A..<x.........C.......attributes...procgid.87.....CS.`....attributes...bflush.00.....CV......attributes...username........uV....server.f1432541.........8...server......A..<...|
   1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x60ab7b920  .....7...p...attributes...tottime....../..%....area......0.......attributes...lineid.56.....7.i.....attributes...tottime.156258924.....0B)\....container.4...../.......server....,PTXCALsdihqprod1\sdihqprod1...../.......machine.fxcdom1.....7.i.....attributes...|
   1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x609fb54f8  .....E.......attributes...lineid.901137423.....E.......attributes...testr1.1413.....E.......attributes...testr2.M393B1K70QB0-YK02014-01-03
06:46:31.....E.......attributes...tenum1name.EFSTLOOP.....CV......attributes...numunits.1.....E.......attributes...pa...|
   1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x60a0b5508  .....E.z.....area......?.......attributes...labelnum.SYSFA.....0"U.....attributes...testr1name.D75165799...../..^....attributes...crc.Hexload_Bootloader.....E.TR....machine......0.......attributes...bflush....../..&....attributes...majline....../._.P...att...|
   1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7d8f5e2b8  ......B9.....machine.solfr5.......L.....attributes...runtime.146.............attributes...tottime.109.......t.h...attributes...bmap.0......B9.....uuttype.VIP2-40=.......L.....attributes...cpptimeid.2006-04-11
10:53:48.............attributes...partnum.73-91...|    1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7d905e2c8  .....E.|.x...attributes...runtime.310.....E.P.0...attributes...partnum.15-13637-02.....E./<....area.SYSFA.....E./<....passfail.S.....E.|.x...attributes...testr1.1413.....E.P.0...attributes...partnum2.15-13637-02.....E./<....container....T.....E./<....attri...|
   1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7d915e2d8  ...../..l........../..l....server.sdihqprod1\sdihqprod1...../..l....machine.fxcdom1...../..l....uuttype.73-12304-03...../..l....area.PASTE...../..l....passfail.P...../..l....container........../..l....attributes...majline.0...../..l....attributes...subslot...|
   1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7d925e2e8  .............attributes...tottime.42........KH...attributes...testtime.........3....uuttype.0.............attributes...runtime..............attributes...procgid.73-9341-021417817........3....area.PASTE.............attributes...test.PASSED........3....passf...|
   1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7be7473f0  .....=..Jh.........=...(...attributes...runtime.f6f1298f-830f-47f4-b1dd-1adb07b99ff9653.....=..*(...attributes...numunits.1.....=..Jh...server......=.._....attributes...testr3name.RCDN9HQPROD1\RCDN9HQPROD1CPPVersion:3.6.2803.0.....=..*(...attributes...test...|
   1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7be847800  .........0...passfail.P........ (...attributes...bflush.0.......w.....attributes...tottime.1161.....A.b.(...area.SYSVF.......kH....attributes...test..............attributes...runtime.PASSED.....A.zl....attributes...procgid.2.........0...container.............|
   1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7be949070  .....=...0...attributes...pcid......=..>....attributes...cpptimeid.6cc40f78-9525-4488-909f-2247d9537cf82013-04-04
19:24:23.....=.Z.....attributes...runtime.0.....=...0...attributes...testr3name.CPPVersion:3.6.2803.0.....=.............=...0...attributes...p...|
   1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7bea4a8e0  .....>{A0....uuttype.FJZPROD1\FJZPROD1.....>z.Mp...attributes...pcid......Ct..(...attributes...lineid......=..n8...container......=.oE..........4B......machine.F2049802CBLSTB-4044066-K9fxhmcekit2.....=.p(h...attributes...proctime......>z..`...machine.........|
   1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7beb4a8f0  .....A../....attributes...partnum2......B'......attributes...username....D.....B'.L....area.f1303257.....A...P...area.74-8071-01F118190965553.....A.......server.PCBDLSYSPM.......$.....attributes...runtime......>{r+....attributes...cpptimeid......B.B.
...co...|    1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7bec4a900  .....=..;x...attributes...username.tczpawe73-100074-01.....=..;x...attributes...slot.0.....=.......area.ASSY.....=..;x...attributes...lineid......=.......passfail.0P.....=.......container..........=..;x...attributes...numunits.1.....=.......attributes...pa...|
   1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7bed55ff0  .....=oC/....machine......"..Q....uuttype......"...`...attributes...parentsernum.73-8479-02FCZ133171DPfxcestgfqa1....."`..x...attributes...test......=l.2p...attributes...tenum3.8242009070919300730FOC13283D6A....."..Q....area......"...(...attributes...bflus...|
   1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x61cf45088  .....CSaL....server.FXCPROD1\FXCPROD1.....CW......passfail.P.....CSr.....attributes...runtime.50.....CSaL....machine.foxchict217.....CW......container..........CSaL....uuttype......CW......attributes...username.73-13315-03xzhang.....CSr.....attributes...te...|
   1,048,592 |     1,048,592 |      0.01%
> |  |  '- Total: 25 of 166,289 entries; 166,264 more                         |              |               |
> |  |- java.util.concurrent.ConcurrentSkipListMap$EntrySet @ 0x72541dc58     |           16 |            16 |      0.00%
> |  '- Total: 2 entries                                                      |              |               |
> |- org.github.jamm.MemoryMeter @ 0x72541db50                                |           24 |            40 |      0.00%
> |- java.util.concurrent.atomic.AtomicLong @ 0x72541db68                     |           24 |            24 |      0.00%
> |- java.util.concurrent.atomic.AtomicLong @ 0x72541db80                     |           24 |            24 |      0.00%
> |- java.util.concurrent.atomic.AtomicLong @ 0x72541db38                     |           24 |            24 |      0.00%
> '- Total: 5 entries                                                         |              |               |
> -----------------------------------------------------------------------------------------------------------------------
> {noformat}
> They are using the defaults in cassandra.yaml, which means sstables should not use that
much heap.  Setting the following has been of no use:
> {noformat}
> memtable_total_space_in_mb: 2000
> memtable_flush_queue_size: 1
> {noformat}
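
The GCInspector lines quoted in the report can be checked mechanically. The sketch below,
in Python, extracts the collector name, the pause time, and the "used" to "max" ratio from
one such line. The regular expression is an assumption based only on the log format shown
in this report; it tolerates a line break after the collector name, and other Cassandra
versions may log differently. The function name heap_occupancy is illustrative.

```python
import re

# Matches the tail of a GCInspector line as it appears in the quoted report.
GC_LINE = re.compile(
    r"GC for (?P<collector>\w+):\s+(?P<ms>\d+) ms for (?P<count>\d+) collections, "
    r"(?P<used>\d+) used; max is (?P<max>\d+)"
)


def heap_occupancy(line):
    """Return (collector, pause_ms, used/max) for one log line, or None."""
    m = GC_LINE.search(line)
    if m is None:
        return None
    return m["collector"], int(m["ms"]), int(m["used"]) / int(m["max"])


# One of the lines from the report, joined onto a single logical line.
line = ("INFO [ScheduledTasks:1] 2014-11-08 12:20:06,197 GCInspector.java (line 116) "
        "GC for ConcurrentMarkSweep: 309634 ms for 15 collections, "
        "8374994928 used; max is 8375238656")
collector, pause_ms, ratio = heap_occupancy(line)
print(collector, pause_ms, f"{ratio:.2%}")
```

A ratio that stays close to 1 after a full ConcurrentMarkSweep cycle means the collector
is reclaiming almost nothing, which is the pattern the later log lines in the report show.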



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
