cassandra-dev mailing list archives

From graham sanderson <gra...@vast.com>
Subject Re: Hinted Handoff/Atomic Updates & SnapTree replacement with BTree
Date Mon, 14 Jul 2014 17:22:29 GMT
Thanks, I will open two JIRA issues for discussion

1) Partitioning of hints - this makes sense anyway: there can be LOTS of hints in the same partition, either sort of tree has increasing per-element (rebalancing?) overhead as the number of elements grows… and this is the only use case I know of that causes really serious contention in the memtable.put/resolve code path, though clearly it could happen to some degree with any similarly shaped table. Whilst the code for this hint change is pretty straightforward (only mildly more complicated if we want to continue to deliver hints in globally sorted timestamp-uuid order for the node), I’m not sure what hoops we need to jump through to make upgrade painless, since it involves a system.hints schema change.
2) The contention itself - maybe for now we can just record contention for StatusLogger
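On (1), here is a minimal sketch of the kind of key derivation I have in mind (all names are hypothetical and the bucket count is arbitrary - Jonathan suggested ~10): derive the hint partition key from the target node plus a stable bucket of the original mutation's partition key, so hints for the same row always land in, and replay from, the same hint partition.

```java
import java.util.Arrays;
import java.util.UUID;

public class HintPartitioner {
    // Hypothetical bucket count; anything small works - it just spreads
    // contention across several memtable partitions per target node.
    static final int HINT_BUCKETS = 10;

    // Stable mapping: (target node, original mutation key) -> hint partition.
    // Same row always hashes to the same bucket, so per-row hint order is kept.
    static String hintPartitionKey(UUID targetNode, byte[] mutationKey) {
        int bucket = Math.floorMod(Arrays.hashCode(mutationKey), HINT_BUCKETS);
        return targetNode + ":" + bucket;
    }

    public static void main(String[] args) {
        UUID node = UUID.fromString("00000000-0000-0000-0000-000000000001");
        byte[] rowKey = "some-row".getBytes();
        // Deterministic: the same row maps to the same hint partition.
        System.out.println(hintPartitionKey(node, rowKey));
    }
}
```

Handoff then scans the HINT_BUCKETS partitions for a node instead of one - the small replay penalty Jonathan mentioned.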
> This discussion is probably better had on JIRA, but I favour an
> asynchronous approach permitting only one modifying thread access to the
> structure at a time, with each competing modification simply chaining their
> to-be-merged state to a pending-list, which is repaired at read time if it
> isn't merged before then. i.e. each writer attempts to take an exclusive
> lock on modification, and if it succeeds it merges its own state and any
> other pending state it sees on entry; if it fails to take the lock it
> appends the unmerged modification to a simple list, and returns success.

I thought about this, but had assumed the caller thread needed to guarantee that the work had completed before returning.
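As a toy sketch of that scheme (hypothetical names; a list of ints stands in for the real merged column state): writers publish to a pending queue, the tryLock winner merges everything it sees, losers return immediately, and readers repair before answering - which is exactly why the caller cannot assume its own update is merged on return.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantLock;

public class PendingMerge {
    private final ReentrantLock lock = new ReentrantLock();
    private final ConcurrentLinkedQueue<Integer> pending = new ConcurrentLinkedQueue<>();
    private final List<Integer> merged = new ArrayList<>();  // guarded by lock

    public void add(int value) {
        pending.add(value);        // publish unmerged state first
        if (lock.tryLock()) {      // only one modifying thread at a time
            try {
                drain();           // merge our state plus any other pending state
            } finally {
                lock.unlock();
            }
        }
        // else: return "success" - a later writer or reader will merge it
    }

    public List<Integer> read() {
        lock.lock();               // readers repair any unmerged state
        try {
            drain();
            return new ArrayList<>(merged);
        } finally {
            lock.unlock();
        }
    }

    private void drain() {
        Integer v;
        while ((v = pending.poll()) != null) merged.add(v);
    }
}
```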

> I plan to address this, but probably not until after 3.0, so if you're in a
> rush it may be worth exploring one of these alternative approaches. The
> simplest, least modifying solution, is probably to use unsafe to acquire
> the object monitor if we fail the first cas (but continue to execute the
> same codepath), so that we simply gate the amount of contention we can
> experience without incurring any extra costs in the common case.

Yeah, I had thought to use Unsafe to try to acquire the object monitor… but I wasn’t sure we could rely on it being available; and whatever you do, once you acquire something you need a try/finally, whose exact cost at the JVM level I don’t know. I’ll open the issue and post my patch, which adds no cost on the (recently) uncontended path beyond a read of a new (so tiny cost) non-volatile object reference field inside the Holder you have just read via AtomicReference.get, and there we can critique it/propose alternatives.
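For reference, a rough sketch of the shape I mean (a toy: an int counter stands in for the copy-and-merge of the Holder, and a plain synchronized block stands in for Unsafe.monitorEnter so it stays portable; the real patch also flips holder state so the fast path yields under contention, which I omit here - instead the contended path simply retries the CAS under the monitor, which keeps the sketch correct):

```java
import java.util.concurrent.atomic.AtomicReference;

public class CasWithMonitorFallback {
    // Immutable state behind an AtomicReference, as with the Holder.
    private final AtomicReference<Integer> holder = new AtomicReference<>(0);
    private static final int SPIN_LIMIT = 2;  // arbitrary cut-off for the fast path

    public void increment() {
        // Fast path: read, compute new state, CAS it back.
        for (int i = 0; i < SPIN_LIMIT; i++) {
            Integer cur = holder.get();
            if (holder.compareAndSet(cur, cur + 1)) return;
        }
        // Contended path: serialize writers on the object monitor so at most
        // one contended thread burns allocation/CPU respinning at a time.
        synchronized (this) {
            Integer cur;
            do {
                cur = holder.get();
            } while (!holder.compareAndSet(cur, cur + 1));
        }
    }

    public int get() { return holder.get(); }
}
```

The point is gating: under heavy contention most threads queue on the monitor instead of each spinning through allocating merge attempts.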

On Jul 14, 2014, at 6:52 AM, Benedict Elliott Smith <belliottsmith@datastax.com> wrote:

> This discussion is probably better had on JIRA, but I favour an
> asynchronous approach permitting only one modifying thread access to the
> structure at a time, with each competing modification simply chaining their
> to-be-merged state to a pending-list, which is repaired at read time if it
> isn't merged before then. i.e. each writer attempts to take an exclusive
> lock on modification, and if it succeeds it merges its own state and any
> other pending state it sees on entry; if it fails to take the lock it
> appends the unmerged modification to a simple list, and returns success.
> 
> I plan to address this, but probably not until after 3.0, so if you're in a
> rush it may be worth exploring one of these alternative approaches. The
> simplest, least modifying solution, is probably to use unsafe to acquire
> the object monitor if we fail the first cas (but continue to execute the
> same codepath), so that we simply gate the amount of contention we can
> experience without incurring any extra costs in the common case.
> 
> 
> On Mon, Jul 14, 2014 at 8:10 AM, graham sanderson <graham@vast.com> wrote:
> 
>> Thanks - yeah I figured we’d want to use the original RowMutation
>> partition key as part of the hint partition key such that at least hints
>> for the same row come in the right order (it isn’t actually clear how much
>> that matters).
>> 
>> All that said, I delved a little deeper into
>> AtomicSortedColumns.addAllWithSizeDelta, and got some new numbers (note I
>> updated my test to actually call this method rather than simulate it -
>> since I wanted to tweak it - and even though I am still skipping some
>> copying of byte buffers, the memory overhead for the hinting is
>> significant).
>> 
>> The major wildcard problem however is that addAllWithSizeDelta is
>> effectively a spin lock with a fair bit of memory allocation in it…
>> 
>> It seems to me that the alternative patch would be to extend the
>> “Holder” CAS state machine to approximately track concurrency, and fall
>> back down to a monitor under contention… I made such a patch on 2.0.x (I’ll
>> clean it up some and open an issue), and the results look pretty good:
>> 
>> This is the same sort of data as before, except for every thread_count,
>> element_count, element_size, partition_count tuple, I run with the original
>> version of addAllWithSizeDelta followed by the modified version.
>> 
>> Note the raw data size is now definitely a bit too small (just the size of
>> the UUID & the byte[] value), however relative numbers are interesting.
>> 
>> There are three levels of code:
>> 
>> “fast” - the original spin loop… thread coordination wise, it just reads
>> an AtomicReference, copies and updates into a new state and attempts to CAS
>> it back
>> “medium” - the same as fast, but inside a try/finally so that it can track
>> concurrency in an atomic integer instance (not available in fast path)
>> “slow” - the same as medium, but protected by the AtomicSortedColumns
>> instance’s monitor
>> 
>> The a/b number pairs below for these levels are (a) the number of
>> method invocations that passed through that level, and (b) the number
>> of loop iterations in that level… If you look closely you’ll see the
>> mix of levels changing based on how much contention there is (note
>> up/down are approximate counts of such level transitions for an
>> individual AtomicSortedColumns instance (aka partition in my test)).
>> 
>> Here is the worst case scenario:
>> 
>>    [junit] Threads = 100 elements = 100000 (of size 64) partitions = 1
>>    [junit]   Duration = 1323ms maxConcurrency = 100
>>    [junit]   GC for PS Scavenge: 94 ms for 28 collections
>>    [junit]   Approx allocation = 9183MB vs 7MB; ratio to raw data size =
>> 1203.740583
>>    [junit]   fast 100000/1542949 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 1521ms maxConcurrency = 100
>>    [junit]   GC for PS Scavenge: 15 ms for 1 collections
>>    [junit]   Approx allocation = 550MB vs 7MB; ratio to raw data size =
>> 72.21061
>>    [junit]   fast 220/233 medium 3/3 slow 99779/99780 up 1 down 1
>> 
>> Note that the original code did approximately 15 loop iterations (spins)
>> per call - and ended up allocating about 9gig for 100000 64 byte values
>> 
>> The new code pretty much reverted to an object monitor in this case, and
>> only allocated about 550meg.
>> 
>> (In this case the new code was slower, but this generally seems to be
>> pretty much within the margin of error - I need a longer test to
>> determine that - there is really no reason it should be slower, other
>> than that I have some debug counters in there.)
>> 
>> Here’s another one picked at random with a mix of different levels (note
>> that the original code is slower because it does a bunch of
>> memory-allocating spins - 190622 iterations for 100000 inserts)
>> 
>>    [junit] Threads = 100 elements = 100000 (of size 256) partitions = 16
>>    [junit]   Duration = 216ms maxConcurrency = 99
>>    [junit]   GC for PS Scavenge: 25 ms for 2 collections
>>    [junit]   Approx allocation = 772MB vs 25MB; ratio to raw data size =
>> 29.778822352941177
>>    [junit]   fast 100000/190622 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 215ms maxConcurrency = 99
>>    [junit]   GC for PS Scavenge: 27 ms for 2 collections
>>    [junit]   Approx allocation = 566MB vs 25MB; ratio to raw data size =
>> 21.856579705882353
>>    [junit]   fast 50873/67546 medium 18283/18919 slow 40230/47257 up
>> 10671 down 10339
>> 
>> Here are the full results
>> 
>>    [junit] Threads = 1 elements = 100000 (of size 64) partitions = 1
>>    [junit]   Duration = 715ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 25 ms for 3 collections
>>    [junit]   Approx allocation = 554MB vs 7MB; ratio to raw data size =
>> 72.644926
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 364ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 32 ms for 3 collections
>>    [junit]   Approx allocation = 566MB vs 7MB; ratio to raw data size =
>> 74.21657
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit] Threads = 1 elements = 100000 (of size 64) partitions = 16
>>    [junit]   Duration = 296ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 18 ms for 2 collections
>>    [junit]   Approx allocation = 464MB vs 7MB; ratio to raw data size =
>> 60.853517
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 334ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 19 ms for 2 collections
>>    [junit]   Approx allocation = 442MB vs 7MB; ratio to raw data size =
>> 57.934483
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit] Threads = 1 elements = 100000 (of size 64) partitions = 256
>>    [junit]   Duration = 250ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 9 ms for 1 collections
>>    [junit]   Approx allocation = 329MB vs 7MB; ratio to raw data size =
>> 43.249159
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 238ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 9 ms for 1 collections
>>    [junit]   Approx allocation = 327MB vs 7MB; ratio to raw data size =
>> 42.943091
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit] Threads = 1 elements = 100000 (of size 64) partitions = 1024
>>    [junit]   Duration = 225ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 11 ms for 1 collections
>>    [junit]   Approx allocation = 276MB vs 7MB; ratio to raw data size =
>> 36.303192
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 231ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 11 ms for 1 collections
>>    [junit]   Approx allocation = 274MB vs 7MB; ratio to raw data size =
>> 36.043359
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit] Threads = 100 elements = 100000 (of size 64) partitions = 1
>>    [junit]   Duration = 1323ms maxConcurrency = 100
>>    [junit]   GC for PS Scavenge: 94 ms for 28 collections
>>    [junit]   Approx allocation = 9183MB vs 7MB; ratio to raw data size =
>> 1203.740583
>>    [junit]   fast 100000/1542949 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 1521ms maxConcurrency = 100
>>    [junit]   GC for PS Scavenge: 15 ms for 1 collections
>>    [junit]   Approx allocation = 550MB vs 7MB; ratio to raw data size =
>> 72.21061
>>    [junit]   fast 220/233 medium 3/3 slow 99779/99780 up 1 down 1
>>    [junit]
>>    [junit] Threads = 100 elements = 100000 (of size 64) partitions = 16
>>    [junit]   Duration = 218ms maxConcurrency = 100
>>    [junit]   GC for PS Scavenge: 24 ms for 2 collections
>>    [junit]   Approx allocation = 739MB vs 7MB; ratio to raw data size =
>> 96.934667
>>    [junit]   fast 100000/182402 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 195ms maxConcurrency = 99
>>    [junit]   GC for PS Scavenge: 9 ms for 1 collections
>>    [junit]   Approx allocation = 567MB vs 7MB; ratio to raw data size =
>> 74.399013
>>    [junit]   fast 50466/67007 medium 18078/18750 slow 40781/47692 up
>> 10574 down 10249
>>    [junit]
>>    [junit] Threads = 100 elements = 100000 (of size 64) partitions = 256
>>    [junit]   Duration = 170ms maxConcurrency = 98
>>    [junit]   GC for PS Scavenge: 14 ms for 1 collections
>>    [junit]   Approx allocation = 341MB vs 7MB; ratio to raw data size =
>> 44.826425
>>    [junit]   fast 100000/102082 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 174ms maxConcurrency = 100
>>    [junit]   GC for PS Scavenge: 13 ms for 1 collections
>>    [junit]   Approx allocation = 344MB vs 7MB; ratio to raw data size =
>> 45.096833
>>    [junit]   fast 98009/100008 medium 2003/2006 slow 51/94 up 1964 down
>> 1958
>>    [junit]
>>    [junit] Threads = 100 elements = 100000 (of size 64) partitions = 1024
>>    [junit]   Duration = 168ms maxConcurrency = 95
>>    [junit]   GC for PS Scavenge: 14 ms for 1 collections
>>    [junit]   Approx allocation = 283MB vs 7MB; ratio to raw data size =
>> 37.095636
>>    [junit]   fast 100000/100427 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 177ms maxConcurrency = 98
>>    [junit]   GC for PS Scavenge: 13 ms for 1 collections
>>    [junit]   Approx allocation = 292MB vs 7MB; ratio to raw data size =
>> 38.384233
>>    [junit]   fast 99596/100001 medium 404/404 slow 0/0 up 405 down 404
>>    [junit]
>>    [junit] --------------------------------------------------
>>    [junit] Threads = 1 elements = 100000 (of size 256) partitions = 1
>>    [junit]   Duration = 793ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 23 ms for 2 collections
>>    [junit]   Approx allocation = 558MB vs 25MB; ratio to raw data size =
>> 21.542246176470588
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 351ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 22 ms for 2 collections
>>    [junit]   Approx allocation = 566MB vs 25MB; ratio to raw data size =
>> 21.824048823529413
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit] Threads = 1 elements = 100000 (of size 256) partitions = 16
>>    [junit]   Duration = 316ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 8 ms for 1 collections
>>    [junit]   Approx allocation = 456MB vs 25MB; ratio to raw data size =
>> 17.608017352941175
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 789ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 12 ms for 1 collections
>>    [junit]   Approx allocation = 462MB vs 25MB; ratio to raw data size =
>> 17.84756294117647
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit] Threads = 1 elements = 100000 (of size 256) partitions = 256
>>    [junit]   Duration = 612ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 13 ms for 1 collections
>>    [junit]   Approx allocation = 349MB vs 25MB; ratio to raw data size =
>> 13.467229411764706
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 495ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 13 ms for 1 collections
>>    [junit]   Approx allocation = 351MB vs 25MB; ratio to raw data size =
>> 13.550963235294118
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit] Threads = 1 elements = 100000 (of size 256) partitions = 1024
>>    [junit]   Duration = 422ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 13 ms for 1 collections
>>    [junit]   Approx allocation = 292MB vs 25MB; ratio to raw data size =
>> 11.267923823529411
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 405ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 12 ms for 1 collections
>>    [junit]   Approx allocation = 294MB vs 25MB; ratio to raw data size =
>> 11.371588235294118
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit] Threads = 100 elements = 100000 (of size 256) partitions = 1
>>    [junit]   Duration = 2197ms maxConcurrency = 100
>>    [junit]   GC for PS Scavenge: 122 ms for 32 collections
>>    [junit]   Approx allocation = 10317MB vs 25MB; ratio to raw data size
>> = 397.75221411764704
>>    [junit]   fast 100000/1777796 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 1253ms maxConcurrency = 100
>>    [junit]   GC for PS Scavenge: 17 ms for 1 collections
>>    [junit]   Approx allocation = 565MB vs 25MB; ratio to raw data size =
>> 21.804268823529412
>>    [junit]   fast 582/585 medium 3/3 slow 99417/99418 up 1 down 1
>>    [junit]
>>    [junit] Threads = 100 elements = 100000 (of size 256) partitions = 16
>>    [junit]   Duration = 216ms maxConcurrency = 99
>>    [junit]   GC for PS Scavenge: 25 ms for 2 collections
>>    [junit]   Approx allocation = 772MB vs 25MB; ratio to raw data size =
>> 29.778822352941177
>>    [junit]   fast 100000/190622 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 215ms maxConcurrency = 99
>>    [junit]   GC for PS Scavenge: 27 ms for 2 collections
>>    [junit]   Approx allocation = 566MB vs 25MB; ratio to raw data size =
>> 21.856579705882353
>>    [junit]   fast 50873/67546 medium 18283/18919 slow 40230/47257 up
>> 10671 down 10339
>>    [junit]
>>    [junit] Threads = 100 elements = 100000 (of size 256) partitions = 256
>>    [junit]   Duration = 187ms maxConcurrency = 97
>>    [junit]   GC for PS Scavenge: 14 ms for 1 collections
>>    [junit]   Approx allocation = 345MB vs 25MB; ratio to raw data size =
>> 13.314321470588235
>>    [junit]   fast 100000/101946 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 183ms maxConcurrency = 98
>>    [junit]   GC for PS Scavenge: 12 ms for 1 collections
>>    [junit]   Approx allocation = 354MB vs 25MB; ratio to raw data size =
>> 13.662984705882353
>>    [junit]   fast 98108/100024 medium 1911/1914 slow 32/61 up 1887 down
>> 1882
>>    [junit]
>>    [junit] Threads = 100 elements = 100000 (of size 256) partitions = 1024
>>    [junit]   Duration = 185ms maxConcurrency = 99
>>    [junit]   GC for PS Scavenge: 13 ms for 1 collections
>>    [junit]   Approx allocation = 293MB vs 25MB; ratio to raw data size =
>> 11.313215294117647
>>    [junit]   fast 100000/100399 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 176ms maxConcurrency = 94
>>    [junit]   GC for PS Scavenge: 14 ms for 1 collections
>>    [junit]   Approx allocation = 294MB vs 25MB; ratio to raw data size =
>> 11.350447647058823
>>    [junit]   fast 99573/100000 medium 426/426 slow 2/4 up 426 down 424
>>    [junit]
>>    [junit] --------------------------------------------------
>>    [junit] Threads = 1 elements = 100000 (of size 1024) partitions = 1
>>    [junit]   Duration = 993ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 29 ms for 2 collections
>>    [junit]   Approx allocation = 628MB vs 99MB; ratio to raw data size =
>> 6.339322461538462
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 797ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 40 ms for 3 collections
>>    [junit]   Approx allocation = 623MB vs 99MB; ratio to raw data size =
>> 6.285295538461538
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit] Threads = 1 elements = 100000 (of size 1024) partitions = 16
>>    [junit]   Duration = 886ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 29 ms for 2 collections
>>    [junit]   Approx allocation = 508MB vs 99MB; ratio to raw data size =
>> 5.130343307692308
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 820ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 28 ms for 2 collections
>>    [junit]   Approx allocation = 535MB vs 99MB; ratio to raw data size =
>> 5.398294076923077
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit] Threads = 1 elements = 100000 (of size 1024) partitions = 256
>>    [junit]   Duration = 594ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 28 ms for 2 collections
>>    [junit]   Approx allocation = 392MB vs 99MB; ratio to raw data size =
>> 3.9560997692307693
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 484ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 30 ms for 2 collections
>>    [junit]   Approx allocation = 410MB vs 99MB; ratio to raw data size =
>> 4.141943
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit] Threads = 1 elements = 100000 (of size 1024) partitions = 1024
>>    [junit]   Duration = 401ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 20 ms for 2 collections
>>    [junit]   Approx allocation = 474MB vs 99MB; ratio to raw data size =
>> 4.779102
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 405ms maxConcurrency = 1
>>    [junit]   GC for PS Scavenge: 25 ms for 2 collections
>>    [junit]   Approx allocation = 367MB vs 99MB; ratio to raw data size =
>> 3.7006123846153844
>>    [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit] Threads = 100 elements = 100000 (of size 1024) partitions = 1
>>    [junit]   Duration = 2383ms maxConcurrency = 100
>>    [junit]   GC for PS Scavenge: 144 ms for 37 collections
>>    [junit]   Approx allocation = 11728MB vs 99MB; ratio to raw data size
>> = 118.24829784615385
>>    [junit]   fast 100000/1781388 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 1298ms maxConcurrency = 100
>>    [junit]   GC for PS Scavenge: 54 ms for 2 collections
>>    [junit]   Approx allocation = 490MB vs 99MB; ratio to raw data size =
>> 4.945822
>>    [junit]   fast 312/314 medium 2/2 slow 99687/99688 up 1 down 1
>>    [junit]
>>    [junit] Threads = 100 elements = 100000 (of size 1024) partitions = 16
>>    [junit]   Duration = 231ms maxConcurrency = 100
>>    [junit]   GC for PS Scavenge: 38 ms for 3 collections
>>    [junit]   Approx allocation = 856MB vs 99MB; ratio to raw data size =
>> 8.638853769230769
>>    [junit]   fast 100000/189676 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 237ms maxConcurrency = 100
>>    [junit]   GC for PS Scavenge: 40 ms for 3 collections
>>    [junit]   Approx allocation = 634MB vs 99MB; ratio to raw data size =
>> 6.397918692307693
>>    [junit]   fast 55642/73346 medium 19132/19867 slow 34655/41616 up
>> 11578 down 11266
>>    [junit]
>>    [junit] Threads = 100 elements = 100000 (of size 1024) partitions = 256
>>    [junit]   Duration = 212ms maxConcurrency = 99
>>    [junit]   GC for PS Scavenge: 38 ms for 2 collections
>>    [junit]   Approx allocation = 411MB vs 99MB; ratio to raw data size =
>> 4.148154692307692
>>    [junit]   fast 100000/101956 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 209ms maxConcurrency = 99
>>    [junit]   GC for PS Scavenge: 29 ms for 2 collections
>>    [junit]   Approx allocation = 408MB vs 99MB; ratio to raw data size =
>> 4.121041076923077
>>    [junit]   fast 98133/100004 medium 1876/1879 slow 39/73 up 1845 down
>> 1842
>>    [junit]
>>    [junit] Threads = 100 elements = 100000 (of size 1024) partitions =
>> 1024
>>    [junit]   Duration = 203ms maxConcurrency = 97
>>    [junit]   GC for PS Scavenge: 30 ms for 2 collections
>>    [junit]   Approx allocation = 368MB vs 99MB; ratio to raw data size =
>> 3.7190386923076924
>>    [junit]   fast 100000/100417 medium 0/0 slow 0/0 up 0 down 0
>>    [junit]
>>    [junit]  -
>>    [junit]   Duration = 201ms maxConcurrency = 100
>>    [junit]   GC for PS Scavenge: 31 ms for 2 collections
>>    [junit]   Approx allocation = 360MB vs 99MB; ratio to raw data size =
>> 3.632088230769231
>>    [junit]   fast 99611/100000 medium 389/389 slow 1/2 up 389 down 388
>>    [junit]
>> 
>> On Jul 13, 2014, at 10:16 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
>> 
>> Thanks for the analysis.  It should be pretty straightforward to
>> assign hints to one of 10 random partitions, then handoff can just
>> scan for each partition key.  Small penalty on handoff to mitigate
>> your problem.
>> 
>> On Sun, Jul 13, 2014 at 4:25 PM, graham sanderson <graham@vast.com> wrote:
>> 
>> Here are some numbers from latest 2.1 (on the production level h/w)
>> 
>> BTree is better off than was SnapTreeMap (note also that I allocated my
>> cells on heap, however all of the extra allocation under race conditions
>> itself is actually the Object[], so is going to be on heap anyway):
>> - Generally the data structure overhead is much smaller
>> - The contention cost drops off much quicker with number of partitions
>> (i.e.
>> 16 is actually pretty fine, though if you’re going with >1 then 256 is not
>> much harder than 16 unless you want to use a single IN clause).
>> 
>> However it still does suffer when a huge amount of almost sorted data is
>> inserted very fast as in the hints case (note I tried with random
>> timestamps
>> out of curiosity, and it was somewhat better as expected, but not
>> significantly).
>> 
>> So the moral of the story seems to be that we probably shouldn’t be writing
>> hints for a node to a single partition… (ironically lots of small inserts
>> means that the write rate is faster, hence more concurrency of hinting, and
>> a mere 100M of inserted data ends up allocating 10s of gigs of RAM on the
>> node during the hinting process) any thoughts appreciated, but I’ll likely
>> move thread to a ticket shortly.
>> 
>> Note, finally that with SnapTreeMap a lot of extra rapidly allocated data
>> was live long enough to spill into the old gen, causing eventual promotion
>> failure of memtables on the hinting nodes too, that may not be the case
>> with
>> BTree on 2.1, and of course off heap memtables will help this issue too.
>> 
>>   [junit] Threads = 1 elements = 250000 (of size 64) partitions = 1
>>   [junit]   Duration = 1129ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 36 ms for 2 collections
>>   [junit]   Approx allocation = 249MB vs 38MB; ratio to raw data size =
>> 6.5352668
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 64) partitions = 16
>>   [junit]   Duration = 349ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 11 ms for 1 collections
>>   [junit]   Approx allocation = 197MB vs 38MB; ratio to raw data size =
>> 5.1714
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 64) partitions = 256
>>   [junit]   Duration = 835ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 19 ms for 1 collections
>>   [junit]   Approx allocation = 147MB vs 38MB; ratio to raw data size =
>> 3.8699954
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 64) partitions = 1024
>>   [junit]   Duration = 635ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 23 ms for 1 collections
>>   [junit]   Approx allocation = 118MB vs 38MB; ratio to raw data size =
>> 3.1053228
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1
>>   [junit]   Duration = 2079ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 130 ms for 24 collections
>>   [junit]   Approx allocation = 6484MB vs 38MB; ratio to raw data size =
>> 169.9863816
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 64) partitions = 16
>>   [junit]   Duration = 363ms maxConcurrency = 100
>>   [junit]   Approx allocation = 238MB vs 38MB; ratio to raw data size =
>> 6.2529482
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 64) partitions = 256
>>   [junit]   Duration = 370ms maxConcurrency = 96
>>   [junit]   Approx allocation = 154MB vs 38MB; ratio to raw data size =
>> 4.055179
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1024
>>   [junit]   Duration = 353ms maxConcurrency = 100
>>   [junit]   Approx allocation = 124MB vs 38MB; ratio to raw data size =
>> 3.2727604
>>   [junit]
>>   [junit] --------------------------------------------------
>>   [junit] Threads = 1 elements = 250000 (of size 256) partitions = 1
>>   [junit]   Duration = 1055ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 19 ms for 1 collections
>>   [junit]   Approx allocation = 237MB vs 83MB; ratio to raw data size =
>> 2.831478363636364
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 256) partitions = 16
>>   [junit]   Duration = 667ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 16 ms for 1 collections
>>   [junit]   Approx allocation = 235MB vs 83MB; ratio to raw data size =
>> 2.802512909090909
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 256) partitions = 256
>>   [junit]   Duration = 606ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 19 ms for 1 collections
>>   [junit]   Approx allocation = 175MB vs 83MB; ratio to raw data size =
>> 2.091785727272727
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 256) partitions = 1024
>>   [junit]   Duration = 268ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 14 ms for 1 collections
>>   [junit]   Approx allocation = 149MB vs 83MB; ratio to raw data size =
>> 1.7858322727272726
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 256) partitions = 1
>>   [junit]   Duration = 2120ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 143 ms for 23 collections
>>   [junit]   Approx allocation = 6248MB vs 83MB; ratio to raw data size =
>> 74.454192
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 256) partitions = 16
>>   [junit]   Duration = 310ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 33 ms for 1 collections
>>   [junit]   Approx allocation = 236MB vs 83MB; ratio to raw data size =
>> 2.817435
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 256) partitions = 256
>>   [junit]   Duration = 383ms maxConcurrency = 99
>>   [junit]   Approx allocation = 199MB vs 83MB; ratio to raw data size =
>> 2.3777502727272726
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 256) partitions = 1024
>>   [junit]   Duration = 387ms maxConcurrency = 98
>>   [junit]   GC for PS Scavenge: 25 ms for 1 collections
>>   [junit]   Approx allocation = 176MB vs 83MB; ratio to raw data size =
>> 2.1082142727272726
>>   [junit]
>>   [junit] --------------------------------------------------
>>   [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 1
>>   [junit]   Duration = 1230ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 64 ms for 3 collections
>>   [junit]   Approx allocation = 457MB vs 267MB; ratio to raw data size =
>> 1.711579
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 16
>>   [junit]   Duration = 898ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 58 ms for 3 collections
>>   [junit]   Approx allocation = 427MB vs 267MB; ratio to raw data size =
>> 1.6025206571428572
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 256
>>   [junit]   Duration = 944ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 72 ms for 3 collections
>>   [junit]   Approx allocation = 377MB vs 267MB; ratio to raw data size =
>> 1.4138783714285714
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 1024
>>   [junit]   Duration = 877ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 78 ms for 3 collections
>>   [junit]   Approx allocation = 347MB vs 267MB; ratio to raw data size =
>> 1.3008453714285715
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 1
>>   [junit]   Duration = 2073ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 166 ms for 28 collections
>>   [junit]   Approx allocation = 7052MB vs 267MB; ratio to raw data size =
>> 26.410422285714287
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 16
>>   [junit]   Duration = 398ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 45 ms for 2 collections
>>   [junit]   Approx allocation = 375MB vs 267MB; ratio to raw data size =
>> 1.4073654
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 256
>>   [junit]   Duration = 448ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 71 ms for 3 collections
>>   [junit]   Approx allocation = 369MB vs 267MB; ratio to raw data size =
>> 1.3841326
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 1024
>>   [junit]   Duration = 448ms maxConcurrency = 99
>>   [junit]   GC for PS Scavenge: 77 ms for 3 collections
>>   [junit]   Approx allocation = 354MB vs 267MB; ratio to raw data size =
>> 1.3269438571428571
>>   [junit]
>>   [junit] --------------------------------------------------
>>   [junit] ==================================================
>> 
>> On Jul 13, 2014, at 11:34 AM, graham sanderson <graham@vast.com> wrote:
>> 
>> OK, some numbers from my test:
>> 
>> Basically the test mimics the 2.0.x write path for a hint -
>> there are P partitions (each has an AtomicReference to a SnapTreeMap sorted
>> by timeuuid comparator)
>> there are N threads
>> “elements” is the total number of timeUUID/value pairs (aka columns)
>> each thread inserts the same number of elements (“elements”/N)
>> inserting the element is done by getting the atomic reference, cloning the
>> SnapTreeMap, adding the element, and trying to CAS the updated reference
>> back.
>> the value of the element in this case is a byte[] of the size indicated
>> allocated by the current thread
>> 
>> the test approximately counts the amount of eden allocation… the ratio
>> shown
>> is how much was allocated vs. the rough size of the raw data to be inserted.
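>> 
>> (The thread doesn't show how the allocation counting is done; one plausible
>> mechanism on HotSpot, sketched here as an assumption, is the
>> com.sun.management extension of ThreadMXBean. The class and method names
>> below are illustrative, not taken from the actual test.)

```java
import java.lang.management.ManagementFactory;

// Hedged sketch: approximate per-thread heap allocation on HotSpot via the
// com.sun.management.ThreadMXBean extension. Sampling before and after the
// insert loop gives roughly the eden allocation attributable to that thread.
public class AllocationProbe {
    private static final com.sun.management.ThreadMXBean THREADS =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    // Bytes allocated by the current thread so far; diff two samples
    // around the workload to approximate its allocation.
    public static long allocatedBytes() {
        return THREADS.getThreadAllocatedBytes(Thread.currentThread().getId());
    }

    public static void main(String[] args) {
        long before = allocatedBytes();
        byte[][] junk = new byte[1000][];
        for (int i = 0; i < junk.length; i++)
            junk[i] = new byte[256]; // ~256KB of "raw data"
        long after = allocatedBytes();
        // ratio of measured allocation to raw data size, as in the reported results
        System.out.println("approx ratio = " + (after - before) / (1000.0 * 256));
    }
}
```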
>> 
>> This test is a worst-case scenario insofar as there is no sleeping
>> (though on my laptop, as you can see from maxConcurrency - which is the
>> maximum number of threads in the element loop at any point - I fail to hit
>> it very hard, whereas one of our production-class machines can maintain
>> 100-thread concurrency).
>> 
>> Full results are at the bottom (note I ran with 1 thread and 100 threads,
>> element sizes 64, 256, 1024 bytes, and 1, 16, 256, 1024 partitions).
>> 
>> Here are some worst-case examples (obviously the ratio is higher even at
>> high partition counts here, because the size of even a single tree
>> structure
>> becomes significant compared to the size of the data stored):
>> 
>> 
>>   [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1
>>   [junit]   Duration = 3922ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 101 ms for 28 collections
>>   [junit]   Approx allocation = 8691MB vs 38MB; ratio to raw data size =
>> 227.833744
>>   [junit]
>> 
>> vs
>> 
>>   [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1024
>>   [junit]   Duration = 399ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 21 ms for 1 collections
>>   [junit]   Approx allocation = 320MB vs 38MB; ratio to raw data size =
>> 8.4010366
>> 
>> Note, I’ll try the same test on 2.1 against the new write path.
>> 
>> =====
>> 
>> FYI insertion of one element
>> 
>> AtomicReference<SnapTreeMap<ByteBuffer, byte[]>> ref = maps.get(random.nextInt(partitionCount));
>> SnapTreeMap<ByteBuffer, byte[]> current, modified;
>> do
>> {
>>     current = ref.get();
>>     modified = current.clone();
>>     if (null != modified.putIfAbsent(key, value))
>>     {
>>         modified.replace(key, value);
>>     }
>> } while (!ref.compareAndSet(current, modified));
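>> 
>> (A self-contained sketch of the same copy-on-write CAS pattern, using a
>> plain TreeMap since SnapTreeMap is not a stdlib type. Note the cost model
>> differs: SnapTreeMap.clone() is a cheap lazy snapshot, whereas
>> TreeMap.clone() copies eagerly, so this only illustrates the race, not the
>> allocation behavior being benchmarked.)

```java
import java.nio.ByteBuffer;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;

// Copy-on-write insert: clone the map, mutate the clone, then CAS the
// reference; on CAS failure another writer won, so retry against the new map.
public class CowInsert {
    static final AtomicReference<TreeMap<ByteBuffer, byte[]>> ref =
            new AtomicReference<>(new TreeMap<>());

    @SuppressWarnings("unchecked")
    static void insert(ByteBuffer key, byte[] value) {
        TreeMap<ByteBuffer, byte[]> current, modified;
        do {
            current = ref.get();
            modified = (TreeMap<ByteBuffer, byte[]>) current.clone();
            // put covers both the putIfAbsent and replace branches above
            modified.put(key, value);
        } while (!ref.compareAndSet(current, modified));
    }
}
```

Every losing writer throws away a full clone, which is where the "inherent concurrent race waste" discussed in the thread comes from.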
>> 
>> 
>> Full results from my laptop
>> 
>>   [junit] Threads = 1 elements = 250000 (of size 64) partitions = 1
>>   [junit]   Duration = 499ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 29 ms for 4 collections
>>   [junit]   Approx allocation = 539MB vs 38MB; ratio to raw data size =
>> 14.1319278
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 64) partitions = 16
>>   [junit]   Duration = 445ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 26 ms for 3 collections
>>   [junit]   Approx allocation = 446MB vs 38MB; ratio to raw data size =
>> 11.7168844
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 64) partitions = 256
>>   [junit]   Duration = 395ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 20 ms for 2 collections
>>   [junit]   Approx allocation = 370MB vs 38MB; ratio to raw data size =
>> 9.7106606
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 64) partitions = 1024
>>   [junit]   Duration = 399ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 24 ms for 2 collections
>>   [junit]   Approx allocation = 311MB vs 38MB; ratio to raw data size =
>> 8.1768922
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1
>>   [junit]   Duration = 1143ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 148 ms for 18 collections
>>   [junit]   Approx allocation = 3853MB vs 38MB; ratio to raw data size =
>> 101.0072052
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 64) partitions = 16
>>   [junit]   Duration = 171ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 22 ms for 2 collections
>>   [junit]   Approx allocation = 605MB vs 38MB; ratio to raw data size =
>> 15.8683486
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 64) partitions = 256
>>   [junit]   Duration = 8ms maxConcurrency = 12
>>   [junit]   GC for PS Scavenge: 14 ms for 1 collections
>>   [junit]   Approx allocation = 370MB vs 38MB; ratio to raw data size =
>> 9.713281
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1024
>>   [junit]   Duration = 3ms maxConcurrency = 9
>>   [junit]   GC for PS Scavenge: 20 ms for 1 collections
>>   [junit]   Approx allocation = 327MB vs 38MB; ratio to raw data size =
>> 8.5870928
>>   [junit]
>>   [junit] --------------------------------------------------
>>   [junit] Threads = 1 elements = 250000 (of size 256) partitions = 1
>>   [junit]   Duration = 492ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 25 ms for 2 collections
>>   [junit]   Approx allocation = 557MB vs 83MB; ratio to raw data size =
>> 6.638187636363637
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 256) partitions = 16
>>   [junit]   Duration = 432ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 26 ms for 2 collections
>>   [junit]   Approx allocation = 470MB vs 83MB; ratio to raw data size =
>> 5.603587909090909
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 256) partitions = 256
>>   [junit]   Duration = 430ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 38 ms for 3 collections
>>   [junit]   Approx allocation = 406MB vs 83MB; ratio to raw data size =
>> 4.845429545454546
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 256) partitions = 1024
>>   [junit]   Duration = 433ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 41 ms for 3 collections
>>   [junit]   Approx allocation = 355MB vs 83MB; ratio to raw data size =
>> 4.231936636363637
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 256) partitions = 1
>>   [junit]   Duration = 1154ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 163 ms for 25 collections
>>   [junit]   Approx allocation = 4344MB vs 83MB; ratio to raw data size =
>> 51.769667727272726
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 256) partitions = 16
>>   [junit]   Duration = 279ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 61 ms for 4 collections
>>   [junit]   Approx allocation = 506MB vs 83MB; ratio to raw data size =
>> 6.037446818181818
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 256) partitions = 256
>>   [junit]   Duration = 6ms maxConcurrency = 11
>>   [junit]   GC for PS Scavenge: 36 ms for 3 collections
>>   [junit]   Approx allocation = 402MB vs 83MB; ratio to raw data size =
>> 4.795717636363636
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 256) partitions = 1024
>>   [junit]   Duration = 5ms maxConcurrency = 38
>>   [junit]   GC for PS Scavenge: 46 ms for 3 collections
>>   [junit]   Approx allocation = 362MB vs 83MB; ratio to raw data size =
>> 4.3233903636363635
>>   [junit]
>>   [junit] --------------------------------------------------
>>   [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 1
>>   [junit]   Duration = 595ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 109 ms for 6 collections
>>   [junit]   Approx allocation = 766MB vs 267MB; ratio to raw data size =
>> 2.868619457142857
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 16
>>   [junit]   Duration = 503ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 66 ms for 5 collections
>>   [junit]   Approx allocation = 676MB vs 267MB; ratio to raw data size =
>> 2.531949057142857
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 256
>>   [junit]   Duration = 498ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 78 ms for 5 collections
>>   [junit]   Approx allocation = 583MB vs 267MB; ratio to raw data size =
>> 2.1853203714285714
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 1024
>>   [junit]   Duration = 473ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 86 ms for 4 collections
>>   [junit]   Approx allocation = 537MB vs 267MB; ratio to raw data size =
>> 2.012229685714286
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 1
>>   [junit]   Duration = 1338ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 197 ms for 30 collections
>>   [junit]   Approx allocation = 3883MB vs 267MB; ratio to raw data size =
>> 14.5419178
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 16
>>   [junit]   Duration = 278ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 128 ms for 6 collections
>>   [junit]   Approx allocation = 752MB vs 267MB; ratio to raw data size =
>> 2.817227914285714
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 256
>>   [junit]   Duration = 80ms maxConcurrency = 42
>>   [junit]   GC for PS Scavenge: 99 ms for 5 collections
>>   [junit]   Approx allocation = 596MB vs 267MB; ratio to raw data size =
>> 2.2341639714285715
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 1024
>>   [junit]   Duration = 4ms maxConcurrency = 26
>>   [junit]   GC for PS Scavenge: 76 ms for 4 collections
>>   [junit]   Approx allocation = 547MB vs 267MB; ratio to raw data size =
>> 2.0490342285714287
>>   [junit]
>>   [junit] --------------------------------------------------
>>   [junit] ==================================================
>> 
>> Full results from a 16 core (32 thread) machine
>> 
>>   [junit] Threads = 1 elements = 250000 (of size 64) partitions = 1
>>   [junit]   Duration = 1431ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 52 ms for 4 collections
>>   [junit]   Approx allocation = 572MB vs 38MB; ratio to raw data size =
>> 15.0195638
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 64) partitions = 16
>>   [junit]   Duration = 1216ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 46 ms for 3 collections
>>   [junit]   Approx allocation = 491MB vs 38MB; ratio to raw data size =
>> 12.8950428
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 64) partitions = 256
>>   [junit]   Duration = 899ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 22 ms for 2 collections
>>   [junit]   Approx allocation = 458MB vs 38MB; ratio to raw data size =
>> 12.0219362
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 64) partitions = 1024
>>   [junit]   Duration = 546ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 10 ms for 1 collections
>>   [junit]   Approx allocation = 309MB vs 38MB; ratio to raw data size =
>> 8.1213414
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1
>>   [junit]   Duration = 3922ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 101 ms for 28 collections
>>   [junit]   Approx allocation = 8691MB vs 38MB; ratio to raw data size =
>> 227.833744
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 64) partitions = 16
>>   [junit]   Duration = 255ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 18 ms for 2 collections
>>   [junit]   Approx allocation = 654MB vs 38MB; ratio to raw data size =
>> 17.1485322
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 64) partitions = 256
>>   [junit]   Duration = 225ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 13 ms for 1 collections
>>   [junit]   Approx allocation = 368MB vs 38MB; ratio to raw data size =
>> 9.6528628
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1024
>>   [junit]   Duration = 399ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 21 ms for 1 collections
>>   [junit]   Approx allocation = 320MB vs 38MB; ratio to raw data size =
>> 8.4010366
>>   [junit]
>>   [junit] --------------------------------------------------
>>   [junit] Threads = 1 elements = 250000 (of size 256) partitions = 1
>>   [junit]   Duration = 1350ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 31 ms for 2 collections
>>   [junit]   Approx allocation = 462MB vs 83MB; ratio to raw data size =
>> 5.509877818181818
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 256) partitions = 16
>>   [junit]   Duration = 861ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 28 ms for 2 collections
>>   [junit]   Approx allocation = 491MB vs 83MB; ratio to raw data size =
>> 5.854580272727273
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 256) partitions = 256
>>   [junit]   Duration = 445ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 10 ms for 1 collections
>>   [junit]   Approx allocation = 399MB vs 83MB; ratio to raw data size =
>> 4.764038727272728
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 256) partitions = 1024
>>   [junit]   Duration = 403ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 10 ms for 1 collections
>>   [junit]   Approx allocation = 356MB vs 83MB; ratio to raw data size =
>> 4.242789363636364
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 256) partitions = 1
>>   [junit]   Duration = 3530ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 104 ms for 30 collections
>>   [junit]   Approx allocation = 9261MB vs 83MB; ratio to raw data size =
>> 110.35432245454545
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 256) partitions = 16
>>   [junit]   Duration = 231ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 17 ms for 2 collections
>>   [junit]   Approx allocation = 722MB vs 83MB; ratio to raw data size =
>> 8.607574
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 256) partitions = 256
>>   [junit]   Duration = 205ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 11 ms for 1 collections
>>   [junit]   Approx allocation = 411MB vs 83MB; ratio to raw data size =
>> 4.901291545454545
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 256) partitions = 1024
>>   [junit]   Duration = 343ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 14 ms for 1 collections
>>   [junit]   Approx allocation = 362MB vs 83MB; ratio to raw data size =
>> 4.321081454545454
>>   [junit]
>>   [junit] --------------------------------------------------
>>   [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 1
>>   [junit]   Duration = 1039ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 63 ms for 5 collections
>>   [junit]   Approx allocation = 597MB vs 267MB; ratio to raw data size =
>> 2.238462485714286
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 16
>>   [junit]   Duration = 1285ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 50 ms for 4 collections
>>   [junit]   Approx allocation = 687MB vs 267MB; ratio to raw data size =
>> 2.5740599428571427
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 256
>>   [junit]   Duration = 871ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 50 ms for 4 collections
>>   [junit]   Approx allocation = 576MB vs 267MB; ratio to raw data size =
>> 2.158096
>>   [junit]
>>   [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 1024
>>   [junit]   Duration = 624ms maxConcurrency = 1
>>   [junit]   GC for PS Scavenge: 47 ms for 4 collections
>>   [junit]   Approx allocation = 531MB vs 267MB; ratio to raw data size =
>> 1.9889662
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 1
>>   [junit]   Duration = 3204ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 121 ms for 35 collections
>>   [junit]   Approx allocation = 10597MB vs 267MB; ratio to raw data size =
>> 39.68819474285714
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 16
>>   [junit]   Duration = 265ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 30 ms for 3 collections
>>   [junit]   Approx allocation = 508MB vs 267MB; ratio to raw data size =
>> 1.9032105142857143
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 256
>>   [junit]   Duration = 323ms maxConcurrency = 100
>>   [junit]   GC for PS Scavenge: 46 ms for 4 collections
>>   [junit]   Approx allocation = 582MB vs 267MB; ratio to raw data size =
>> 2.180992342857143
>>   [junit]
>>   [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 1024
>>   [junit]   Duration = 434ms maxConcurrency = 99
>>   [junit]   GC for PS Scavenge: 66 ms for 4 collections
>>   [junit]   Approx allocation = 544MB vs 267MB; ratio to raw data size =
>> 2.039445542857143
>>   [junit]
>>   [junit] --------------------------------------------------
>>   [junit] ==================================================
>> 
>> On Jul 12, 2014, at 9:48 PM, graham sanderson <graham@vast.com> wrote:
>> 
>> Running on 2.0.5 we’ve noticed occasional MASSIVE spikes in memory
>> allocation that happen under write load, but way out of proportion to the
>> amount of data actually being written.
>> 
>> My suspicion is that the problem is related to hinted handoff, and
>> basically goes like this: some sort of non-trivial GC pause on one node
>> (probably caused by promotion failure of a memtable due to slow/eventual
>> fragmentation of the tenured gen) causes other nodes to start writing
>> hints for that node, and the hinting turns out to be actually much worse
>> memory-wise than the writes themselves.
>> 
>> Obviously there is the JIRA issue for 3.0 to stop storing hints in a table,
>> but in the meantime I thought I'd take a quick peek.
>> 
>> I’m in the process of writing a test to confirm what I think is happening,
>> but I thought I’d throw this out there anyway for discussion in the context
>> of 2.1 changes.
>> 
>> ====
>> 
>> In this case, we are adding a single column (of small delta data) to a lot
>> of partitions at once (let's say a hundred thousand partitions for each
>> node after replication) pretty fast, with not a lot of data for each -
>> probably a few hundred bytes max… this really puts very little strain on
>> things at all.
>> However, if we are hinting, then ALL these updates for one target node end
>> up
>> going to the same partition in system.hints. My assumption is that this,
>> combined with SnapTree, is a huge problem. Whilst the SnapTree clone itself
>> may be quick, we are essentially appending each time, so there is probably
>> a
>> huge amount of rebalancing going on, along with the lazy copy-on-write and
>> the inherent concurrent race waste. I am guesstimating that we are actually
>> allocating several orders of magnitude more memory than the actual regular
>> inserts do. The fact that our H/W can handle a very, very high concurrent
>> write volume probably makes the problem much worse.
>> 
>> So my first thought was: fine, why not just change hinted handoff to shard
>> (an extra partition key column along with the existing node uuid) based on,
>> say, a fixed-length 8-bit hash of the original row mutation's partition
>> key? This would spread the load some and probably solve the problem, but
>> would require a schema change to system.hints, which is annoying. That
>> said, appending (sorted) data concurrently to the same partition is the
>> sort of thing that many people might do, though probably not as fast as in
>> this hinting case, so it probably needs to be better anyway.
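>> 
>> (The proposed sharding could be sketched as below - all names are
>> illustrative; the actual system.hints schema and hash choice are not
>> specified in the thread.)

```java
import java.nio.ByteBuffer;

// Hedged sketch of hint sharding: fold the original mutation's partition key
// into a fixed 8-bit shard, giving at most 256 hint partitions per target
// node instead of one hot partition. The shard would be stored as an extra
// partition key column alongside the existing node uuid (hypothetical schema).
public class HintShard {
    public static int shardOf(ByteBuffer mutationKey) {
        // duplicate() so we don't disturb the buffer's position;
        // ByteBuffer.hashCode() is content-based, so the shard is stable
        int h = mutationKey.duplicate().hashCode();
        return (h ^ (h >>> 16)) & 0xFF; // fold to 8 bits -> 0..255
    }
}
```

Writers would then insert into partition (target_node_uuid, shard), and hint delivery would merge-iterate the up-to-256 shards if global timeuuid order must be preserved.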
>> 
>> So searching, I came across
>> https://issues.apache.org/jira/browse/CASSANDRA-6271 for 2.1, which
>> replaces
>> SnapTree… so my final question is (and I will try to test this on 2.1
>> soon): do we expect that this change will make the hint problem
>> moot anyway, i.e. heavy concurrent writes to the same partition key are AOK,
>> or does it make sense to shard hints until 3.x makes everything groovy?
>> 
>> Thanks,
>> 
>> Graham.
>> 
>> 
>> 
>> 
>> 
>> 
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder, http://www.datastax.com
>> @spyced
>> 
>> 
>> 

