cassandra-dev mailing list archives

From Graham Sanderson <gra...@vast.com>
Subject Re: Hinted Handoff/Atomic Updates & SnapTree replacement with BTree
Date Mon, 14 Jul 2014 20:47:32 GMT
Cool, thanks guys - based on Jonathan's response I'll just file the one issue, with my 2.0.x patch to minimize CAS failures

Sent from my iPhone

On Jul 14, 2014, at 2:18 PM, Benedict Elliott Smith <belliottsmith@datastax.com> wrote:

>> 
>> I thought about this, but assumed the caller thread had to guarantee that
>> the work had been completed
> 
> 
> It does. This is why the readers would be required to repair any unmerged
> state (potentially performing the merge themselves if the lock is now
> available).
> 
> but I wasn’t sure we could rely on having it;
> 
> 
> If it's missing and we no-op instead we simply revert to the current
> behaviour, which seems acceptable to me.
> 
>  you need a try/finally, which I don’t know the exact costs of at the JVM
>> level
> 
> 
> This is definitely not something to worry about given the other costs
> incurred for these actions.
> 
> I’ll open the issue and post my patch ... where we can critique it/propose
>> alternatives.
> 
> 
> Sounds good
> 
> 
> 
> 
>> On Mon, Jul 14, 2014 at 6:22 PM, graham sanderson <graham@vast.com> wrote:
>> 
>> Thanks, I will open two JIRA issues for discussion
>> 
>> 1) Partitioning of hints - this makes sense anyway (there can be LOTS of
>> them in the same partition, and either sort of tree has increasing
>> per-element (rebalancing?) overhead as the number of elements grows… this
>> is also the only use case I know of that causes really serious contention
>> in the memtable.put/resolve code path, though clearly it could happen to
>> some degree with any similar table). Whilst the code for this hint change
>> is pretty straightforward (only mildly more complicated if we want to
>> continue to deliver hints in globally sorted timestamp UUID order for the
>> node), I’m not sure what hoops we need to jump through to make the upgrade
>> painless (given the system.hints schema change)
>> 2) The contention itself - maybe for now we can just record contention for
>> StatusLogger
>>> This discussion is probably better had on JIRA, but I favour an
>>> asynchronous approach permitting only one modifying thread access to the
>>> structure at a time, with each competing modification simply chaining
>> their
>>> to-be-merged state to a pending-list, which is repaired at read time if
>> it
>>> isn't merged before then. i.e. each writer attempts to take an exclusive
>>> lock on modification, and if it succeeds it merges its own state and any
>>> other pending state it sees on entry; if it fails to take the lock it
>>> appends the unmerged modification to a simple list, and returns success.
>> 
>> I thought about this, but assumed the caller thread had to guarantee that
>> the work had been completed
>> 
>>> I plan to address this, but probably not until after 3.0, so if you're
>> in a
>>> rush it may be worth exploring one of these alternative approaches. The
>>> simplest, least modifying solution, is probably to use unsafe to acquire
>>> the object monitor if we fail the first cas (but continue to execute the
>>> same codepath), so that we simply gate the amount of contention we can
>>> experience without incurring any extra costs in the common case.
>> 
>> Yeah, I had thought of using Unsafe to try to acquire the object monitor…
>> but I wasn’t sure we could rely on having it; whatever you do, once you
>> acquire something you need a try/finally, whose exact cost at the JVM
>> level I don’t know. I’ll open the issue and post my patch - which adds no
>> cost on the (recently) uncontended path beyond a read of a new (so tiny
>> cost) non-volatile object reference field inside the Holder, which you
>> have just read via AtomicReference.get() anyway - where we can critique
>> it/propose alternatives.
>> 
>> On Jul 14, 2014, at 6:52 AM, Benedict Elliott Smith <
>> belliottsmith@datastax.com> wrote:
>> 
>>> This discussion is probably better had on JIRA, but I favour an
>>> asynchronous approach permitting only one modifying thread access to the
>>> structure at a time, with each competing modification simply chaining
>> their
>>> to-be-merged state to a pending-list, which is repaired at read time if
>> it
>>> isn't merged before then. i.e. each writer attempts to take an exclusive
>>> lock on modification, and if it succeeds it merges its own state and any
>>> other pending state it sees on entry; if it fails to take the lock it
>>> appends the unmerged modification to a simple list, and returns success.
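
A minimal sketch of the asynchronous approach described above (the class and method names are invented; this is not Cassandra code): writers queue their update and whichever thread wins the merge lock folds in everything queued so far, while readers repair unmerged state if they find the lock free.

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.atomic.AtomicBoolean;

    class PendingMergeSketch<T>
    {
        private final AtomicBoolean mergeLock = new AtomicBoolean();
        private final Queue<T> pending = new ConcurrentLinkedQueue<T>();
        private volatile Object merged;      // stands in for the real merged structure

        void add(T update)
        {
            // publish the modification first, then try to merge; if we lose the
            // race we still return success, and the update stays on the pending
            // list for the current lock holder or a later reader to fold in
            pending.add(update);
            tryMerge();
        }

        Object read()
        {
            // readers repair unmerged state when the lock is free; a real reader
            // would also overlay any still-pending updates onto the view it returns
            tryMerge();
            return merged;
        }

        private void tryMerge()
        {
            if (pending.isEmpty() || !mergeLock.compareAndSet(false, true))
                return;                      // nothing to do, or someone else is merging
            try
            {
                T next;
                while ((next = pending.poll()) != null)
                    merged = applyUpdate(merged, next);
            }
            finally
            {
                mergeLock.set(false);
            }
        }

        private Object applyUpdate(Object state, T update)
        {
            return update;                   // placeholder for the real merge of state + update
        }
    }
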
>>> 
>>> I plan to address this, but probably not until after 3.0, so if you're
>> in a
>>> rush it may be worth exploring one of these alternative approaches. The
>>> simplest, least modifying solution, is probably to use unsafe to acquire
>>> the object monitor if we fail the first cas (but continue to execute the
>>> same codepath), so that we simply gate the amount of contention we can
>>> experience without incurring any extra costs in the common case.
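
A bare-bones sketch of that fallback, using a plain synchronized block where the mail suggests acquiring the object monitor via Unsafe (names invented): the first CAS attempt is unchanged, and only threads that lose it serialize on the monitor, so at most one thread spins per instance at a time.

    import java.util.concurrent.atomic.AtomicReference;

    class MonitorGatedCas<S>
    {
        interface Update<V> { V apply(V current); }

        private final AtomicReference<S> ref = new AtomicReference<S>();

        void update(Update<S> update)
        {
            // common, uncontended case: one optimistic CAS, no locking at all
            S current = ref.get();
            if (ref.compareAndSet(current, update.apply(current)))
                return;

            // contended case: losers queue on the monitor, gating how many
            // threads can be spinning (and allocating) against this instance
            synchronized (this)
            {
                do
                {
                    current = ref.get();
                }
                while (!ref.compareAndSet(current, update.apply(current)));
            }
        }
    }
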
>>> 
>>> 
>>>> On Mon, Jul 14, 2014 at 8:10 AM, graham sanderson <graham@vast.com>
>>> wrote:
>>> 
>>>> Thanks - yeah, I figured we’d want to use the original RowMutation
>>>> partition key as part of the hint partition key, so that at least hints
>>>> for the same row arrive in the right order (it isn’t actually clear how
>>>> much that matters).
>>>> 
>>>> All that said, I delved a little deeper into
>>>> AtomicSortedColumns.addAllWithSizeDelta, and got some new numbers (note
>> I
>>>> updated my test to actually call this method rather than simulate it -
>>>> since I wanted to tweak it - and even though I am still skipping some
>>>> copying of byte buffers, the memory overhead for the hinting is
>>>> significant).
>>>> 
>>>> The major wildcard problem however is that addAllWithSizeDelta is
>>>> effectively a spin lock with a fair bit of memory allocation in it…
>>>> 
>>>> It seems to me that an alternative patch would be to extend the
>>>> “Holder” CAS state machine to approximately track concurrency, and fall
>>>> back to a monitor under contention… I made such a patch on 2.0.x (I’ll
>>>> clean it up some and open an issue), and the results look pretty good:
>>>> 
>>>> This is the same sort of data as before, except for every thread_count,
>>>> element_count, element_size, partition_count tuple, I run with the
>> original
>>>> version of addAllWithSizeDelta followed by the modified version.
>>>> 
>>>> Note the raw data size is now definitely a bit too small (just the size
>> of
>>>> the UUID & the byte[] value), however relative numbers are interesting.
>>>> 
>>>> There are three levels of code:
>>>> 
>>>> “fast” - the original spin loop… thread-coordination-wise it just reads
>>>> an AtomicReference, copies and updates into a new state, and attempts to
>>>> CAS it back
>>>> “medium” - the same as fast, but inside a try/finally so that it can
>>>> track concurrency in an atomic integer instance (not available in the
>>>> fast path)
>>>> “slow” - the same as medium, but protected by the AtomicSortedColumns
>>>> instance’s monitor
>>>> 
>>>> The a/b number pairs below for these levels are (a) the number of method
>>>> invocations that included that level, and (b) the number of loop
>>>> iterations in that level… If you look closely you’ll see the mix of
>>>> levels changing based on how much contention there is (note up/down are
>>>> approximate counts of such state changes for an individual
>>>> AtomicSortedColumns instance, aka partition, in my test).
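
A rough, hypothetical rendering of those three levels - this is not the actual patch: the threshold, the field names, and keeping the contention flag on the wrapper rather than inside the Holder are all invented for illustration.

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicReference;

    class LeveledCasSketch<S>
    {
        interface Update<V> { V apply(V current); }

        private static final int CONTENTION_THRESHOLD = 4;        // invented tuning knob

        private final AtomicReference<S> ref = new AtomicReference<S>();
        private final AtomicInteger writers = new AtomicInteger(); // untouched on the fast path
        private volatile boolean contended;                        // flips on "up"/"down" transitions

        void apply(Update<S> update)
        {
            if (!contended)
            {
                // "fast": the original loop shape - one optimistic read/copy/CAS attempt
                S current = ref.get();
                if (ref.compareAndSet(current, update.apply(current)))
                    return;
            }

            writers.incrementAndGet();       // "medium"/"slow" track approximate concurrency
            try
            {
                if (writers.get() >= CONTENTION_THRESHOLD)
                    contended = true;        // "up": start funnelling writers through the monitor

                if (contended)
                {
                    synchronized (this)      // "slow": the same loop, under the instance monitor
                    {
                        casLoop(update);
                    }
                }
                else
                {
                    casLoop(update);         // "medium": the same loop, just counted
                }
            }
            finally
            {
                if (writers.decrementAndGet() == 0)
                    contended = false;       // "down": revert to the uncounted fast path
            }
        }

        private void casLoop(Update<S> update)
        {
            S current;
            do { current = ref.get(); }
            while (!ref.compareAndSet(current, update.apply(current)));
        }
    }
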
>>>> 
>>>> Here is the worst case scenario:
>>>> 
>>>>   [junit] Threads = 100 elements = 100000 (of size 64) partitions = 1
>>>>   [junit]   Duration = 1323ms maxConcurrency = 100
>>>>   [junit]   GC for PS Scavenge: 94 ms for 28 collections
>>>>   [junit]   Approx allocation = 9183MB vs 7MB; ratio to raw data size =
>>>> 1203.740583
>>>>   [junit]   fast 100000/1542949 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 1521ms maxConcurrency = 100
>>>>   [junit]   GC for PS Scavenge: 15 ms for 1 collections
>>>>   [junit]   Approx allocation = 550MB vs 7MB; ratio to raw data size =
>>>> 72.21061
>>>>   [junit]   fast 220/233 medium 3/3 slow 99779/99780 up 1 down 1
>>>> 
>>>> Note that the original code did approximately 15 loop iterations (spins)
>>>> per call - and ended up allocating about 9gig for 100000 64 byte values
>>>> 
>>>> The new code pretty much reverted to an object monitor in this case, and
>>>> only allocated about 550meg.
>>>> 
>>>> (In this case the new code was slower, but this generally seems to be
>>>> pretty much within the margin of error - I need a longer test to
>>>> determine that - and there is really no reason it should be slower, other
>>>> than the debug counters I have in there.)
>>>> 
>>>> Here’s another one picked at random with a mix of different levels (note
>>>> that the original code does a bunch of memory-allocating spins - 190622
>>>> loop iterations for 100000 inserts)
>>>> 
>>>>   [junit] Threads = 100 elements = 100000 (of size 256) partitions = 16
>>>>   [junit]   Duration = 216ms maxConcurrency = 99
>>>>   [junit]   GC for PS Scavenge: 25 ms for 2 collections
>>>>   [junit]   Approx allocation = 772MB vs 25MB; ratio to raw data size =
>>>> 29.778822352941177
>>>>   [junit]   fast 100000/190622 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 215ms maxConcurrency = 99
>>>>   [junit]   GC for PS Scavenge: 27 ms for 2 collections
>>>>   [junit]   Approx allocation = 566MB vs 25MB; ratio to raw data size =
>>>> 21.856579705882353
>>>>   [junit]   fast 50873/67546 medium 18283/18919 slow 40230/47257 up
>>>> 10671 down 10339
>>>> 
>>>> Here are the full results
>>>> 
>>>>   [junit] Threads = 1 elements = 100000 (of size 64) partitions = 1
>>>>   [junit]   Duration = 715ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 25 ms for 3 collections
>>>>   [junit]   Approx allocation = 554MB vs 7MB; ratio to raw data size =
>>>> 72.644926
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 364ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 32 ms for 3 collections
>>>>   [junit]   Approx allocation = 566MB vs 7MB; ratio to raw data size =
>>>> 74.21657
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit] Threads = 1 elements = 100000 (of size 64) partitions = 16
>>>>   [junit]   Duration = 296ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 18 ms for 2 collections
>>>>   [junit]   Approx allocation = 464MB vs 7MB; ratio to raw data size =
>>>> 60.853517
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 334ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 19 ms for 2 collections
>>>>   [junit]   Approx allocation = 442MB vs 7MB; ratio to raw data size =
>>>> 57.934483
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit] Threads = 1 elements = 100000 (of size 64) partitions = 256
>>>>   [junit]   Duration = 250ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 9 ms for 1 collections
>>>>   [junit]   Approx allocation = 329MB vs 7MB; ratio to raw data size =
>>>> 43.249159
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 238ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 9 ms for 1 collections
>>>>   [junit]   Approx allocation = 327MB vs 7MB; ratio to raw data size =
>>>> 42.943091
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit] Threads = 1 elements = 100000 (of size 64) partitions = 1024
>>>>   [junit]   Duration = 225ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 11 ms for 1 collections
>>>>   [junit]   Approx allocation = 276MB vs 7MB; ratio to raw data size =
>>>> 36.303192
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 231ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 11 ms for 1 collections
>>>>   [junit]   Approx allocation = 274MB vs 7MB; ratio to raw data size =
>>>> 36.043359
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit] Threads = 100 elements = 100000 (of size 64) partitions = 1
>>>>   [junit]   Duration = 1323ms maxConcurrency = 100
>>>>   [junit]   GC for PS Scavenge: 94 ms for 28 collections
>>>>   [junit]   Approx allocation = 9183MB vs 7MB; ratio to raw data size =
>>>> 1203.740583
>>>>   [junit]   fast 100000/1542949 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 1521ms maxConcurrency = 100
>>>>   [junit]   GC for PS Scavenge: 15 ms for 1 collections
>>>>   [junit]   Approx allocation = 550MB vs 7MB; ratio to raw data size =
>>>> 72.21061
>>>>   [junit]   fast 220/233 medium 3/3 slow 99779/99780 up 1 down 1
>>>>   [junit]
>>>>   [junit] Threads = 100 elements = 100000 (of size 64) partitions = 16
>>>>   [junit]   Duration = 218ms maxConcurrency = 100
>>>>   [junit]   GC for PS Scavenge: 24 ms for 2 collections
>>>>   [junit]   Approx allocation = 739MB vs 7MB; ratio to raw data size =
>>>> 96.934667
>>>>   [junit]   fast 100000/182402 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 195ms maxConcurrency = 99
>>>>   [junit]   GC for PS Scavenge: 9 ms for 1 collections
>>>>   [junit]   Approx allocation = 567MB vs 7MB; ratio to raw data size =
>>>> 74.399013
>>>>   [junit]   fast 50466/67007 medium 18078/18750 slow 40781/47692 up
>>>> 10574 down 10249
>>>>   [junit]
>>>>   [junit] Threads = 100 elements = 100000 (of size 64) partitions = 256
>>>>   [junit]   Duration = 170ms maxConcurrency = 98
>>>>   [junit]   GC for PS Scavenge: 14 ms for 1 collections
>>>>   [junit]   Approx allocation = 341MB vs 7MB; ratio to raw data size =
>>>> 44.826425
>>>>   [junit]   fast 100000/102082 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 174ms maxConcurrency = 100
>>>>   [junit]   GC for PS Scavenge: 13 ms for 1 collections
>>>>   [junit]   Approx allocation = 344MB vs 7MB; ratio to raw data size =
>>>> 45.096833
>>>>   [junit]   fast 98009/100008 medium 2003/2006 slow 51/94 up 1964 down
>>>> 1958
>>>>   [junit]
>>>>   [junit] Threads = 100 elements = 100000 (of size 64) partitions =
>> 1024
>>>>   [junit]   Duration = 168ms maxConcurrency = 95
>>>>   [junit]   GC for PS Scavenge: 14 ms for 1 collections
>>>>   [junit]   Approx allocation = 283MB vs 7MB; ratio to raw data size =
>>>> 37.095636
>>>>   [junit]   fast 100000/100427 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 177ms maxConcurrency = 98
>>>>   [junit]   GC for PS Scavenge: 13 ms for 1 collections
>>>>   [junit]   Approx allocation = 292MB vs 7MB; ratio to raw data size =
>>>> 38.384233
>>>>   [junit]   fast 99596/100001 medium 404/404 slow 0/0 up 405 down 404
>>>>   [junit]
>>>>   [junit] --------------------------------------------------
>>>>   [junit] Threads = 1 elements = 100000 (of size 256) partitions = 1
>>>>   [junit]   Duration = 793ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 23 ms for 2 collections
>>>>   [junit]   Approx allocation = 558MB vs 25MB; ratio to raw data size =
>>>> 21.542246176470588
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 351ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 22 ms for 2 collections
>>>>   [junit]   Approx allocation = 566MB vs 25MB; ratio to raw data size =
>>>> 21.824048823529413
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit] Threads = 1 elements = 100000 (of size 256) partitions = 16
>>>>   [junit]   Duration = 316ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 8 ms for 1 collections
>>>>   [junit]   Approx allocation = 456MB vs 25MB; ratio to raw data size =
>>>> 17.608017352941175
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 789ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 12 ms for 1 collections
>>>>   [junit]   Approx allocation = 462MB vs 25MB; ratio to raw data size =
>>>> 17.84756294117647
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit] Threads = 1 elements = 100000 (of size 256) partitions = 256
>>>>   [junit]   Duration = 612ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 13 ms for 1 collections
>>>>   [junit]   Approx allocation = 349MB vs 25MB; ratio to raw data size =
>>>> 13.467229411764706
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 495ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 13 ms for 1 collections
>>>>   [junit]   Approx allocation = 351MB vs 25MB; ratio to raw data size =
>>>> 13.550963235294118
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit] Threads = 1 elements = 100000 (of size 256) partitions = 1024
>>>>   [junit]   Duration = 422ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 13 ms for 1 collections
>>>>   [junit]   Approx allocation = 292MB vs 25MB; ratio to raw data size =
>>>> 11.267923823529411
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 405ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 12 ms for 1 collections
>>>>   [junit]   Approx allocation = 294MB vs 25MB; ratio to raw data size =
>>>> 11.371588235294118
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit] Threads = 100 elements = 100000 (of size 256) partitions = 1
>>>>   [junit]   Duration = 2197ms maxConcurrency = 100
>>>>   [junit]   GC for PS Scavenge: 122 ms for 32 collections
>>>>   [junit]   Approx allocation = 10317MB vs 25MB; ratio to raw data size
>>>> = 397.75221411764704
>>>>   [junit]   fast 100000/1777796 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 1253ms maxConcurrency = 100
>>>>   [junit]   GC for PS Scavenge: 17 ms for 1 collections
>>>>   [junit]   Approx allocation = 565MB vs 25MB; ratio to raw data size =
>>>> 21.804268823529412
>>>>   [junit]   fast 582/585 medium 3/3 slow 99417/99418 up 1 down 1
>>>>   [junit]
>>>>   [junit] Threads = 100 elements = 100000 (of size 256) partitions = 16
>>>>   [junit]   Duration = 216ms maxConcurrency = 99
>>>>   [junit]   GC for PS Scavenge: 25 ms for 2 collections
>>>>   [junit]   Approx allocation = 772MB vs 25MB; ratio to raw data size =
>>>> 29.778822352941177
>>>>   [junit]   fast 100000/190622 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 215ms maxConcurrency = 99
>>>>   [junit]   GC for PS Scavenge: 27 ms for 2 collections
>>>>   [junit]   Approx allocation = 566MB vs 25MB; ratio to raw data size =
>>>> 21.856579705882353
>>>>   [junit]   fast 50873/67546 medium 18283/18919 slow 40230/47257 up
>>>> 10671 down 10339
>>>>   [junit]
>>>>   [junit] Threads = 100 elements = 100000 (of size 256) partitions =
>> 256
>>>>   [junit]   Duration = 187ms maxConcurrency = 97
>>>>   [junit]   GC for PS Scavenge: 14 ms for 1 collections
>>>>   [junit]   Approx allocation = 345MB vs 25MB; ratio to raw data size =
>>>> 13.314321470588235
>>>>   [junit]   fast 100000/101946 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 183ms maxConcurrency = 98
>>>>   [junit]   GC for PS Scavenge: 12 ms for 1 collections
>>>>   [junit]   Approx allocation = 354MB vs 25MB; ratio to raw data size =
>>>> 13.662984705882353
>>>>   [junit]   fast 98108/100024 medium 1911/1914 slow 32/61 up 1887 down
>>>> 1882
>>>>   [junit]
>>>>   [junit] Threads = 100 elements = 100000 (of size 256) partitions =
>> 1024
>>>>   [junit]   Duration = 185ms maxConcurrency = 99
>>>>   [junit]   GC for PS Scavenge: 13 ms for 1 collections
>>>>   [junit]   Approx allocation = 293MB vs 25MB; ratio to raw data size =
>>>> 11.313215294117647
>>>>   [junit]   fast 100000/100399 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 176ms maxConcurrency = 94
>>>>   [junit]   GC for PS Scavenge: 14 ms for 1 collections
>>>>   [junit]   Approx allocation = 294MB vs 25MB; ratio to raw data size =
>>>> 11.350447647058823
>>>>   [junit]   fast 99573/100000 medium 426/426 slow 2/4 up 426 down 424
>>>>   [junit]
>>>>   [junit] --------------------------------------------------
>>>>   [junit] Threads = 1 elements = 100000 (of size 1024) partitions = 1
>>>>   [junit]   Duration = 993ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 29 ms for 2 collections
>>>>   [junit]   Approx allocation = 628MB vs 99MB; ratio to raw data size =
>>>> 6.339322461538462
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 797ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 40 ms for 3 collections
>>>>   [junit]   Approx allocation = 623MB vs 99MB; ratio to raw data size =
>>>> 6.285295538461538
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit] Threads = 1 elements = 100000 (of size 1024) partitions = 16
>>>>   [junit]   Duration = 886ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 29 ms for 2 collections
>>>>   [junit]   Approx allocation = 508MB vs 99MB; ratio to raw data size =
>>>> 5.130343307692308
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 820ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 28 ms for 2 collections
>>>>   [junit]   Approx allocation = 535MB vs 99MB; ratio to raw data size =
>>>> 5.398294076923077
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit] Threads = 1 elements = 100000 (of size 1024) partitions = 256
>>>>   [junit]   Duration = 594ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 28 ms for 2 collections
>>>>   [junit]   Approx allocation = 392MB vs 99MB; ratio to raw data size =
>>>> 3.9560997692307693
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 484ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 30 ms for 2 collections
>>>>   [junit]   Approx allocation = 410MB vs 99MB; ratio to raw data size =
>>>> 4.141943
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit] Threads = 1 elements = 100000 (of size 1024) partitions =
>> 1024
>>>>   [junit]   Duration = 401ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 20 ms for 2 collections
>>>>   [junit]   Approx allocation = 474MB vs 99MB; ratio to raw data size =
>>>> 4.779102
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 405ms maxConcurrency = 1
>>>>   [junit]   GC for PS Scavenge: 25 ms for 2 collections
>>>>   [junit]   Approx allocation = 367MB vs 99MB; ratio to raw data size =
>>>> 3.7006123846153844
>>>>   [junit]   fast 100000/100000 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit] Threads = 100 elements = 100000 (of size 1024) partitions = 1
>>>>   [junit]   Duration = 2383ms maxConcurrency = 100
>>>>   [junit]   GC for PS Scavenge: 144 ms for 37 collections
>>>>   [junit]   Approx allocation = 11728MB vs 99MB; ratio to raw data size
>>>> = 118.24829784615385
>>>>   [junit]   fast 100000/1781388 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 1298ms maxConcurrency = 100
>>>>   [junit]   GC for PS Scavenge: 54 ms for 2 collections
>>>>   [junit]   Approx allocation = 490MB vs 99MB; ratio to raw data size =
>>>> 4.945822
>>>>   [junit]   fast 312/314 medium 2/2 slow 99687/99688 up 1 down 1
>>>>   [junit]
>>>>   [junit] Threads = 100 elements = 100000 (of size 1024) partitions =
>> 16
>>>>   [junit]   Duration = 231ms maxConcurrency = 100
>>>>   [junit]   GC for PS Scavenge: 38 ms for 3 collections
>>>>   [junit]   Approx allocation = 856MB vs 99MB; ratio to raw data size =
>>>> 8.638853769230769
>>>>   [junit]   fast 100000/189676 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 237ms maxConcurrency = 100
>>>>   [junit]   GC for PS Scavenge: 40 ms for 3 collections
>>>>   [junit]   Approx allocation = 634MB vs 99MB; ratio to raw data size =
>>>> 6.397918692307693
>>>>   [junit]   fast 55642/73346 medium 19132/19867 slow 34655/41616 up
>>>> 11578 down 11266
>>>>   [junit]
>>>>   [junit] Threads = 100 elements = 100000 (of size 1024) partitions =
>> 256
>>>>   [junit]   Duration = 212ms maxConcurrency = 99
>>>>   [junit]   GC for PS Scavenge: 38 ms for 2 collections
>>>>   [junit]   Approx allocation = 411MB vs 99MB; ratio to raw data size =
>>>> 4.148154692307692
>>>>   [junit]   fast 100000/101956 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 209ms maxConcurrency = 99
>>>>   [junit]   GC for PS Scavenge: 29 ms for 2 collections
>>>>   [junit]   Approx allocation = 408MB vs 99MB; ratio to raw data size =
>>>> 4.121041076923077
>>>>   [junit]   fast 98133/100004 medium 1876/1879 slow 39/73 up 1845 down
>>>> 1842
>>>>   [junit]
>>>>   [junit] Threads = 100 elements = 100000 (of size 1024) partitions =
>>>> 1024
>>>>   [junit]   Duration = 203ms maxConcurrency = 97
>>>>   [junit]   GC for PS Scavenge: 30 ms for 2 collections
>>>>   [junit]   Approx allocation = 368MB vs 99MB; ratio to raw data size =
>>>> 3.7190386923076924
>>>>   [junit]   fast 100000/100417 medium 0/0 slow 0/0 up 0 down 0
>>>>   [junit]
>>>>   [junit]  -
>>>>   [junit]   Duration = 201ms maxConcurrency = 100
>>>>   [junit]   GC for PS Scavenge: 31 ms for 2 collections
>>>>   [junit]   Approx allocation = 360MB vs 99MB; ratio to raw data size =
>>>> 3.632088230769231
>>>>   [junit]   fast 99611/100000 medium 389/389 slow 1/2 up 389 down 388
>>>>   [junit]
>>>> 
>>>> On Jul 13, 2014, at 10:16 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
>>>> 
>>>> Thanks for the analysis.  It should be pretty straightforward to
>>>> assign hints to one of 10 random partitions, then handoff can just
>>>> scan for each partition key.  Small penalty on handoff to mitigate
>>>> your problem.
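
A sketch of what that could look like - purely illustrative: the shard count is taken from the mail, but the names and the query are not the real system.hints schema or the upgrade path it would need.

    import java.util.UUID;
    import java.util.concurrent.ThreadLocalRandom;

    class HintShardingSketch
    {
        static final int HINT_SHARDS = 10;   // "one of 10 random partitions"

        // On write: spread concurrent hints for one target node across
        // HINT_SHARDS partitions instead of a single hot one.
        static Object[] hintPartitionKey(UUID targetNodeId)
        {
            int shard = ThreadLocalRandom.current().nextInt(HINT_SHARDS);
            return new Object[] { targetNodeId, shard };
        }

        // On handoff (the "small penalty"): scan every shard for the target node,
        // e.g. something like
        //   SELECT ... FROM system.hints WHERE target_id = ? AND shard = ?
        // for shard = 0 .. HINT_SHARDS - 1, merging results by hint timeuuid.
    }
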
>>>> 
>>>> On Sun, Jul 13, 2014 at 4:25 PM, graham sanderson <graham@vast.com>
>> wrote:
>>>> 
>>>> Here are some numbers from latest 2.1 (on the production level h/w)
>>>> 
>>>> BTree fares better than SnapTreeMap did (note also that I allocated my
>>>> cells on heap; however, all of the extra allocation under race conditions
>>>> is actually the Object[], so it is going to be on heap anyway):
>>>> - Generally the data structure overhead is much smaller
>>>> - The contention cost drops off much quicker with number of partitions
>>>> (i.e.
>>>> 16 is actually pretty fine, though if you’re going with >1 then 256 is
>> not
>>>> much harder than 16 unless you want to use a single IN clause).
>>>> 
>>>> However it still does suffer when a huge amount of almost sorted data is
>>>> inserted very fast as in the hints case (note I tried with random
>>>> timestamps
>>>> out of curiosity, and it was somewhat better as expected, but not
>>>> significantly).
>>>> 
>>>> So the moral of the story seems to be that we probably shouldn’t be
>>>> writing hints for a node to a single partition… (ironically, lots of
>>>> small inserts mean a faster write rate, hence more concurrent hinting,
>>>> and a mere 100MB of inserted data ends up allocating tens of gigs of RAM
>>>> on the node during the hinting process). Any thoughts appreciated, but
>>>> I’ll likely move this thread to a ticket shortly.
>>>> 
>>>> Note, finally, that with SnapTreeMap a lot of the extra, rapidly
>>>> allocated data was live long enough to spill into the old gen, causing
>>>> eventual promotion failure of memtables on the hinting nodes too; that
>>>> may not be the case with BTree on 2.1, and of course off-heap memtables
>>>> will help this issue as well.
>>>> 
>>>>  [junit] Threads = 1 elements = 250000 (of size 64) partitions = 1
>>>>  [junit]   Duration = 1129ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 36 ms for 2 collections
>>>>  [junit]   Approx allocation = 249MB vs 38MB; ratio to raw data size =
>>>> 6.5352668
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 64) partitions = 16
>>>>  [junit]   Duration = 349ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 11 ms for 1 collections
>>>>  [junit]   Approx allocation = 197MB vs 38MB; ratio to raw data size =
>>>> 5.1714
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 64) partitions = 256
>>>>  [junit]   Duration = 835ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 19 ms for 1 collections
>>>>  [junit]   Approx allocation = 147MB vs 38MB; ratio to raw data size =
>>>> 3.8699954
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 64) partitions = 1024
>>>>  [junit]   Duration = 635ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 23 ms for 1 collections
>>>>  [junit]   Approx allocation = 118MB vs 38MB; ratio to raw data size =
>>>> 3.1053228
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1
>>>>  [junit]   Duration = 2079ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 130 ms for 24 collections
>>>>  [junit]   Approx allocation = 6484MB vs 38MB; ratio to raw data size =
>>>> 169.9863816
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 64) partitions = 16
>>>>  [junit]   Duration = 363ms maxConcurrency = 100
>>>>  [junit]   Approx allocation = 238MB vs 38MB; ratio to raw data size =
>>>> 6.2529482
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 64) partitions = 256
>>>>  [junit]   Duration = 370ms maxConcurrency = 96
>>>>  [junit]   Approx allocation = 154MB vs 38MB; ratio to raw data size =
>>>> 4.055179
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1024
>>>>  [junit]   Duration = 353ms maxConcurrency = 100
>>>>  [junit]   Approx allocation = 124MB vs 38MB; ratio to raw data size =
>>>> 3.2727604
>>>>  [junit]
>>>>  [junit] --------------------------------------------------
>>>>  [junit] Threads = 1 elements = 250000 (of size 256) partitions = 1
>>>>  [junit]   Duration = 1055ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 19 ms for 1 collections
>>>>  [junit]   Approx allocation = 237MB vs 83MB; ratio to raw data size =
>>>> 2.831478363636364
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 256) partitions = 16
>>>>  [junit]   Duration = 667ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 16 ms for 1 collections
>>>>  [junit]   Approx allocation = 235MB vs 83MB; ratio to raw data size =
>>>> 2.802512909090909
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 256) partitions = 256
>>>>  [junit]   Duration = 606ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 19 ms for 1 collections
>>>>  [junit]   Approx allocation = 175MB vs 83MB; ratio to raw data size =
>>>> 2.091785727272727
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 256) partitions = 1024
>>>>  [junit]   Duration = 268ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 14 ms for 1 collections
>>>>  [junit]   Approx allocation = 149MB vs 83MB; ratio to raw data size =
>>>> 1.7858322727272726
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 256) partitions = 1
>>>>  [junit]   Duration = 2120ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 143 ms for 23 collections
>>>>  [junit]   Approx allocation = 6248MB vs 83MB; ratio to raw data size =
>>>> 74.454192
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 256) partitions = 16
>>>>  [junit]   Duration = 310ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 33 ms for 1 collections
>>>>  [junit]   Approx allocation = 236MB vs 83MB; ratio to raw data size =
>>>> 2.817435
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 256) partitions = 256
>>>>  [junit]   Duration = 383ms maxConcurrency = 99
>>>>  [junit]   Approx allocation = 199MB vs 83MB; ratio to raw data size =
>>>> 2.3777502727272726
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 256) partitions =
>> 1024
>>>>  [junit]   Duration = 387ms maxConcurrency = 98
>>>>  [junit]   GC for PS Scavenge: 25 ms for 1 collections
>>>>  [junit]   Approx allocation = 176MB vs 83MB; ratio to raw data size =
>>>> 2.1082142727272726
>>>>  [junit]
>>>>  [junit] --------------------------------------------------
>>>>  [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 1
>>>>  [junit]   Duration = 1230ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 64 ms for 3 collections
>>>>  [junit]   Approx allocation = 457MB vs 267MB; ratio to raw data size =
>>>> 1.711579
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 16
>>>>  [junit]   Duration = 898ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 58 ms for 3 collections
>>>>  [junit]   Approx allocation = 427MB vs 267MB; ratio to raw data size =
>>>> 1.6025206571428572
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 256
>>>>  [junit]   Duration = 944ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 72 ms for 3 collections
>>>>  [junit]   Approx allocation = 377MB vs 267MB; ratio to raw data size =
>>>> 1.4138783714285714
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 1024
>>>>  [junit]   Duration = 877ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 78 ms for 3 collections
>>>>  [junit]   Approx allocation = 347MB vs 267MB; ratio to raw data size =
>>>> 1.3008453714285715
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 1
>>>>  [junit]   Duration = 2073ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 166 ms for 28 collections
>>>>  [junit]   Approx allocation = 7052MB vs 267MB; ratio to raw data size
>> =
>>>> 26.410422285714287
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 16
>>>>  [junit]   Duration = 398ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 45 ms for 2 collections
>>>>  [junit]   Approx allocation = 375MB vs 267MB; ratio to raw data size =
>>>> 1.4073654
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 1024) partitions =
>> 256
>>>>  [junit]   Duration = 448ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 71 ms for 3 collections
>>>>  [junit]   Approx allocation = 369MB vs 267MB; ratio to raw data size =
>>>> 1.3841326
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 1024) partitions =
>> 1024
>>>>  [junit]   Duration = 448ms maxConcurrency = 99
>>>>  [junit]   GC for PS Scavenge: 77 ms for 3 collections
>>>>  [junit]   Approx allocation = 354MB vs 267MB; ratio to raw data size =
>>>> 1.3269438571428571
>>>>  [junit]
>>>>  [junit] --------------------------------------------------
>>>>  [junit] ==================================================
>>>> 
>>>> On Jul 13, 2014, at 11:34 AM, graham sanderson <graham@vast.com> wrote:
>>>> 
>>>> OK, some numbers from my test:
>>>> 
>>>> Basically the test mimics the 2.0.x write path for a hint:
>>>> - there are P partitions (each has an AtomicReference to a SnapTreeMap
>>>> sorted by a timeuuid comparator)
>>>> - there are N threads
>>>> - “elements” is the total number of timeuuid/value pairs (aka columns)
>>>> - each thread inserts the same number of elements (“elements”/N)
>>>> - inserting an element is done by getting the atomic reference, cloning
>>>> the SnapTreeMap, adding the element, and trying to CAS the updated
>>>> reference back
>>>> - the value of each element is a byte[] of the indicated size, allocated
>>>> by the current thread
>>>> 
>>>> The test counts approximately the amount of eden allocation… the ratio
>>>> shown is how much was allocated vs the rough size of the raw data to be
>>>> inserted.
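
One plausible way to obtain the "approx allocation" and GC numbers in the results below (an assumption about how the harness might work, not its actual code) is the HotSpot-specific per-thread allocation counter plus GC MXBean deltas.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    class AllocationProbe
    {
        // HotSpot-specific: cumulative bytes allocated (essentially eden/TLAB
        // allocation) by each thread since it started.
        private final com.sun.management.ThreadMXBean threads =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

        long allocatedBytes(long[] threadIds)
        {
            long total = 0;
            for (long id : threadIds)
                total += threads.getThreadAllocatedBytes(id);
            return total;
        }

        // Print per-collector deltas, in the shape of lines like
        // "GC for PS Scavenge: 94 ms for 28 collections".
        static void printGcDeltas(long[] countBefore, long[] timeBefore)
        {
            int i = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
            {
                long collections = gc.getCollectionCount() - countBefore[i];
                long millis = gc.getCollectionTime() - timeBefore[i++];
                if (collections > 0)
                    System.out.printf("GC for %s: %d ms for %d collections%n",
                                      gc.getName(), millis, collections);
            }
        }
    }
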
>>>> 
>>>> This test is a worst-case scenario insofar as there is no sleeping
>>>> (though on my laptop - as you can see from maxConcurrency, which is the
>>>> maximum number of threads inside the element loop at any point - I fail
>>>> to hit it very hard, whereas one of our production-class machines can
>>>> maintain 100-thread concurrency).
>>>> 
>>>> Full results are at the bottom (note I ran with 1 thread and 100 threads,
>>>> element sizes of 64, 256, and 1024 bytes, and 1, 16, 256, and 1024
>>>> partitions)
>>>> 
>>>> Here are some worst-case examples (obviously the ratio is higher even for
>>>> high partition counts here, because the size of even a single tree
>>>> structure becomes significant compared to the size of the data stored):
>>>> 
>>>> 
>>>>  [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1
>>>>  [junit]   Duration = 3922ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 101 ms for 28 collections
>>>>  [junit]   Approx allocation = 8691MB vs 38MB; ratio to raw data size =
>>>> 227.833744
>>>>  [junit]
>>>> 
>>>> vs
>>>> 
>>>>  [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1024
>>>>  [junit]   Duration = 399ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 21 ms for 1 collections
>>>>  [junit]   Approx allocation = 320MB vs 38MB; ratio to raw data size =
>>>> 8.4010366
>>>> 
>>>> Note, I’ll try the same test on 2.1 against the new write path.
>>>> 
>>>> =====
>>>> 
>>>> FYI insertion of one element
>>>> 
>>>>     // pick a random partition, then clone-modify-CAS its SnapTreeMap
>>>>     AtomicReference<SnapTreeMap<ByteBuffer, byte[]>> ref =
>>>>             maps.get(random.nextInt(partitionCount));
>>>>     SnapTreeMap<ByteBuffer, byte[]> current, modified;
>>>>     do
>>>>     {
>>>>         current = ref.get();
>>>>         modified = current.clone();       // cheap snapshot clone
>>>>         if (null != modified.putIfAbsent(key, value))
>>>>         {
>>>>             modified.replace(key, value); // key already present: overwrite
>>>>         }
>>>>     } while (!ref.compareAndSet(current, modified)); // retry if another thread won the CAS
>>>> 
>>>> 
>>>> Full results from my laptop
>>>> 
>>>>  [junit] Threads = 1 elements = 250000 (of size 64) partitions = 1
>>>>  [junit]   Duration = 499ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 29 ms for 4 collections
>>>>  [junit]   Approx allocation = 539MB vs 38MB; ratio to raw data size =
>>>> 14.1319278
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 64) partitions = 16
>>>>  [junit]   Duration = 445ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 26 ms for 3 collections
>>>>  [junit]   Approx allocation = 446MB vs 38MB; ratio to raw data size =
>>>> 11.7168844
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 64) partitions = 256
>>>>  [junit]   Duration = 395ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 20 ms for 2 collections
>>>>  [junit]   Approx allocation = 370MB vs 38MB; ratio to raw data size =
>>>> 9.7106606
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 64) partitions = 1024
>>>>  [junit]   Duration = 399ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 24 ms for 2 collections
>>>>  [junit]   Approx allocation = 311MB vs 38MB; ratio to raw data size =
>>>> 8.1768922
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1
>>>>  [junit]   Duration = 1143ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 148 ms for 18 collections
>>>>  [junit]   Approx allocation = 3853MB vs 38MB; ratio to raw data size =
>>>> 101.0072052
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 64) partitions = 16
>>>>  [junit]   Duration = 171ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 22 ms for 2 collections
>>>>  [junit]   Approx allocation = 605MB vs 38MB; ratio to raw data size =
>>>> 15.8683486
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 64) partitions = 256
>>>>  [junit]   Duration = 8ms maxConcurrency = 12
>>>>  [junit]   GC for PS Scavenge: 14 ms for 1 collections
>>>>  [junit]   Approx allocation = 370MB vs 38MB; ratio to raw data size =
>>>> 9.713281
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1024
>>>>  [junit]   Duration = 3ms maxConcurrency = 9
>>>>  [junit]   GC for PS Scavenge: 20 ms for 1 collections
>>>>  [junit]   Approx allocation = 327MB vs 38MB; ratio to raw data size =
>>>> 8.5870928
>>>>  [junit]
>>>>  [junit] --------------------------------------------------
>>>>  [junit] Threads = 1 elements = 250000 (of size 256) partitions = 1
>>>>  [junit]   Duration = 492ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 25 ms for 2 collections
>>>>  [junit]   Approx allocation = 557MB vs 83MB; ratio to raw data size =
>>>> 6.638187636363637
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 256) partitions = 16
>>>>  [junit]   Duration = 432ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 26 ms for 2 collections
>>>>  [junit]   Approx allocation = 470MB vs 83MB; ratio to raw data size =
>>>> 5.603587909090909
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 256) partitions = 256
>>>>  [junit]   Duration = 430ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 38 ms for 3 collections
>>>>  [junit]   Approx allocation = 406MB vs 83MB; ratio to raw data size =
>>>> 4.845429545454546
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 256) partitions = 1024
>>>>  [junit]   Duration = 433ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 41 ms for 3 collections
>>>>  [junit]   Approx allocation = 355MB vs 83MB; ratio to raw data size =
>>>> 4.231936636363637
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 256) partitions = 1
>>>>  [junit]   Duration = 1154ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 163 ms for 25 collections
>>>>  [junit]   Approx allocation = 4344MB vs 83MB; ratio to raw data size =
>>>> 51.769667727272726
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 256) partitions = 16
>>>>  [junit]   Duration = 279ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 61 ms for 4 collections
>>>>  [junit]   Approx allocation = 506MB vs 83MB; ratio to raw data size =
>>>> 6.037446818181818
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 256) partitions = 256
>>>>  [junit]   Duration = 6ms maxConcurrency = 11
>>>>  [junit]   GC for PS Scavenge: 36 ms for 3 collections
>>>>  [junit]   Approx allocation = 402MB vs 83MB; ratio to raw data size =
>>>> 4.795717636363636
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 256) partitions =
>> 1024
>>>>  [junit]   Duration = 5ms maxConcurrency = 38
>>>>  [junit]   GC for PS Scavenge: 46 ms for 3 collections
>>>>  [junit]   Approx allocation = 362MB vs 83MB; ratio to raw data size =
>>>> 4.3233903636363635
>>>>  [junit]
>>>>  [junit] --------------------------------------------------
>>>>  [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 1
>>>>  [junit]   Duration = 595ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 109 ms for 6 collections
>>>>  [junit]   Approx allocation = 766MB vs 267MB; ratio to raw data size =
>>>> 2.868619457142857
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 16
>>>>  [junit]   Duration = 503ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 66 ms for 5 collections
>>>>  [junit]   Approx allocation = 676MB vs 267MB; ratio to raw data size =
>>>> 2.531949057142857
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 256
>>>>  [junit]   Duration = 498ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 78 ms for 5 collections
>>>>  [junit]   Approx allocation = 583MB vs 267MB; ratio to raw data size =
>>>> 2.1853203714285714
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 1024
>>>>  [junit]   Duration = 473ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 86 ms for 4 collections
>>>>  [junit]   Approx allocation = 537MB vs 267MB; ratio to raw data size =
>>>> 2.012229685714286
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 1
>>>>  [junit]   Duration = 1338ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 197 ms for 30 collections
>>>>  [junit]   Approx allocation = 3883MB vs 267MB; ratio to raw data size
>> =
>>>> 14.5419178
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 16
>>>>  [junit]   Duration = 278ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 128 ms for 6 collections
>>>>  [junit]   Approx allocation = 752MB vs 267MB; ratio to raw data size =
>>>> 2.817227914285714
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 1024) partitions =
>> 256
>>>>  [junit]   Duration = 80ms maxConcurrency = 42
>>>>  [junit]   GC for PS Scavenge: 99 ms for 5 collections
>>>>  [junit]   Approx allocation = 596MB vs 267MB; ratio to raw data size =
>>>> 2.2341639714285715
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 1024) partitions =
>> 1024
>>>>  [junit]   Duration = 4ms maxConcurrency = 26
>>>>  [junit]   GC for PS Scavenge: 76 ms for 4 collections
>>>>  [junit]   Approx allocation = 547MB vs 267MB; ratio to raw data size =
>>>> 2.0490342285714287
>>>>  [junit]
>>>>  [junit] --------------------------------------------------
>>>>  [junit] ==================================================
>>>> 
>>>> Full results from a 16 core (32 thread) machine
>>>> 
>>>>  [junit] Threads = 1 elements = 250000 (of size 64) partitions = 1
>>>>  [junit]   Duration = 1431ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 52 ms for 4 collections
>>>>  [junit]   Approx allocation = 572MB vs 38MB; ratio to raw data size =
>>>> 15.0195638
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 64) partitions = 16
>>>>  [junit]   Duration = 1216ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 46 ms for 3 collections
>>>>  [junit]   Approx allocation = 491MB vs 38MB; ratio to raw data size =
>>>> 12.8950428
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 64) partitions = 256
>>>>  [junit]   Duration = 899ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 22 ms for 2 collections
>>>>  [junit]   Approx allocation = 458MB vs 38MB; ratio to raw data size =
>>>> 12.0219362
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 64) partitions = 1024
>>>>  [junit]   Duration = 546ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 10 ms for 1 collections
>>>>  [junit]   Approx allocation = 309MB vs 38MB; ratio to raw data size =
>>>> 8.1213414
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1
>>>>  [junit]   Duration = 3922ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 101 ms for 28 collections
>>>>  [junit]   Approx allocation = 8691MB vs 38MB; ratio to raw data size =
>>>> 227.833744
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 64) partitions = 16
>>>>  [junit]   Duration = 255ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 18 ms for 2 collections
>>>>  [junit]   Approx allocation = 654MB vs 38MB; ratio to raw data size =
>>>> 17.1485322
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 64) partitions = 256
>>>>  [junit]   Duration = 225ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 13 ms for 1 collections
>>>>  [junit]   Approx allocation = 368MB vs 38MB; ratio to raw data size =
>>>> 9.6528628
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1024
>>>>  [junit]   Duration = 399ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 21 ms for 1 collections
>>>>  [junit]   Approx allocation = 320MB vs 38MB; ratio to raw data size =
>>>> 8.4010366
>>>>  [junit]
>>>>  [junit] --------------------------------------------------
>>>>  [junit] Threads = 1 elements = 250000 (of size 256) partitions = 1
>>>>  [junit]   Duration = 1350ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 31 ms for 2 collections
>>>>  [junit]   Approx allocation = 462MB vs 83MB; ratio to raw data size =
>>>> 5.509877818181818
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 256) partitions = 16
>>>>  [junit]   Duration = 861ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 28 ms for 2 collections
>>>>  [junit]   Approx allocation = 491MB vs 83MB; ratio to raw data size =
>>>> 5.854580272727273
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 256) partitions = 256
>>>>  [junit]   Duration = 445ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 10 ms for 1 collections
>>>>  [junit]   Approx allocation = 399MB vs 83MB; ratio to raw data size =
>>>> 4.764038727272728
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 256) partitions = 1024
>>>>  [junit]   Duration = 403ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 10 ms for 1 collections
>>>>  [junit]   Approx allocation = 356MB vs 83MB; ratio to raw data size =
>>>> 4.242789363636364
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 256) partitions = 1
>>>>  [junit]   Duration = 3530ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 104 ms for 30 collections
>>>>  [junit]   Approx allocation = 9261MB vs 83MB; ratio to raw data size =
>>>> 110.35432245454545
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 256) partitions = 16
>>>>  [junit]   Duration = 231ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 17 ms for 2 collections
>>>>  [junit]   Approx allocation = 722MB vs 83MB; ratio to raw data size =
>>>> 8.607574
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 256) partitions = 256
>>>>  [junit]   Duration = 205ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 11 ms for 1 collections
>>>>  [junit]   Approx allocation = 411MB vs 83MB; ratio to raw data size =
>>>> 4.901291545454545
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 256) partitions =
>> 1024
>>>>  [junit]   Duration = 343ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 14 ms for 1 collections
>>>>  [junit]   Approx allocation = 362MB vs 83MB; ratio to raw data size =
>>>> 4.321081454545454
>>>>  [junit]
>>>>  [junit] --------------------------------------------------
>>>>  [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 1
>>>>  [junit]   Duration = 1039ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 63 ms for 5 collections
>>>>  [junit]   Approx allocation = 597MB vs 267MB; ratio to raw data size =
>>>> 2.238462485714286
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 16
>>>>  [junit]   Duration = 1285ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 50 ms for 4 collections
>>>>  [junit]   Approx allocation = 687MB vs 267MB; ratio to raw data size =
>>>> 2.5740599428571427
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 256
>>>>  [junit]   Duration = 871ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 50 ms for 4 collections
>>>>  [junit]   Approx allocation = 576MB vs 267MB; ratio to raw data size =
>>>> 2.158096
>>>>  [junit]
>>>>  [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 1024
>>>>  [junit]   Duration = 624ms maxConcurrency = 1
>>>>  [junit]   GC for PS Scavenge: 47 ms for 4 collections
>>>>  [junit]   Approx allocation = 531MB vs 267MB; ratio to raw data size =
>>>> 1.9889662
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 1
>>>>  [junit]   Duration = 3204ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 121 ms for 35 collections
>>>>  [junit]   Approx allocation = 10597MB vs 267MB; ratio to raw data
>> size =
>>>> 39.68819474285714
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 16
>>>>  [junit]   Duration = 265ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 30 ms for 3 collections
>>>>  [junit]   Approx allocation = 508MB vs 267MB; ratio to raw data size =
>>>> 1.9032105142857143
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 1024) partitions =
>> 256
>>>>  [junit]   Duration = 323ms maxConcurrency = 100
>>>>  [junit]   GC for PS Scavenge: 46 ms for 4 collections
>>>>  [junit]   Approx allocation = 582MB vs 267MB; ratio to raw data size =
>>>> 2.180992342857143
>>>>  [junit]
>>>>  [junit] Threads = 100 elements = 250000 (of size 1024) partitions =
>> 1024
>>>>  [junit]   Duration = 434ms maxConcurrency = 99
>>>>  [junit]   GC for PS Scavenge: 66 ms for 4 collections
>>>>  [junit]   Approx allocation = 544MB vs 267MB; ratio to raw data size =
>>>> 2.039445542857143
>>>>  [junit]
>>>>  [junit] --------------------------------------------------
>>>>  [junit] ==================================================
>>>> 
>>>> On Jul 12, 2014, at 9:48 PM, graham sanderson <graham@vast.com> wrote:
>>>> 
>>>> Running on 2.0.5 we’ve noticed occasional MASSIVE spikes in memory
>>>> allocation that happen under write load, but way out of proportion to
>> the
>>>> amount of data actually being written.
>>>> 
>>>> My suspicion is that the problem is related to hinted handoff, and
>>>> basically goes: some sort of non-trivial GC pause on one node (probably
>>>> caused by promotion failure of a memtable due to slow/eventual
>>>> fragmentation of the tenured gen) causes other nodes to start writing
>>>> hints for that node, and the hinting turns out to be much worse,
>>>> memory-wise, than the writes themselves.
>>>> 
>>>> Obviously there is the JIRA issue for 3.0 to stop storing hints in a
>>>> table, but in the meantime I thought I’d take a quick peek.
>>>> 
>>>> I’m in the process of writing a test to confirm what I think is
>> happening,
>>>> but I thought I’d throw this out there anyway for discussion in the
>> context
>>>> of 2.1 changes.
>>>> 
>>>> ====
>>>> 
>>>> In this case, we are adding a single column (of small delta data) to a
>>>> lot of partitions at once (let’s say a hundred thousand partitions for
>>>> each node after replication) pretty fast, with not a lot of data for
>>>> each - probably a few hundred bytes max… this really puts very little
>>>> strain on things at all. However, if we are hinting, then ALL of these
>>>> updates for one target node end up going to the same partition in
>>>> system.hints. My assumption is that this, combined with SnapTree, is a
>>>> huge problem. Whilst the SnapTree clone itself may be quick, we are
>>>> essentially appending each time, so there is probably a huge amount of
>>>> rebalancing going on, along with the lazy copy-on-write and the inherent
>>>> waste from concurrent races. I am guesstimating that we are actually
>>>> allocating several orders of magnitude more memory than the regular
>>>> inserts themselves do. The fact that our H/W can handle a very, very
>>>> high concurrent write volume probably makes the problem much worse.
>>>> 
>>>> So my first thought was: fine, why not just change hinted handoff to
>>>> shard (via an extra partition key column alongside the existing node
>>>> uuid) based on, say, a fixed-length 8-bit hash of the original row
>>>> mutation’s partition key. This would spread the load and probably solve
>>>> the problem, but it would require a schema change to system.hints, which
>>>> is annoying. That said, appending (sorted) data concurrently to the same
>>>> partition is the sort of thing that many people might do, though probably
>>>> not as fast as in this hinting case, so it probably needs to be better
>>>> anyway.
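
For illustration only (a made-up helper, not a proposed patch), the extra shard column could be derived deterministically so that hints for the same original partition key always land in the same shard and keep their relative order:

    import java.nio.ByteBuffer;

    final class HintShardHash
    {
        // A fixed-length 8-bit hash gives up to 256 shards per target node;
        // being deterministic, all hints for one original partition key share
        // a shard, so their timeuuid ordering within that shard is preserved.
        static int hintShard(ByteBuffer mutationPartitionKey)
        {
            return mutationPartitionKey.hashCode() & 0xFF;
        }
    }
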
>>>> 
>>>> So, searching, I came across
>>>> https://issues.apache.org/jira/browse/CASSANDRA-6271 for 2.1, which
>>>> replaces SnapTree… so my final question (and I will try to test this on
>>>> 2.1 soon) is: do we expect that this change will make the hint problem
>>>> moot anyway, i.e. heavy concurrent writes to the same partition key are
>>>> AOK, or does it make sense to shard hints until 3.x makes everything
>>>> groovy?
>>>> 
>>>> Thanks,
>>>> 
>>>> Graham.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Jonathan Ellis
>>>> Project Chair, Apache Cassandra
>>>> co-founder, http://www.datastax.com
>>>> @spyced
>> 
>> 
