cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benedict (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-7282) Faster Memtable map
Date Tue, 10 Feb 2015 23:23:13 GMT


Benedict commented on CASSANDRA-7282:

I've pushed an updated branch [here|]
that I hope addresses all of your concerns.

bq. LongNonBlockingHashOrderedMapTest times out on default settings for ant long-test on my

The test is really designed to be run on a machine with at least 8 physical cores, and it
may take an excessively long time on fewer. In general it's hard to make a test like this
terminate in a predictable time horizon, and the longer it runs the better (ideally we'd just
leave it running for days). I've tweaked the settings slightly to make it run better on lower
core counts but still actually perform some useful work.

bq. this.index aliasing in predecessor appears redundant

This is necessary to save fetching a different version of the index each time, and incurring
the memory access costs with each reference

bq. Would it be worth making maybeResize threshold configurable? What's our reasoning for
targeting 66%?

66% seems pretty reasonable (defaults are usually somewhere between 50% and 75% for hash tables).
We don't generally expose this kind of internal feature to users, but I am not totally opposed
to doing so. I doubt a value much different from this is likely to help much, though; a value
of < 50% could mean the index becomes more stale and see we have more scanning forwards
to do on lookups, whereas a value of above 100% means we start seeing significant worst case

bq. in putIfAbsent / maybeResize / resize: Just to confirm - are we relying on the executor
to drop duplicate resize calls on the floor? If so, it might be worth documenting.

It doesn't matter if it does or does not; it only ever resizes upwards. We also shouldn't
ever have a duplicate resize call, unless we haven't executed by the time the table's contents
have doubled.

bq. Why do we bit shift on generation of index node instead of just using integer value?

This is a standard practice for me: shifts in multiples of 10 are powers of ~1000 (i.e. 1024).
I find it much neater than lots of "magic constants" when you get above 1024 (e.g. 1048576
or 1073741824) , or snippets of  1024L * 1024L * 1024L. 1 << 30 is much clearer (representing
1024^3). However it's pretty marginal utility when working with small quantities, so I've
switched it. I'd prefer to see it used more widely in the codebase though.

I've spent a while trying to comment very clearly the algorithm at the top of the class, with
diagrams. Please let me know if it's a success, and if there's anything more I can do to clarify

> Faster Memtable map
> -------------------
>                 Key: CASSANDRA-7282
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>              Labels: performance
>             Fix For: 3.0
>         Attachments: jasobrown-sample-run.txt, profile.yaml, reads.svg, run1.svg, writes.svg
> Currently we maintain a ConcurrentSkipLastMap of DecoratedKey -> Partition in our
memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use
a hash partitioner, it occurs to me we could maintain a hybrid ordered list / hash map. The
list would impose the normal order on the collection, but a hash index would live alongside
as part of the same data structure, simply mapping into the list and permitting O(1) lookups
and inserts.
> I've chosen to implement this initial version as a linked-list node per item, but we
can optimise this in future by storing fatter nodes that permit a cache-line's worth of hashes
to be checked at once,  further reducing the constant factor costs for lookups.

This message was sent by Atlassian JIRA

View raw message