cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Pengchao Wang (JIRA)" <>
Subject [jira] [Updated] (CASSANDRA-14660) Improve TokenMetaData cache populating performance for large cluster
Date Fri, 24 Aug 2018 08:31:00 GMT


Pengchao Wang updated CASSANDRA-14660:
    Attachment: 14660-trunk.txt

> Improve TokenMetaData cache populating performance for large cluster
> --------------------------------------------------------------------
>                 Key: CASSANDRA-14660
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Coordination
>         Environment: Benchmark is on MacOSX 10.13.5, 2017 MBP
>            Reporter: Pengchao Wang
>            Priority: Critical
>              Labels: Performance
>             Fix For: 3.0.x, 3.11.x, 4.x
>         Attachments: 14660-3.0.txt, 14660-trunk.txt,
> TokenMetaData#cachedOnlyTokenMap is a method C* used to get a consistent token and topology
view on coordinations without paying read lock cost. Upon first read the method acquire a
synchronize lock and generate a copy of major token meta data structures and cached it, and
upon every token meta data changes(due to gossip changes), the cache get cleared and next
read will taking care of cache population.
> For small to medium size clusters this strategy works pretty well. But large clusters
can actually be suffered from the locking since cache populating is much slower. On one of
our largest cluster (~1000 nodes,  125k tokens, C* 3.0.15)  each cache population take
about 500~700ms, and during that there are no requests can go through since synchronize lock
was acquired. This caused waves of timeouts errors when there are large amount gossip messages
propagating cross the cluster, such as in the case of cluster restarting.
> Base on profiling we found that the cost mostly comes from copying tokenToEndpointMap.
It is a SortedBiMultiValueMap made from a forward map use TreeMap and a reverse map use guava
TreeMultiMap. There is an optimization in TreeMap helps reduce copying complexity from O(N*log(N)) to
O(N) when copying from already ordered data. But guava's TreeMultiMap copying missed that
optimization and make it ~10 times slower than it actually need to be on our size of cluster.
> The patch attached to the issue replace the reverse TreeMultiMap<K, V> to a vanilla
TreeMap<K, TreeSet<V>> in SortedBiMultiValueMap to make sure we can copy it O(N)
> I also attached a benchmark script (, which simulates a large
cluster then measures average latency for TokenMetaData cache populating.
> Benchmark result before and after that patch:
> {code:java}
> trunk: 
> before 100ms, after 13ms
> 3.0.x: 
> before 199ms, after 15ms
>  {code}
> (On 3.0.x even the forward TreeMap copying is slow, the O(N*log(N)) to O(N) optimization
is not applied because the key comparator is dynamically created and TreeMap cannot determine
the source and dest are in same order)

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message