From commits-return-213323-archive-asf-public=cust-asf.ponee.io@cassandra.apache.org Fri Aug 24 00:27:04 2018 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by mx-eu-01.ponee.io (Postfix) with SMTP id A6CDF18061A for ; Fri, 24 Aug 2018 00:27:03 +0200 (CEST) Received: (qmail 99377 invoked by uid 500); 23 Aug 2018 22:27:02 -0000 Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@cassandra.apache.org Delivered-To: mailing list commits@cassandra.apache.org Received: (qmail 99366 invoked by uid 99); 23 Aug 2018 22:27:02 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 23 Aug 2018 22:27:02 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 52E88180958 for ; Thu, 23 Aug 2018 22:27:02 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -109.501 X-Spam-Level: X-Spam-Status: No, score=-109.501 tagged_above=-999 required=6.31 tests=[ENV_AND_HDR_SPF_MATCH=-0.5, KAM_ASCII_DIVIDERS=0.8, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001, USER_IN_DEF_SPF_WL=-7.5, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id mg_CSR8XdI4Z for ; Thu, 23 Aug 2018 22:27:01 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTP id D950C5F1B9 for ; Thu, 23 Aug 2018 22:27:00 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 5E1F4E0111 for ; Thu, 23 Aug 2018 22:27:00 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 16F0A23F98 for ; Thu, 23 Aug 2018 22:27:00 +0000 (UTC) Date: Thu, 23 Aug 2018 22:27:00 +0000 (UTC) From: "Benedict (JIRA)" To: commits@cassandra.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Comment Edited] (CASSANDRA-14660) Improve TokenMetaData cache populating performance for large cluster MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/CASSANDRA-14660?page=3Dcom.atla= ssian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId= =3D16590859#comment-16590859 ]=20 Benedict edited comment on CASSANDRA-14660 at 8/23/18 10:26 PM: ---------------------------------------------------------------- bq. In HasMultimap the value collection is a HashSet, which does not work f= or us. E.g getting tokens for a specific endpoints will not give us sorted = token anymore. {{MultimapBuilder}} can be used to construct a {{Multimap}} with hashed key= s and {{TreeSet}} values. We would not want to use the {{build(Multimap)}} method, as it invokes {{put(K, V)}} for each entry, but if we instea= d iterated over {{asMap().entrySet()}} and added each {{TreeSet}} individua= lly, we should get the linear complexity construction of the {{TreeSet}} co= ntainers. Thinking about it, it's probable this change alone (iterating over {{asMap(= ).entrySet()}}) would be sufficient to restore adequate performance, even i= f we continued to maintain a {{TreeMap}} for the key lookup (i.e. continued= to use a {{TreeMultimap}}). This should be a 3 line change, in {{SortedBi= MultiValMap}} and tide us over until we can properly overhaul {{TokenMetada= ta}}, and have no wider semantic implications to worry about. was (Author: benedict): bq. In HasMultimap the value collection is a HashSet, which does not work f= or us. E.g getting tokens for a specific endpoints will not give us sorted = token anymore. {{MultimapBuilder}} can be used to construct a {{Multimap}} with hashed key= s and {{TreeSet}} values. We would not want to use the {{build(Multimap)}} method, as it invokes {{put(K, V)}} for each entry, but if we instea= d iterated over {{asMap()}} and added each {{TreeSet}} individually, we sho= uld get the linear complexity construction of the {{TreeSet}} containers. Thinking about it, it's probable this change alone would be sufficient to r= estore adequate performance, even if we continued to maintain a {{TreeMap}}= for the key lookup (i.e. continued to use a TreeMultimap). This should be= a 3 line change, in {{SortedBiMultiValMap}} and tide us over until we can = properly overhaul {{TokenMetadata}}, and have no wider semantic implication= s to worry about. > Improve TokenMetaData cache populating performance for large cluster > -------------------------------------------------------------------- > > Key: CASSANDRA-14660 > URL: https://issues.apache.org/jira/browse/CASSANDRA-1466= 0 > Project: Cassandra > Issue Type: Improvement > Components: Coordination > Environment: Benchmark is on MacOSX 10.13.5, 2017 MBP > Reporter: Pengchao Wang > Priority: Critical > Labels: Performance > Fix For: 3.0.x, 3.11.x, 4.x > > Attachments: 14660-trunk.txt, TokenMetaDataBenchmark.java > > > TokenMetaData#cachedOnlyTokenMap is a method C* used to get a consistent = token and topology view on coordinations without paying read lock cost. Upo= n first read the method acquire a synchronize lock and generate a copy of m= ajor token meta data structures and cached it, and upon every token meta da= ta changes(due to gossip changes), the cache get cleared and next read will= taking care of cache population. > For small to medium size clusters this strategy works pretty well. But la= rge clusters can actually be suffered from the locking since cache populati= ng is much slower. On one of our=C2=A0largest cluster (~1000 nodes,=C2=A0 1= 25k tokens, C* 3.0.15)=C2=A0 each cache population take about 500~700ms, an= d during=C2=A0that there are no requests can go through since synchronize l= ock was acquired. This caused waves of timeouts errors when there are large= amount gossip messages propagating cross the cluster, such as in the case = of cluster restarting. > Base on profiling we found that the cost mostly comes from copying tokenT= oEndpointMap. It is a SortedBiMultiValueMap made from a forward map use Tre= eMap and a reverse map use guava TreeMultiMap. There is an optimization in = TreeMap helps reduce copying complexity from O(N*log(N))=C2=A0to O(N)=C2=A0= when copying from already ordered data. But guava's TreeMultiMap copying mi= ssed that optimization and make it ~10 times slower than it actually need t= o be on our size of cluster. > The patch attached to the issue replace the reverse TreeMultiMap to= a vanilla TreeMap> in SortedBiMultiValueMap to make sure we = can copy it O(N) time. > I also attached a benchmark script (TokenMetaDataBenchmark.java), which s= imulates a large cluster then measures average latency for TokenMetaData ca= che populating. > Benchmark result before and after that patch: > {code:java} > trunk:=20 > before 100ms, after 13ms > 3.0.x:=20 > before 199ms, after 15ms > =C2=A0{code} > (On 3.0.x even the forward TreeMap copying is slow, the O(N*log(N)) to O(= N) optimization is not applied because the key comparator is dynamically cr= eated and TreeMap cannot determine the source and dest are in same order) -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org For additional commands, e-mail: commits-help@cassandra.apache.org