cassandra-user mailing list archives

From Avi Kivity <...@scylladb.com>
Subject Re: vnodes: high availability
Date Wed, 17 Jan 2018 12:50:27 GMT
On the flip side, a large number of vnodes is also beneficial. For 
example, if you add a node to a 20-node cluster with many vnodes, each 
existing node will contribute about 5% of its data towards the new node, 
and all nodes will participate in streaming (meaning the impact on any 
single node is limited, and streaming completes faster).
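
To sanity-check that arithmetic, here is a minimal Python sketch (purely 
illustrative; N=20 is the example cluster size above):

    # Adding one node to an N-node cluster with many vnodes: the new node
    # ends up owning 1/(N+1) of the data, pulled evenly from all N nodes.
    N = 20
    gives_up = 1 / N - 1 / (N + 1)        # share of TOTAL data per old node
    fraction_of_own_data = gives_up / (1 / N)
    print(f"each of the {N} nodes streams ~{fraction_of_own_data:.1%} of its data")
    # -> each of the 20 nodes streams ~4.8% of its data, i.e. roughly 5%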


With a low number of vnodes, only a few nodes participate in streaming, 
which means that the cluster is left unbalanced and the impact on each 
streaming node is greater (or completion takes longer).


Similarly, with a high number of vnodes, if a node is down its work is 
distributed roughly equally among all remaining nodes. With a low number 
of vnodes the cluster becomes unbalanced.


Overall I recommend a high vnode count, and limiting the impact of 
failures in other ways (a smaller number of large nodes vs. a larger 
number of small nodes).


btw, rack-aware topology mitigates the multi-failure problem, but at the 
cost of causing imbalance during maintenance operations. I recommend 
using rack-aware topology only if you really have racks with 
single points of failure, not for other reasons.


On 01/17/2018 05:43 AM, kurt greaves wrote:
> Even with a low number of vnodes you're asking for a bad time. Even if 
> you managed to get down to 2 vnodes per node, you're still likely to 
> include double the number of nodes in any streaming/repair operation, 
> which will likely be very problematic for incremental repairs, and you 
> still won't be able to easily reason about which nodes are responsible 
> for which token ranges. It's still quite likely that a loss of 2 nodes 
> would mean some portion of the ring is down (at QUORUM). At the moment 
> I'd say steer clear of vnodes and use single tokens if you can; a lot 
> of work still needs to be done to ensure smooth operation of C* while 
> using vnodes, and they are much more difficult to reason about (which 
> is probably the reason no one has bothered to do the math). If you're 
> really keen on the math, your best bet is to do it yourself, because 
> it's not a point of interest for many C* devs, plus probably a lot of 
> us wouldn't remember enough math to know how to approach it.
>
> If you want to get out of this situation you'll need to do a DC 
> migration to a new DC with a better configuration of 
> snitch/replication strategy/racks/tokens.
>
>
> On 16 January 2018 at 21:54, Kyrylo Lebediev <Kyrylo_Lebediev@epam.com> wrote:
>
>     Thank you for this valuable info, Jon.
>     I guess both you and Alex are referring to the improved vnode
>     allocation method
>     https://issues.apache.org/jira/browse/CASSANDRA-7032
>     which was implemented in 3.0.
>
>     Based on your info and the comments in the ticket, it's really a bad
>     idea to have a small number of vnodes on versions using the old
>     allocation method because of hot spots, so it's not an option for
>     my particular case (v2.1) :(
>
>     [As far as I can see from the source code, this new method
>     wasn't backported to 2.1.]
>
>
>
>     Regards,
>
>     Kyrill
>
>
>
>     ------------------------------------------------------------------------
>     *From:* Jon Haddad <jonathan.haddad@gmail.com> on behalf of Jon
>     Haddad <jon@jonhaddad.com>
>     *Sent:* Tuesday, January 16, 2018 8:21:33 PM
>
>     *To:* user@cassandra.apache.org
>     *Subject:* Re: vnodes: high availability
>     We’ve used 32 tokens pre 3.0.  It’s been a mixed result due to the
>     randomness.  There’s going to be some imbalance; the amount of
>     imbalance depends on luck, unfortunately.
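>
>     (A rough way to see that luck factor is to simulate purely random
>     token placement — the pre-3.0 behaviour — and measure how far the
>     most-loaded node drifts from its fair share. A minimal sketch, not
>     anyone's production tooling:
>
>     import random
>
>     def max_ownership_ratio(nodes=20, tokens_per_node=32, trials=100):
>         """Max node ownership vs. fair share, averaged over trials."""
>         worst = []
>         for _ in range(trials):
>             ring = sorted((random.random(), n)
>                           for n in range(nodes)
>                           for _ in range(tokens_per_node))
>             owned = [0.0] * nodes
>             prev = ring[-1][0] - 1.0   # wrap around the ring
>             for tok, n in ring:
>                 owned[n] += tok - prev
>                 prev = tok
>             worst.append(max(owned) * nodes)  # 1.0 == perfectly balanced
>         return sum(worst) / trials
>
>     print(max_ownership_ratio(tokens_per_node=32))  # typically ~1.3-1.5
>     print(max_ownership_ratio(tokens_per_node=4))   # worse, often 2.0+
>
>     With 32 random tokens the most-loaded node tends to own ~1.3-1.5x
>     its fair share; with 4 random tokens it is considerably worse.)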
>
>     I’m interested to hear your results using 4 tokens, would you mind
>     letting the ML know your experience when you’ve done it?
>
>     Jon
>
>>     On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev
>>     <Kyrylo_Lebediev@epam.com> wrote:
>>
>>     Agree with you, Jon.
>>     Actually, this cluster was configured by my 'predecessor', and
>>     [fortunately for him] we've never met :)
>>     We're using version 2.1.15 and can't upgrade because of the legacy
>>     Netflix Astyanax client in use.
>>
>>     Below in the thread, Alex mentioned that setting vnodes to a value
>>     lower than 256 is recommended only for C* versions > 3.0 (the token
>>     allocation algorithm was improved in C* 3.0).
>>
>>     Jon,
>>     Do you have positive experience setting up clusters with vnodes <
>>     256 on C* 2.1?
>>
>>     vnodes=32 is also too high for my case (we'd need many more than 32
>>     servers per AZ in order to get a 'reliable' cluster).
>>     vnodes=4 seems to be better from the HA + balancing trade-off.
>>
>>     Thanks,
>>     Kyrill
>>     ------------------------------------------------------------------------
>>     *From:* Jon Haddad <jonathan.haddad@gmail.com> on behalf of Jon
>>     Haddad <jon@jonhaddad.com>
>>     *Sent:* Tuesday, January 16, 2018 6:44:53 PM
>>     *To:* user
>>     *Subject:* Re: vnodes: high availability
>>     While all the token math is helpful, I have to also call out the
>>     elephant in the room:
>>
>>     You have not correctly configured Cassandra for production.
>>
>>     If you had used the correct endpoint snitch & network topology
>>     strategy, you would be able to withstand the complete failure of
>>     an entire availability zone at QUORUM, or two if you queried at
>>     CL=ONE.
>>
>>     You are correct about 256 tokens causing issues, it’s one of the
>>     reasons why we recommend 32.  I’m curious how things behave going
>>     as low as 4, personally, but I haven’t done the math / tested it yet.
>>
>>
>>
>>>     On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev
>>>     <Kyrylo_Lebediev@epam.com> wrote:
>>>
>>>     ...to me it sounds like 'C* isn't as highly available by
>>>     design as it's declared to be'.
>>>     More nodes in a cluster means a higher probability of simultaneous
>>>     node failures.
>>>     And from a high-availability standpoint, it looks like the
>>>     situation is made even worse by the recommended setting vnodes=256.
>>>
>>>     I need to do some math to get numbers/formulas, but right now the
>>>     situation doesn't seem promising.
>>>     In case somebody from the C* developers/architects is reading this
>>>     message, I'd be grateful for some links to the calculations of C*
>>>     reliability on which these decisions were based.
>>>
>>>     Regards,
>>>     Kyrill
>>>     ------------------------------------------------------------------------
>>>     *From:* kurt greaves <kurt@instaclustr.com>
>>>     *Sent:* Tuesday, January 16, 2018 2:16:34 AM
>>>     *To:* User
>>>     *Subject:* Re: vnodes: high availability
>>>     Yeah it's very unlikely that you will have 2 nodes in the
>>>     cluster with NO intersecting token ranges (vnodes) for an RF of
>>>     3 (probably even 2).
>>>
>>>     If node A goes down all 256 ranges will go down, and considering
>>>     there are only 49 other nodes, all with 256 vnodes each, it's
>>>     very likely that every node will be responsible for some range A
>>>     was also responsible for. I'm not sure what the exact math is,
>>>     but think of it this way: for each node, if any of its 256 token
>>>     ranges overlaps on the ring with a token range on node A (i.e. it
>>>     falls within the next RF-1 or previous RF-1 token ranges), those
>>>     token ranges will be down at QUORUM.
>>>
>>>     Because token range assignment just uses rand() under the hood,
>>>     I'm sure you could prove that any 2 nodes going down will always
>>>     result in a loss of QUORUM for some token range.
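>>>
>>>     (That claim is easy to check empirically. A minimal simulation
>>>     sketch in Python — assuming purely random token placement and a
>>>     simplified SimpleStrategy-style "next distinct nodes on the ring"
>>>     replica walk:
>>>
>>>     import random
>>>     from itertools import combinations
>>>
>>>     def every_pair_shares_a_range(nodes=50, vnodes=256, rf=3):
>>>         """True if every pair of nodes co-owns some token range,
>>>         i.e. losing ANY 2 nodes breaks QUORUM somewhere."""
>>>         ring = sorted((random.random(), n) for n in range(nodes)
>>>                       for _ in range(vnodes))
>>>         owners = [n for _, n in ring]
>>>         pairs = set()
>>>         for i in range(len(owners)):
>>>             replicas, j = [], i
>>>             while len(replicas) < rf:   # walk to rf distinct nodes
>>>                 nxt = owners[j % len(owners)]
>>>                 if nxt not in replicas:
>>>                     replicas.append(nxt)
>>>                 j += 1
>>>             pairs.update(combinations(sorted(replicas), 2))
>>>         return len(pairs) == nodes * (nodes - 1) // 2
>>>
>>>     print(every_pair_shares_a_range())  # True essentially every run
>>>
>>>     With 50 nodes x 256 vnodes there are 12800 ranges contributing 3
>>>     node-pairs each against only 1225 possible pairs, so every pair
>>>     gets covered with overwhelming probability.)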
>>>
>>>     On 15 January 2018 at 19:59, Kyrylo Lebediev
>>>     <Kyrylo_Lebediev@epam.com> wrote:
>>>
>>>         Thanks Alexander!
>>>
>>>         I don't have an MS in math either, unfortunately. :)
>>>
>>>         Not sure, but it seems to me that the probability of 2/49 in
>>>         your explanation doesn't take into account that vnode
>>>         endpoints are almost evenly distributed across all nodes (at
>>>         least that's what I can see from "nodetool ring" output).
>>>
>>>         http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
>>>         Of course this vnodes illustration is a theoretical one, but
>>>         there are no 2 nodes on that diagram that can be switched off
>>>         without losing a key range (at CL=QUORUM).
>>>
>>>         That's because vnodes_per_node=8 > Nnodes=6.
>>>         As far as I understand, the situation gets worse as the
>>>         vnodes_per_node/Nnodes ratio increases.
>>>         Please correct me if I'm wrong.
>>>
>>>         How would the situation differ from this example by
>>>         DataStax, if we had a real-life 6-node cluster with 8
>>>         vnodes on each node?
>>>
>>>         Regards,
>>>         Kyrill
>>>
>>>         ------------------------------------------------------------------------
>>>         *From:* Alexander Dejanovski <alex@thelastpickle.com>
>>>         *Sent:* Monday, January 15, 2018 8:14:21 PM
>>>
>>>         *To:* user@cassandra.apache.org
>>>         *Subject:* Re: vnodes: high availability
>>>         I was corrected off list that the odds of losing data when 2
>>>         nodes are down aren't dependent on the number of vnodes, but
>>>         only on the number of nodes.
>>>         The more vnodes, the smaller the chunks of data you may
>>>         lose, and vice versa.
>>>         I officially suck at statistics, as expected :)
>>>
>>>         On Mon, Jan 15, 2018 at 17:55, Alexander Dejanovski
>>>         <alex@thelastpickle.com> wrote:
>>>
>>>             Hi Kyrylo,
>>>
>>>             the situation is a bit more nuanced than shown in the
>>>             Datastax diagram, which is fairly theoretical.
>>>             If you're using SimpleStrategy, there is no rack
>>>             awareness. Since vnode distribution is purely random,
>>>             and the replica for a vnode is placed on the node
>>>             that owns the next vnode in token order (yeah, that's
>>>             not easy to formulate), you end up with statistics only.
>>>
>>>             I kinda suck at maths but I'm going to risk making a
>>>             fool of myself :)
>>>
>>>             The odds for one vnode to be replicated on a given other
>>>             node are, in your case, 2/49 (out of 49 remaining nodes, 2
>>>             replicas need to be placed).
>>>             Given you have 256 vnodes, the odds for at least one
>>>             vnode of a single node to exist on another one are
>>>             256*(2/49) = 10.4%
>>>             Since the relationship is bi-directional (the odds are the
>>>             same for node B to have a vnode replicated on node A as
>>>             the opposite), that doubles the odds of 2 nodes
>>>             being both replicas for at least one vnode: 20.8%.
>>>
>>>             Having a smaller number of vnodes will decrease the
>>>             odds, just as having more nodes in the cluster will.
>>>             (Now once again, I hope my maths aren't fully wrong; I'm
>>>             pretty rusty in that area...)
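>>>
>>>             (One way to dodge the analytic math entirely is a quick
>>>             Monte Carlo estimate. A minimal sketch, assuming purely
>>>             random token placement and a simplified SimpleStrategy
>>>             replica walk; nodes 0 and 1 stand in for any two
>>>             specific nodes:
>>>
>>>             import random
>>>
>>>             def overlap_odds(nodes=50, vnodes=256, rf=3, trials=20):
>>>                 """How often do two given nodes both hold replicas
>>>                 of at least one common token range?"""
>>>                 hits = 0
>>>                 for _ in range(trials):
>>>                     ring = sorted((random.random(), n)
>>>                                   for n in range(nodes)
>>>                                   for _ in range(vnodes))
>>>                     owners = [n for _, n in ring]
>>>                     for i in range(len(owners)):
>>>                         replicas, j = set(), i
>>>                         while len(replicas) < rf:
>>>                             replicas.add(owners[j % len(owners)])
>>>                             j += 1
>>>                         if {0, 1} <= replicas:
>>>                             hits += 1
>>>                             break
>>>                 return hits / trials
>>>
>>>             print(overlap_odds(vnodes=256))  # ~1.0 at 256 vnodes
>>>             print(overlap_odds(vnodes=4))    # roughly 0.3-0.5
>>>
>>>             At 256 vnodes two specific nodes overlap essentially
>>>             always; only at very low vnode counts do the odds drop
>>>             meaningfully.)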
>>>
>>>             How many queries this will affect is a different
>>>             question, as it depends on which partitions currently
>>>             exist and are queried in the unavailable token ranges.
>>>
>>>             Then you have rack awareness that comes with
>>>             NetworkTopologyStrategy :
>>>             If the number of replicas (3 in your case) is
>>>             proportional to the number of racks, Cassandra will
>>>             spread the replicas across different racks.
>>>             In that situation, you can theoretically lose as many
>>>             nodes as you want in a single rack, you will still have
>>>             two other replicas available to satisfy quorum in the
>>>             remaining racks.
>>>             If you start losing nodes in different racks, we're back
>>>             to doing maths (but the odds will get slightly different).
>>>
>>>             That makes maintenance predictable because you can shut
>>>             down as many nodes as you want in a single rack without
>>>             losing QUORUM.
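>>>
>>>             (To illustrate the mechanism, here is a toy rack-aware
>>>             placement in Python — in the spirit of
>>>             NetworkTopologyStrategy but heavily simplified: walk the
>>>             ring, skipping nodes whose rack is already used.
>>>
>>>             def rack_aware_replicas(owners, racks, start, rf=3):
>>>                 """owners: node ids in ring order; racks: node -> rack."""
>>>                 replicas, used_racks, j = [], set(), start
>>>                 while len(replicas) < rf and j < start + len(owners):
>>>                     node = owners[j % len(owners)]
>>>                     if node not in replicas and racks[node] not in used_racks:
>>>                         replicas.append(node)
>>>                         used_racks.add(racks[node])
>>>                     j += 1
>>>                 return replicas
>>>
>>>             # 6 nodes spread over 3 racks, one token each:
>>>             owners = [0, 1, 2, 3, 4, 5]
>>>             racks = {n: "rack%d" % (n % 3) for n in range(6)}
>>>             print(rack_aware_replicas(owners, racks, start=0))
>>>             # -> [0, 1, 2]: one replica per rack, so losing every
>>>             # node in rack0 still leaves 2 of 3 replicas (QUORUM ok)
>>>
>>>             The real placement logic is more involved, but this is
>>>             the property that makes single-rack maintenance safe.)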
>>>
>>>             Feel free to correct my numbers if I'm wrong.
>>>
>>>             Cheers,
>>>
>>>
>>>
>>>
>>>
>>>             On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev
>>>             <Kyrylo_Lebediev@epam.com> wrote:
>>>
>>>                 Thanks, Rahul.
>>>                 But in your example, the simultaneous loss of Node3
>>>                 and Node6 leads to the loss of ranges N, C, J at
>>>                 consistency level QUORUM.
>>>
>>>                 As far as I understand, in case vnodes >
>>>                 N_nodes_in_cluster and endpoint_snitch=SimpleSnitch,
>>>                 since:
>>>
>>>                 1) "secondary" replicas are placed on the two nodes
>>>                 'next' to the node responsible for a range (in case
>>>                 of RF=3)
>>>                 2) there are a lot of vnodes on each node
>>>                 3) ranges are evenly distributed between vnodes in
>>>                 case of SimpleSnitch,
>>>
>>>                 we get all physical nodes (servers) having mutually
>>>                 adjacent token ranges.
>>>                 Is that correct?
>>>
>>>                 At least in the case of my real-world ~50-node cluster
>>>                 with vnodes=256, RF=3, this command:
>>>
>>>                 nodetool ring | grep '^<ip-prefix>' | awk '{print
>>>                 $1}' | uniq | grep -B2 -A2 '<ip_of_a_node>' | grep
>>>                 -v '<ip_of_a_node>' | grep -v '^--' | sort | uniq |
>>>                 wc -l
>>>
>>>                 returned a number equal to Nnodes - 1, which means
>>>                 that I can't switch off 2 nodes at the same time
>>>                 without losing some keyrange at CL=QUORUM.
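>>>
>>>                 (For readers without cluster access, a rough Python
>>>                 equivalent of that pipeline — a hypothetical helper,
>>>                 where ring_addresses is assumed to be the first
>>>                 column of `nodetool ring` output, in token order:
>>>
>>>                 def distinct_neighbours(ring_addresses, target, width=2):
>>>                     """Count distinct other nodes within `width` ring
>>>                     positions of any of target's tokens. If this is
>>>                     Nnodes-1, no second node can go down without
>>>                     losing QUORUM somewhere (RF=3)."""
>>>                     seen = set()
>>>                     n = len(ring_addresses)
>>>                     for i, addr in enumerate(ring_addresses):
>>>                         if addr == target:
>>>                             for k in range(i - width, i + width + 1):
>>>                                 seen.add(ring_addresses[k % n])
>>>                     seen.discard(target)
>>>                     return len(seen)
>>>
>>>                 This approximates the shell pipeline's -B2/-A2
>>>                 window; with RF=3 those are the nodes that can share
>>>                 a replica set with the target.)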
>>>
>>>                 Thanks,
>>>                 Kyrill
>>>                 ------------------------------------------------------------------------
>>>                 *From:* Rahul Neelakantan <rahul@rahul.be>
>>>                 *Sent:* Monday, January 15, 2018 5:20:20 PM
>>>                 *To:* user@cassandra.apache.org
>>>                 *Subject:* Re: vnodes: high availability
>>>                 Not necessarily. It depends on how the token ranges
>>>                 for the vNodes are assigned to them. For example,
>>>                 take a look at this diagram:
>>>                 http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
>>>
>>>                 In the vNode part of the diagram, you will see that
>>>                 the loss of Node 3 and Node 6 will still not have any
>>>                 effect on token range A. But yes, if you lose two
>>>                 nodes that both have token range A assigned to them
>>>                 (say Node 1 and Node 2), you will have
>>>                 unavailability with your specified configuration.
>>>
>>>                 You can sort of circumvent this by using the
>>>                 DataStax Java Driver and having the client recognize
>>>                 a degraded cluster and operate temporarily in a
>>>                 downgraded consistency mode:
>>>
>>>                 http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html
>>>
>>>                 - Rahul
>>>
>>>                 On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev
>>>                 <Kyrylo_Lebediev@epam.com> wrote:
>>>
>>>                     Hi,
>>>
>>>                     Let's say we have a C* cluster with the following
>>>                     parameters:
>>>                      - 50 nodes in the cluster
>>>                      - RF=3
>>>                      - vnodes=256 per node
>>>                      - CL for some queries = QUORUM
>>>                      - endpoint_snitch = SimpleSnitch
>>>
>>>                     Is it correct that any 2 nodes being down will
>>>                     cause unavailability of a keyrange at CL=QUORUM?
>>>
>>>                     Regards,
>>>                     Kyrill
>>>
>>>
>>>
>>>
>>>             --
>>>             -----------------
>>>             Alexander Dejanovski
>>>             France
>>>             @alexanderdeja
>>>
>>>             Consultant
>>>             Apache Cassandra Consulting
>>>             http://www.thelastpickle.com
>>>
>>>         --
>>>         -----------------
>>>         Alexander Dejanovski
>>>         France
>>>         @alexanderdeja
>>>
>>>         Consultant
>>>         Apache Cassandra Consulting
>>>         http://www.thelastpickle.com
>>>
>
>

