From: Richard Low
Date: Mon, 10 Jun 2013 22:02:20 +0100
Subject: Re: Why so many vnodes?
To: user@cassandra.apache.org

Hi Theo,

The number 256 (call it T, and let N be the number of nodes) was chosen to give good load balancing for random token assignments for most cluster sizes. For small T, a random choice of initial tokens will in most cases give a poor distribution of data. The larger T is, the closer to uniform the distribution will be, with increasing probability.

Also, for small T, a newly added node won't have many ranges to split, so it won't be able to take an even slice of the data.

For these reasons T should be large. But if it is too large, there are too many slices to keep track of, as you say. The function that finds which keys live where becomes more expensive, and operations that deal with individual vnodes, e.g. repair, become slow. (An extreme example is SELECT * LIMIT 1, which, when there is no data, has to scan each vnode in turn in search of a single row. This is O(NT), and even for quite small T it takes seconds to complete.)

So 256 was chosen as a reasonable balance. I don't think most users will find it too slow; users with extremely large clusters may need to increase it.

Richard.

On 10 June 2013 18:55, Theo Hultberg wrote:

> I'm not sure I follow what you mean, or if I've misunderstood what
> Cassandra is telling me. Each node has 256 vnodes (or tokens, as the
> preferred name seems to be).
> When I run `nodetool status`, each node is
> reported as having 256 vnodes, regardless of how many nodes are in the
> cluster. A single-node cluster has 256 vnodes on the single node; a six
> node cluster has 256 vnodes on each machine, making 1536 vnodes in total.
> When I run `SELECT tokens FROM system.peers` or `nodetool ring`, each node
> lists 256 tokens.
>
> This is different from how it works in Riak and Voldemort, if I'm not
> mistaken, and that is the source of my confusion.
>
> T#
>
> On Mon, Jun 10, 2013 at 4:54 PM, Milind Parikh wrote:
>
>> There are n vnodes regardless of the size of the physical cluster.
>> Regards
>> Milind
>>
>> On Jun 10, 2013 7:48 AM, "Theo Hultberg" wrote:
>>
>>> Hi,
>>>
>>> The default number of vnodes is 256; is there any significance in this
>>> number? Since Cassandra's vnodes don't work like, for example, Riak's,
>>> where there is a fixed number of vnodes distributed evenly over the
>>> nodes, why so many? Even with a moderately sized cluster you get
>>> thousands of slices. Does this matter? If your cluster grows to over
>>> thirty machines and you start looking at ten thousand slices, would
>>> that be a problem? I guess that traversing a list of a thousand or ten
>>> thousand slices to find where a token lives isn't a huge problem, but
>>> are there any other up- or downsides to having a small or large number
>>> of vnodes per node?
>>>
>>> I understand the benefits of splitting up the ring into pieces, for
>>> example to be able to stream data from more nodes when bootstrapping a
>>> new one, but that works even if each node only has, say, 32 vnodes
>>> (unless your cluster is truly huge).
>>>
>>> yours,
>>> Theo
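[Editor's note: Richard's load-balancing argument is easy to check empirically. The sketch below is illustrative Python, not part of the original thread and not Cassandra code (all names are made up); it gives each of N nodes T random tokens on a 2^64 ring and measures how far the busiest node is from an even share.]

```python
import random

def ownership_fractions(num_nodes, tokens_per_node, seed=42):
    """Simulate random token assignment on a ring of size 2**64.

    Each node draws `tokens_per_node` random tokens; a token owns the
    range from the previous token on the ring up to itself. Returns
    the fraction of the ring owned by each node.
    """
    rng = random.Random(seed)
    ring_size = 2 ** 64
    ring = []  # (token, owning node) pairs
    for node in range(num_nodes):
        for _ in range(tokens_per_node):
            ring.append((rng.randrange(ring_size), node))
    ring.sort()

    owned = [0] * num_nodes
    # Wrap-around: the last token acts as the predecessor of the first.
    prev = ring[-1][0] - ring_size
    for token, node in ring:
        owned[node] += token - prev
        prev = token
    return [o / ring_size for o in owned]

def imbalance(fractions):
    """Ratio of the most-loaded node's share to the ideal even share."""
    return max(fractions) * len(fractions)

# Six nodes, increasing tokens per node (T): imbalance should shrink.
for t in (1, 32, 256):
    print(t, round(imbalance(ownership_fractions(6, t)), 2))
```

In runs like this, T = 1 typically leaves the busiest of six nodes well above its fair share, while T = 256 stays close to even, which matches Richard's point that larger T gives a closer-to-uniform distribution with higher probability.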