Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Reply-To: user@cassandra.apache.org
MIME-Version: 1.0
From: Alain RODRIGUEZ
Date: Tue, 11 Jun 2013 08:37:33 +0200
Subject: Re: Why so many vnodes?
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=001a11c373600323c904dedb2305

--001a11c373600323c904dedb2305
Content-Type: text/plain; charset=ISO-8859-1

I think he actually meant *increase*, for this reason: "For small T, a
random choice of initial tokens will in most cases give a poor distribution
of data. The larger T is, the closer to uniform the distribution will be,
with increasing probability."

Alain


2013/6/11 Theo Hultberg <theo@iconara.net>

> thanks, that makes sense, but I assume in your last sentence you mean
> decrease it for large clusters, not increase it?
>
> T#
>
>
> On Mon, Jun 10, 2013 at 11:02 PM, Richard Low <richard@wentnet.com> wrote:
>
>> Hi Theo,
>>
>> The number (let's call it T and the number of nodes N) 256 was chosen to
>> give good load balancing for random token assignments for most cluster
>> sizes. For small T, a random choice of initial tokens will in most cases
>> give a poor distribution of data. The larger T is, the closer to uniform
>> the distribution will be, with increasing probability.
>>
>> Also, for small T, when a new node is added, it won't have many ranges to
>> split so won't be able to take an even slice of the data.
>>
>> For this reason T should be large. But if it is too large, there are too
>> many slices to keep track of as you say. The function to find which keys
>> live where becomes more expensive and operations that deal with individual
>> vnodes e.g. repair become slow. (An extreme example is SELECT * LIMIT 1,
>> which when there is no data has to scan each vnode in turn in search of a
>> single row. This is O(NT) and for even quite small T takes seconds to
>> complete.)
>>
>> So 256 was chosen to be a reasonable balance. I don't think most users
>> will find it too slow; users with extremely large clusters may need to
>> increase it.
>>
>> Richard.
>>
>>
>> On 10 June 2013 18:55, Theo Hultberg <theo@iconara.net> wrote:
>>
>>> I'm not sure I follow what you mean, or if I've misunderstood what
>>> Cassandra is telling me. Each node has 256 vnodes (or tokens, as the
>>> preferred name seems to be). When I run `nodetool status` each node is
>>> reported as having 256 vnodes, regardless of how many nodes are in the
>>> cluster. A single node cluster has 256 vnodes on the single node, a six
>>> node cluster has 256 vnodes on each machine, making 1536 vnodes in total.
>>> When I run `SELECT tokens FROM system.peers` or `nodetool ring` each node
>>> lists 256 tokens.
>>>
>>> This is different from how it works in Riak and Voldemort, if I'm not
>>> mistaken, and that is the source of my confusion.
>>>
>>> T#
>>>
>>>
>>> On Mon, Jun 10, 2013 at 4:54 PM, Milind Parikh <milindparikh@gmail.com> wrote:
>>>
>>>> There are n vnodes regardless of the size of the physical cluster.
>>>> Regards
>>>> Milind
>>>> On Jun 10, 2013 7:48 AM, "Theo Hultberg" <theo@iconara.net> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The default number of vnodes is 256, is there any significance in this
>>>>> number? Since Cassandra's vnodes don't work like for example Riak's, where
>>>>> there is a fixed number of vnodes distributed evenly over the nodes, why so
>>>>> many? Even with a moderately sized cluster you get thousands of slices.
>>>>> Does this matter? If your cluster grows to over thirty machines and you
>>>>> start looking at ten thousand slices, would that be a problem? I guess that
>>>>> traversing a list of a thousand or ten thousand slices to find where a
>>>>> token lives isn't a huge problem, but are there any other up or downsides
>>>>> to having a small or large number of vnodes per node?
>>>>>
>>>>> I understand the benefits of splitting up the ring into pieces, for
>>>>> example to be able to stream data from more nodes when bootstrapping a new
>>>>> one, but that works even if each node only has say 32 vnodes (unless your
>>>>> cluster is truly huge).
>>>>>
>>>>> yours,
>>>>> Theo
>>>>>
>>>>
>>>
>>
>
--001a11c373600323c904dedb2305
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

I think he actually meant *increase*, for this reason: "For small T, a random choice of initial tokens will in most cases give a poor distribution of data. The larger T is, the closer to uniform the distribution will be, with increasing probability."

Alain
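
The quoted claim about small and large T can be checked with a small simulation. This is only a rough sketch under stated assumptions: the ring size `RING_SIZE`, the helper names `ownership_fractions` and `imbalance`, and the use of Python's `random` module are all illustrative choices, not Cassandra's partitioner or token allocation code. Each node draws T random tokens, and a token owns the range from the previous token on the ring up to itself.

```python
import random
from collections import defaultdict

# Illustrative token space only; not the range of any real Cassandra partitioner.
RING_SIZE = 2**64


def ownership_fractions(num_nodes, tokens_per_node, rng):
    """Return the fraction of the ring owned by each node under random tokens."""
    # Every node draws tokens_per_node (T in the thread) random tokens.
    ring = sorted(
        (rng.randrange(RING_SIZE), node)
        for node in range(num_nodes)
        for _ in range(tokens_per_node)
    )
    owned = defaultdict(int)
    # A token owns the range from the previous token up to itself;
    # the first token's range wraps around from the last token.
    prev_token = ring[-1][0] - RING_SIZE
    for token, node in ring:
        owned[node] += token - prev_token
        prev_token = token
    return [owned[node] / RING_SIZE for node in range(num_nodes)]


def imbalance(num_nodes, tokens_per_node, trials=20, seed=0):
    """Average ratio of the largest node share to the ideal share 1/N."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        fractions = ownership_fractions(num_nodes, tokens_per_node, rng)
        total += max(fractions) * num_nodes
    return total / trials


if __name__ == "__main__":
    # Compare how far the most loaded node is from a perfectly even split.
    for t in (1, 4, 32, 256):
        print(f"T={t:>3}: max share / ideal share = {imbalance(6, t):.2f}")
```

A ratio of 1.0 would mean a perfectly even split. In this sketch the ratio for a six-node ring should move towards 1.0 as T grows, which is the behaviour the quote describes.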


2013/6/11 Theo Hultberg <theo@iconara.net>
thanks, that makes sense, but I assume in your last sentence you mean decrease it for large clusters, not increase it?

T#


On Mon, Jun 10, 2013 at 11:02 PM, Richard Low <richard@wentnet.com> wrote:
Hi Theo,

The number (let's call it T and the number of nodes N) 256 was chosen to give good load balancing for random token assignments for most cluster sizes. For small T, a random choice of initial tokens will in most cases give a poor distribution of data. The larger T is, the closer to uniform the distribution will be, with increasing probability.

Also, for small T, when a new node is added, it won't have many ranges to split so won't be able to take an even slice of the data.

For this reason T should be large. But if it is too large, there are too many slices to keep track of as you say. The function to find which keys live where becomes more expensive and operations that deal with individual vnodes e.g. repair become slow. (An extreme example is SELECT * LIMIT 1, which when there is no data has to scan each vnode in turn in search of a single row. This is O(NT) and for even quite small T takes seconds to complete.)

So 256 was chosen to be a reasonable balance. I don't think most users will find it too slow; users with extremely large clusters may need to increase it.

Richard.
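
To make the "function to find which keys live where" concrete, here is a minimal sketch of a token-ring lookup. It assumes the ring is kept as a sorted list of all N*T tokens and searched with Python's `bisect` module; the names `build_ring` and `owner` are hypothetical and this is not taken from Cassandra's source. Under that assumption a single lookup costs O(log(NT)), while anything that has to visit every vnode in turn, like the empty-table scan mentioned above, stays O(NT).

```python
import bisect


def build_ring(node_tokens):
    """Build sorted parallel lists of tokens and owners.

    node_tokens maps a node name to the tokens (vnodes) it holds.
    """
    pairs = sorted(
        (token, node)
        for node, tokens in node_tokens.items()
        for token in tokens
    )
    return [token for token, _ in pairs], [node for _, node in pairs]


def owner(tokens, owners, key_token):
    """Return the node owning key_token.

    The owner is the node holding the first vnode token >= key_token,
    wrapping around to the first token on the ring if there is none.
    """
    i = bisect.bisect_left(tokens, key_token)
    return owners[i % len(tokens)]


if __name__ == "__main__":
    # Tiny hypothetical ring: two nodes with two tokens each.
    tokens, owners = build_ring({"a": [10, 50], "b": [30, 90]})
    for key in (5, 10, 11, 60, 95):
        print(key, "->", owner(tokens, owners, key))
```

The sorted list has to be rebuilt or updated whenever a node joins or leaves, so the bookkeeping still grows with the total number of vnodes even though each individual lookup stays cheap.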

On 10 June 2013 18:55, Theo Hultberg <theo@iconara.net> wrote:
I'm not sure I follow what you mean, or if I've misunderstood what Cassandra is telling me. Each node has 256 vnodes (or tokens, as the preferred name seems to be). When I run `nodetool status` each node is reported as having 256 vnodes, regardless of how many nodes are in the cluster. A single node cluster has 256 vnodes on the single node, a six node cluster has 256 vnodes on each machine, making 1536 vnodes in total. When I run `SELECT tokens FROM system.peers` or `nodetool ring` each node lists 256 tokens.

This is different from how it works in Riak and Voldemort, if I'm not mistaken, and that is the source of my confusion.

T#


On Mon, Jun 10, 2013 at 4:54 PM, Milind Parikh <milindparikh@gmail.com> wrote:

There are n vnodes regardless of the size of the physical cluster.
Regards
Milind

On Jun 10, 2013 7:48 AM, "Theo Hultberg" <theo@iconara.net> wrote:
Hi,

The default number of vnodes is 256, is there any significance in this number? Since Cassandra's vnodes don't work like for example Riak's, where there is a fixed number of vnodes distributed evenly over the nodes, why so many? Even with a moderately sized cluster you get thousands of slices. Does this matter? If your cluster grows to over thirty machines and you start looking at ten thousand slices, would that be a problem? I guess that traversing a list of a thousand or ten thousand slices to find where a token lives isn't a huge problem, but are there any other up or downsides to having a small or large number of vnodes per node?

I understand the benefits of splitting up the ring into pieces, for example to be able to stream data from more nodes when bootstrapping a new one, but that works even if each node only has say 32 vnodes (unless your cluster is truly huge).

yours,
Theo




--001a11c373600323c904dedb2305--