From: Fernando Jimenez <fernando.jimenez@wealth-port.com>
Subject: Re: Practical limit on number of column families
Date: Tue, 1 Mar 2016 15:11:45 +0100
To: user@cassandra.apache.org
Message-Id: <78A013D0-3EB7-4E62-95EA-F87A26959738@wealth-port.com>
Hi Jack

Being purposefully developed to only handle up to "a few hundred" tables is reason enough. I accept that, and likely a use case with many tables was never really considered. But I would still like to understand the design choices made, so perhaps we can gain some confidence in this upper limit on the number of tables. The best estimate we have so far is "a few hundred", which is a bit vague.

Regarding scaling, I'm not talking about scaling in terms of data volume, but in how the data is structured. One thousand tables with one row each hold the same data volume as one table with one thousand rows, excluding any data structures required to maintain the extra tables. But whereas the first seems likely to bring a Cassandra cluster to its knees, the second will run happily on a single-node cluster on a low-end machine.

We will design our code to use a single table to avoid having nightmares with this issue. But if there is any authoritative documentation on this characteristic of Cassandra, I would love to know more.

FJ

> On 01 Mar 2016, at 14:23, Jack Krupansky <jack.krupansky@gmail.com> wrote:
>
> I don't think there are any "reasons behind it." It is simply empirical experience - as reported here.
>
> Cassandra scales in two dimensions - number of rows per node and number of nodes. If some source of information led you to believe otherwise, please point out the source so that we can endeavor to correct it.
>
> The exact number of rows per node and tables per node will always have to be evaluated empirically - a proof-of-concept implementation - since it all depends on the capabilities of your hardware combined with your specific data model, your specific data values, your specific access patterns, and your specific load. And it also depends on your own personal tolerance for degradation of latency and throughput - some people might find a given set of performance metrics acceptable while others might not.
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez <fernando.jimenez@wealth-port.com> wrote:
> Hi Tommaso
>
> It's not that I _need_ a large number of tables. This approach maps easily to the problem we are trying to solve, but it's becoming clear it's not the right approach.
>
> At the moment I'm trying to understand the limitations in Cassandra regarding the number of tables, and the reasons behind them. I've come to the mailing list as my Google-fu is not giving me what I'm looking for :(
>
> FJ
>
>> On 01 Mar 2016, at 09:36, tommaso barbugli <tbarbugli@gmail.com> wrote:
>>
>> Hi Fernando,
>>
>> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, and it was a real pain in terms of operations. Repairs were terribly slow, boot of C* slowed down, and in general tracking table metrics becomes a bit more work. Why do you need this high number of tables?
>>
>> Tommaso
>>
>> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez <fernando.jimenez@wealth-port.com> wrote:
>> Hi Jack
>>
>> By entry I mean row.
>>
>> Apologies for the "obsolete terminology". When I first looked at Cassandra it was still on CQL2, and now that I'm looking at it again I've defaulted to the terms I already knew. I will bear it in mind and call them tables from now on.
>>
>> Is there any documentation about this limit? For example, I'd be keen to know how much memory is consumed per table, and I'm also curious about the reasons for keeping this in memory. I'm trying to understand the limitations here, rather than challenge them.
>>
>> So far I have found nothing in my search, hence why I had to resort to some "load testing" to see what happens when you push the table count high.
>>
>> Thanks
>> FJ
>>
>>> On 01 Mar 2016, at 06:23, Jack Krupansky <jack.krupansky@gmail.com> wrote:
>>>
>>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>>>
>>> You are using the obsolete terminology of CQL2 and Thrift - column family. With CQL3 you should be creating "tables". The practical recommendation of an upper limit of a few hundred tables across all keyspaces remains.
>>>
>>> Technically you can go higher, and technically you can reduce the overhead per table (an undocumented Jira - intentionally undocumented since it is strongly not recommended), but... it is unlikely that you will be happy with the results.
>>>
>>> What is the nature of the use case?
>>>
>>> You basically have two choices: an additional clustering column to distinguish categories of table, or a separate cluster for each few hundred tables.
>>>
>>> -- Jack Krupansky
>>>
>>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <fernando.jimenez@wealth-port.com> wrote:
>>> Hi all
>>>
>>> I have a use case for Cassandra that would require creating a large number of column families. I have found references to early versions of Cassandra where each column family would require a fixed amount of memory on all nodes, effectively imposing an upper limit on the total number of CFs. I have also seen rumblings that this may have been fixed in later versions.
>>>
>>> To put the question to rest, I have set up a DSE sandbox and created some code to generate column families populated with 3,000 entries each.
>>>
>>> Unfortunately I have now hit this issue: https://issues.apache.org/jira/browse/CASSANDRA-9291
>>>
>>> So I will have to retest against Cassandra 3.0 instead.
>>>
>>> However, I would like to understand the limitations regarding creation of column families:
>>>
>>> * Is there a practical upper limit?
>>> * Is this a fixed limit, or does it scale as more nodes are added to the cluster?
>>> * Is there a difference between one keyspace with thousands of column families, vs thousands of keyspaces with only a few column families each?
>>>
>>> I haven't found any hard evidence/documentation to help me here, but if you can point me in the right direction, I will oblige and RTFM away.
>>>
>>> Many thanks for your help!
>>>
>>> Cheers
>>> FJ
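[Editor's note] The single-table design the thread converges on - folding the would-be table name into the partition key instead of creating thousands of tables - can be sketched as below. All keyspace, table, and column names here are illustrative assumptions, not taken from the thread; the CQL is built as plain strings so no cluster is needed.

```python
# Sketch of the single-table alternative: one Cassandra table whose
# partition key carries the "logical table" name, instead of one real
# table per dataset. Names (app.items, dataset, item_id, payload) are
# hypothetical placeholders.

SCHEMA = """
CREATE TABLE IF NOT EXISTS app.items (
    dataset text,   -- the would-be table name, now part of the key
    item_id text,
    payload text,
    PRIMARY KEY ((dataset), item_id)
)
""".strip()

def insert_stmt(dataset: str, item_id: str, payload: str) -> str:
    """Build a (minimally escaped) INSERT for one row of a logical dataset."""
    def esc(s: str) -> str:
        return s.replace("'", "''")  # CQL doubles single quotes in literals
    return (
        "INSERT INTO app.items (dataset, item_id, payload) "
        f"VALUES ('{esc(dataset)}', '{esc(item_id)}', '{esc(payload)}')"
    )

if __name__ == "__main__":
    # A thousand logical "tables" become a thousand partitions of one table.
    print(SCHEMA)
    print(insert_stmt("customers_0042", "row-1", "hello"))
```

With this shape, queries for one logical dataset stay single-partition (`WHERE dataset = ?`), and the per-table fixed overhead the thread worries about is paid once.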
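[Editor's note] FJ asks how much memory each table consumes. A figure often quoted for Cassandra 2.x is on the order of 1 MB of heap per table per node for its memtable arena - that number is an assumption used here only for back-of-envelope illustration, not something stated in the thread; measure on your own cluster.

```python
# Rough per-node heap cost of table count alone, independent of row count.
# HEAP_PER_TABLE_MB is an assumed rule-of-thumb value, not a measured figure.

HEAP_PER_TABLE_MB = 1.0  # assumed fixed overhead per table, per node

def table_overhead_mb(num_tables: int,
                      per_table_mb: float = HEAP_PER_TABLE_MB) -> float:
    """Fixed heap consumed by table metadata/memtables on every node."""
    return num_tables * per_table_mb

# "A few hundred" tables costs a few hundred MB of heap on every node;
# thousands of one-row tables can eat a large slice of a typical 8 GB heap,
# while one table with thousands of rows pays the fixed cost exactly once.
print(table_overhead_mb(300))   # 300.0
print(table_overhead_mb(3000))  # 3000.0
```

This is why a thousand tables with one row each behaves so differently from one table with a thousand rows, even though the data volume is identical.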