Subject: Re: Many keyspaces pattern
From: Jack Krupansky
To: user@cassandra.apache.org
Date: Tue, 24 Nov 2015 07:49:02 -0500
In-Reply-To: <5654366A.1060204@edgelab.ch>

How often is "sometimes" - closer to 20% of the batches, or 2%?

How are you querying batches, both current and older ones?

As always, your queries should drive your data models.

If deleting a batch is very infrequent, it may be best not to delete it at all and simply have logic in the app to ignore deleted batches - if your queries would reference them at all.
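For what it's worth, a minimal sketch of that approach, assuming a small tracking table that the app consults when querying (the table and values below are hypothetical, not from the original post):

  -- Hypothetical tracking table for logically deleted batches.
  CREATE TABLE deleted_batches (
      batch_id int PRIMARY KEY
  );

  -- The app records a logical delete...
  INSERT INTO deleted_batches (batch_id) VALUES (42);

  -- ...then loads the (small) set of deleted batch ids,
  SELECT batch_id FROM deleted_batches;
  -- ...and filters those batch_ids out client-side when reading results.

The batch data stays on disk, but it never shows up in query results and there are no tombstones to compact away.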

What reasons would you have to delete a batch? Depending on the nature of the reason, there may be an alternative.

Make sure your cluster is adequately provisioned so that these expensive operations can occur in parallel, reducing their time and resource cost per node.

Do all batches eventually get aged out and deleted, or are you expecting that most batches will live for many years to come? Have you planned for how you will grow the cluster over time?

Maybe bite the bullet and use a background process to delete a batch if deletion is competing too heavily with query access - if batches really need to be deleted at all.

The number of keyspaces - and/or tables - should be limited to the "low hundreds", and even then you are limited by the RAM and CPU of each node. If a keyspace has 14 tables, then 250/14, or roughly 18, would be a recommended upper limit for the number of keyspaces. Even if your total number of tables was under 300 or even 200, you would need to do a proof-of-concept implementation to verify that your specific data works well on your specific hardware.
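As a quick sanity check on a running cluster, the schema tables will tell you how many tables are actually defined - a sketch, assuming Cassandra 3.0+ (older versions expose the same information in system.schema_columnfamilies):

  -- List and count the tables defined across all keyspaces (Cassandra 3.0+).
  SELECT keyspace_name, table_name FROM system_schema.tables;
  SELECT COUNT(*) FROM system_schema.tables;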


-- Jack Krupansky

On Tue, Nov 24, 2015 at 5:05 AM, Jonathan Ballet <jballet@edgelab.ch> wrote:
Hi,

we are running an application which produces, every night, a batch with several hundred gigabytes of data. Once a batch has been computed, it is never modified (no updates, no deletes); we just keep producing new batches every day.

Now, we are *sometimes* interested in removing one specific batch altogether. At the moment, we accumulate all this data into a single keyspace, and every table has a batch ID column which is also part of the primary key. A sample table looks similar to this:

  CREATE TABLE computation_results (
      batch_id int,
      id1 int,
      id2 int,
      value double,
      PRIMARY KEY ((batch_id, id1), id2)
  ) WITH CLUSTERING ORDER BY (id2 ASC);

But we found out it is very difficult to remove a specific batch, as we need to know all the IDs to delete the entries, and it is both time- and resource-consuming (i.e. it takes a long time and I'm not sure it will scale at all).
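Concretely, since batch_id is only part of the partition key, removing one batch means issuing a separate delete for every (batch_id, id1) partition - a sketch with made-up id values:

  -- Each delete must specify the full partition key; the id1 values
  -- (1001, 1002, ...) are made up here and must be known in advance.
  DELETE FROM computation_results WHERE batch_id = 42 AND id1 = 1001;
  DELETE FROM computation_results WHERE batch_id = 42 AND id1 = 1002;
  -- ...and so on, one statement per id1 belonging to the batch.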

So, we are currently looking into having each of our batches in a keyspace of its own, so that removing a batch is merely equivalent to deleting a keyspace. Potentially, this means we will end up having several hundred keyspaces in one cluster, although most of the time only the most recent one will be used (we might still want to access the older ones, but that would be a very seldom use case). At the moment, the keyspace has about 14 tables and is probably not going to evolve much.
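In that setup, creating and dropping a batch would look roughly like this (the keyspace name and replication settings are only placeholders):

  -- One keyspace per batch; name and replication options are placeholders.
  CREATE KEYSPACE batch_20151124
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

  -- ...create the ~14 tables inside it and load the batch's data...

  -- Removing the whole batch later is then a single schema operation:
  DROP KEYSPACE batch_20151124;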


Are there any counter-indications to using a lot of keyspaces (300+) in one Cassandra cluster?
Are there any good practices that we should follow?
After reading the "Anti-patterns in Cassandra > Too many keyspaces or tables" documentation, does it mean we should already plan to split our keyspace among several clusters?

Finally, would there be any other way to achieve what we want to do?

Thanks for your help!

Jonathan
