Subject: Re: Many keyspaces pattern
From: Jack Krupansky
To: user@cassandra.apache.org
Date: Tue, 24 Nov 2015 15:06:01 -0500

And DateTieredCompactionStrategy can be used to efficiently remove whole
sstables when the TTL expires, but this implies knowing what TTL to set in
advance.

I don't know if there are any tools to bulk delete data older than a specific
age when DateTieredCompactionStrategy is used, but it might be a nice feature.

-- Jack Krupansky

On Tue, Nov 24, 2015 at 12:53 PM, Saladi Naidu wrote:

> I can think of the following features to solve this:
>
> 1. If you know after how long the data should be removed, use the TTL
> feature.
> 2. Model the data as a time series and use an inverted index to query the
> data by time period.
>
> Naidu Saladi
>
> On Tuesday, November 24, 2015 6:49 AM, Jack Krupansky
> <jack.krupansky@gmail.com> wrote:
>
> How often is "sometimes" - closer to 20% of the batches, or 2%?
>
> How are you querying batches, both current and older ones?
>
> As always, your queries should drive your data models.
>
> If deleting a batch is very infrequent, it may be best not to do it and
> simply have logic in the app to ignore deleted batches - if your queries
> would reference them at all.
>
> What reasons would you have to delete a batch? Depending on the nature of
> the reason, there may be an alternative.
>
> Make sure your cluster is adequately provisioned so that these expensive
> operations can occur in parallel, to reduce their time and resources per
> node.
>
> Do all batches eventually get aged out and deleted, or are you expecting
> that most batches will live for many years to come? Have you planned for
> how you will grow the cluster over time?
>
> Maybe bite the bullet and use a background process to delete a batch if
> deletion is competing too heavily with query access - if the batches really
> need to be deleted at all.
>
> The number of keyspaces - and/or tables - should be limited to the "low
> hundreds", and even then you are limited by the RAM and CPU of each node.
> If a keyspace has 14 tables, then 250/14, roughly 18, would be a
> recommended upper limit for the number of keyspaces. Even if your total
> number of tables was under 300, or even 200, you would need to do a
> proof-of-concept implementation to verify that your specific data works
> well on your specific hardware.
>
> -- Jack Krupansky
>
> On Tue, Nov 24, 2015 at 5:05 AM, Jonathan Ballet <jballet@edgelab.ch>
> wrote:
>
> Hi,
>
> we are running an application which produces a batch with several hundreds
> of gigabytes of data every night. Once a batch has been computed, it is
> never modified (no updates, no deletes); we just keep producing new
> batches every day.
>
> Now, we are *sometimes* interested in removing a complete specific batch
> altogether. At the moment, we are accumulating all these data into only
> one keyspace, which has a batch ID column in all our tables that is also
> part of the primary key. A sample table looks similar to this:
>
>   CREATE TABLE computation_results (
>       batch_id int,
>       id1 int,
>       id2 int,
>       value double,
>       PRIMARY KEY ((batch_id, id1), id2)
>   ) WITH CLUSTERING ORDER BY (id2 ASC);
>
> But we found out it is very difficult to remove a specific batch, as we
> need to know all the IDs to delete the entries, and it's both time- and
> resource-consuming (i.e. it takes a long time and I'm not sure it's going
> to scale at all).
>
> So, we are currently looking into having each of our batches in a keyspace
> of its own, so that removing a batch is merely equivalent to deleting a
> keyspace. Potentially, this means we will end up having several hundreds
> of keyspaces in one cluster, although most of the time only the very last
> one will be used (we might still want to access the older ones, but that
> would be a very seldom use-case). At the moment, the keyspace has about 14
> tables and is probably not going to evolve much.
>
> Are there any counter-indications to using lots of keyspaces (300+) in one
> Cassandra cluster?
> Are there any good practices that we should follow?
> After reading the "Anti-patterns in Cassandra > Too many keyspaces or
> tables" documentation, does it mean we should plan ahead to split our
> keyspace among several clusters?
>
> Finally, would there be any other way to achieve what we want to do?
>
> Thanks for your help!
>
> Jonathan
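To make the two suggestions in this thread concrete, here is a rough CQL
sketch (the keyspace name and retention period are made-up illustrations,
not a tested setup). One keyspace per batch makes removal a single DROP,
while a default TTL plus DateTieredCompactionStrategy lets expired data be
dropped as whole sstables - but only if the retention period is known up
front:

  -- Keyspace-per-batch: dropping a batch is one statement.
  -- (Hypothetical name: batch_20151124.)
  CREATE KEYSPACE batch_20151124
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

  -- Alternative: keep one keyspace, but give the table a default TTL and
  -- DTCS so expired rows can age out as whole sstables.
  CREATE TABLE computation_results (
      batch_id int,
      id1 int,
      id2 int,
      value double,
      PRIMARY KEY ((batch_id, id1), id2)
  ) WITH CLUSTERING ORDER BY (id2 ASC)
    AND default_time_to_live = 2592000  -- 30 days; assumed retention period
    AND compaction = {'class': 'DateTieredCompactionStrategy'};

  -- Removing a whole batch under the keyspace-per-batch approach:
  DROP KEYSPACE batch_20151124;

Note that the TTL variant only works if you know in advance how long batches
must be kept, as pointed out at the top of the thread.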