cassandra-user mailing list archives

From Jack Krupansky <jack.krupan...@gmail.com>
Subject Re: Many keyspaces pattern
Date Tue, 24 Nov 2015 20:06:01 GMT
And DateTieredCompactionStrategy can be used to efficiently remove whole
sstables once all of their data has passed its TTL, but this implies knowing
what TTL to set in advance.
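
For instance, a minimal sketch using the table from the message below, with
a purely illustrative 30-day TTL:

   CREATE TABLE computation_results (
       batch_id int,
       id1 int,
       id2 int,
       value double,
       PRIMARY KEY ((batch_id, id1), id2)
   ) WITH CLUSTERING ORDER BY (id2 ASC)
     AND compaction = {'class': 'DateTieredCompactionStrategy'}
     AND default_time_to_live = 2592000;  -- 30 days, in seconds; illustrative only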

I don't know of any tools to bulk-delete data older than a specific age
when DateTieredCompactionStrategy is used, but it might be a nice feature.

-- Jack Krupansky

On Tue, Nov 24, 2015 at 12:53 PM, Saladi Naidu <naidusp2002@yahoo.com>
wrote:

> I can think of the following approaches:
>
> 1. If you know in advance how long the data should be retained, use the
> TTL feature (see the sketch below).
> 2. Perhaps model the data as a time series and use an inverted index to
> query it by time period.
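>
> A minimal sketch of option 1, using the computation_results table from the
> original message (the values and the 30-day TTL are purely illustrative):
>
>   INSERT INTO computation_results (batch_id, id1, id2, value)
>   VALUES (42, 1, 1, 0.5)
>   USING TTL 2592000;  -- expire this row after 30 days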
>
> Naidu Saladi
>
>
>
> On Tuesday, November 24, 2015 6:49 AM, Jack Krupansky <
> jack.krupansky@gmail.com> wrote:
>
>
> How often is "sometimes" - closer to 20% of the batches, or 2%?
>
> How are you querying batches, both current and older ones?
>
> As always, your queries should drive your data models.
>
> If deleting a batch is very infrequent, it may be best not to delete at
> all and simply have logic in the app to ignore deleted batches - if your
> queries would reference them at all.
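>
> One way to do that (a hypothetical sketch - the deleted_batches table and
> its name are my own invention, not an established pattern): keep a small
> table of logically deleted batch IDs and have the app check it before
> serving results.
>
>   -- Hypothetical lookup table of logically deleted batch IDs:
>   CREATE TABLE deleted_batches (
>       batch_id int PRIMARY KEY
>   );
>
>   -- The app treats any batch listed here as gone:
>   SELECT batch_id FROM deleted_batches WHERE batch_id = 42;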
>
> What reasons would you have to delete a batch? Depending on the nature of
> the reason, there may be an alternative.
>
> Make sure your cluster is adequately provisioned so that these expensive
> operations can run in parallel, reducing the time and resources consumed
> per node.
>
> Do all batches eventually get aged and deleted or are you expecting that
> most batches will live for many years to come? Have you planned for how you
> will grow the cluster over time?
>
> Maybe bite the bullet and use a background process to delete a batch if
> deletion is competing too heavily with query access - if they really need
> to be deleted at all.
>
> Number of keyspaces - and/or tables - should be limited to the "low
> hundreds", and even then you are limited by the RAM and CPU of each node.
> If a keyspace has 14 tables, then 250/14 ≈ 18 would be a recommended upper
> limit on the number of keyspaces. Even if your total number of tables was
> under 300, or even 200, you would need to do a proof-of-concept
> implementation to verify that your specific data works well on your
> specific hardware.
>
>
> -- Jack Krupansky
>
> On Tue, Nov 24, 2015 at 5:05 AM, Jonathan Ballet <jballet@edgelab.ch>
> wrote:
>
> Hi,
>
> we are running an application which produces a batch of several hundred
> gigabytes of data every night. Once a batch has been computed, it is never
> modified (no updates, no deletes); we just keep producing new batches
> every day.
>
> Now, we *sometimes* want to remove one specific batch in its entirety. At
> the moment, we accumulate all this data in a single keyspace, and every
> table has a batch ID column that is part of the primary key. A sample
> table looks similar to this:
>
>   CREATE TABLE computation_results (
>       batch_id int,
>       id1 int,
>       id2 int,
>       value double,
>       PRIMARY KEY ((batch_id, id1), id2)
>   ) WITH CLUSTERING ORDER BY (id2 ASC);
>
> But we found out it is very difficult to remove a specific batch, as we
> need to know all the IDs to delete the entries, and it is both time- and
> resource-consuming (i.e., it takes a long time and I'm not sure it will
> scale at all).
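>
> To illustrate with the table above (values are examples): since batch_id
> is only one part of the composite partition key, a delete must name every
> id1 in the batch; Cassandra rejects a delete by batch_id alone.
>
>   -- Works, but requires knowing every id1 in the batch:
>   DELETE FROM computation_results WHERE batch_id = 42 AND id1 = 1;
>
>   -- Rejected: the partition key is incomplete:
>   DELETE FROM computation_results WHERE batch_id = 42;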
>
> So, we are currently looking into putting each batch in a keyspace of its
> own, so that removing a batch merely amounts to dropping a keyspace.
> Potentially, this means we will end up with several hundred keyspaces in
> one cluster, although most of the time only the most recent one will be
> used (we might still want to access the older ones, but that would be a
> very rare use-case). At the moment, the keyspace has about 14 tables and
> is probably not going to evolve much.
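>
> Roughly what we have in mind (the keyspace name and replication settings
> below are made up for illustration):
>
>   -- Hypothetical per-batch keyspace:
>   CREATE KEYSPACE batch_20151124
>       WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
>
>   -- Removing the whole batch later is then a single statement:
>   DROP KEYSPACE batch_20151124;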
>
>
> Are there any contraindications to using a lot of keyspaces (300+) in one
> Cassandra cluster?
> Are there any good practices we should follow?
> After reading "Anti-patterns in Cassandra > Too many keyspaces or tables",
> does it mean we should plan ahead to split our keyspace among several
> clusters?
>
> Finally, would there be any other way to achieve what we want to do?
>
> Thanks for your help!
>
>  Jonathan
>
