cassandra-commits mailing list archives

From "Jonathan Ellis (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-7666) Range-segmented sstables
Date Wed, 13 Aug 2014 16:50:17 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-7666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-7666:
--------------------------------------

    Description: 
It would be useful to segment sstables by data range (not just token range as envisioned by
CASSANDRA-6696).

The primary use case is to allow deleting those data ranges for "free" by dropping the sstables
involved.  We should also (possibly as a separate ticket) be able to leverage this information
in query planning to avoid unnecessary sstable reads.
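
The two uses above can be pictured with a small sketch. It assumes each sstable records the
minimum and maximum value it holds for the segmenting column; the SSTable class, its fields,
and the helper functions are hypothetical names for illustration, not Cassandra internals.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SSTable:
    """Hypothetical stand-in for an sstable that holds one data range."""
    name: str
    min_day: date  # inclusive lower bound of the segmenting column
    max_day: date  # inclusive upper bound of the segmenting column


def overlaps(sstable, start, end):
    """True if the sstable's range intersects the inclusive range [start, end]."""
    return sstable.min_day <= end and start <= sstable.max_day


def sstables_to_read(sstables, start, end):
    """Query planning: skip sstables whose range cannot contain matching rows."""
    return [s for s in sstables if overlaps(s, start, end)]


def drop_range(sstables, start, end):
    """Deletion: discard sstables that lie fully inside [start, end]."""
    return [s for s in sstables if not (start <= s.min_day and s.max_day <= end)]


sstables = [
    SSTable("june", date(2014, 6, 1), date(2014, 6, 30)),
    SSTable("july", date(2014, 7, 1), date(2014, 7, 31)),
    SSTable("august", date(2014, 8, 1), date(2014, 8, 31)),
]
```

In this model a range delete only removes whole sstables from the list, and a read for a
given range consults only the sstables whose bounds overlap it.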

Relational databases typically call this "partitioning" the table, but obviously we use that
term already for something else: http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html

Tokutek's take for mongodb: http://docs.tokutek.com/tokumx/tokumx-partitioned-collections.html

  was:
It would be useful to segment sstables by data range (not just token range as envisioned by
CASSANDRA-6696).

The primary use case is to allow deleting those data ranges for "free" by dropping the sstables
involved.  (We may also be able to leverage this information in compaction.)

Relational databases typically call this "partitioning" the table, but obviously we use that
term already for something else: http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html

Tokutek's take for mongodb: http://docs.tokutek.com/tokumx/tokumx-partitioned-collections.html


PostgreSQL and (to a lesser degree) Tokutek both offer very general solutions.  That generality
can be useful, but it also means that new segments must be added manually, e.g. with a cron
job at the end of the month to create "next month's" partition.

What if instead we added a {{WITH SEGMENTING BY}} clause to {{CREATE TABLE}} that operated
on the data itself automatically?

{code}
CREATE TABLE stocks (
  exchange text, 
  day datetime, 
  day_time datetime,
  symbol text, 
  price int,
  volume int,
  tags set<text>,
  PRIMARY KEY ((exchange, day, symbol), day_time)
)
WITH SEGMENTING BY date_part('month', day);
{code}

Note that this would behave differently from manual partitioning in an important case: if
we drop the partition for July 2014 and later insert more data for that month, it will
just create a new segment to accommodate that.  With a postgresql CHECK or tokutek max, inserting
into a dropped partition would error out.  (I don't think either behavior is necessarily more
correct; it's just a difference to be aware of.)
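
The drop-then-reinsert behavior described above can be modelled with a toy in-memory table.
This is only a sketch under assumptions: the SegmentedTable class and its methods are
hypothetical, and segments are keyed by (year, month) to match the "July 2014" example.

```python
from collections import defaultdict
from datetime import date


class SegmentedTable:
    """Toy model: rows are grouped into segments by (year, month) of `day`."""

    def __init__(self):
        # Segments are created lazily on first write; nothing is pre-declared.
        self.segments = defaultdict(list)

    @staticmethod
    def segment_key(day):
        # Assumption: one segment covers one calendar month of one year.
        return (day.year, day.month)

    def insert(self, day, row):
        self.segments[self.segment_key(day)].append(row)

    def drop_segment(self, year, month):
        # Dropping discards the whole segment without visiting individual rows.
        self.segments.pop((year, month), None)


table = SegmentedTable()
table.insert(date(2014, 7, 15), {"symbol": "XYZ", "price": 100})
table.drop_segment(2014, 7)
# A later write for the dropped month recreates the segment instead of failing.
table.insert(date(2014, 7, 20), {"symbol": "XYZ", "price": 101})
```

After the second insert the table again has a July 2014 segment, containing only the row
written after the drop.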

> Range-segmented sstables
> ------------------------
>
>                 Key: CASSANDRA-7666
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7666
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>            Assignee: Sam Tunnicliffe
>             Fix For: 3.0
>
>
> It would be useful to segment sstables by data range (not just token range as envisioned
> by CASSANDRA-6696).
> The primary use case is to allow deleting those data ranges for "free" by dropping the
> sstables involved.  We should also (possibly as a separate ticket) be able to leverage this
> information in query planning to avoid unnecessary sstable reads.
> Relational databases typically call this "partitioning" the table, but obviously we use
> that term already for something else: http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
> Tokutek's take for mongodb: http://docs.tokutek.com/tokumx/tokumx-partitioned-collections.html



--
This message was sent by Atlassian JIRA
(v6.2#6252)
