cassandra-commits mailing list archives

From "Jonathan Ellis (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-7890) LCS and time series data
Date Fri, 05 Sep 2014 21:59:28 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123668#comment-14123668 ]

Jonathan Ellis commented on CASSANDRA-7890:
-------------------------------------------

bq. I'm curious about the historical choice to order data on disk by token and not key.

Ordering by token means that when you add new nodes, you stream contiguous ranges.
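
To illustrate the point, here is a minimal sketch, not Cassandra code. It assumes a stand-in hash (`token`, built on MD5 rather than Murmur3), a made-up set of 200 keys, and a hypothetical token range `(lo, hi]` handed to a new node. It compares where the rows to be streamed sit in a file sorted by token versus a file sorted by key.

```python
import hashlib


def token(key: str) -> int:
    # Stand-in hash for illustration only; Murmur3Partitioner uses a different function.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def positions_to_stream(sorted_rows, lo, hi):
    """Indexes (in on-disk order) of rows whose token falls in (lo, hi]."""
    return [i for i, (_key, tok) in enumerate(sorted_rows) if lo < tok <= hi]


def is_contiguous(indexes):
    """True if the indexes form one unbroken run (or the list is empty)."""
    return all(b - a == 1 for a, b in zip(indexes, indexes[1:]))


# Hypothetical data set: 200 partition keys with their tokens.
rows = [(k, token(k)) for k in (f"user{i:03d}" for i in range(200))]

by_token = sorted(rows, key=lambda r: r[1])  # on-disk order by token
by_key = sorted(rows, key=lambda r: r[0])    # hypothetical on-disk order by key

# Hypothetical token range taken over by a new node.
lo, hi = 2**62, 2**63

token_idx = positions_to_stream(by_token, lo, hi)
key_idx = positions_to_stream(by_key, lo, hi)

print("token order contiguous:", is_contiguous(token_idx))
print("key order contiguous:  ", is_contiguous(key_idx))
```

With token ordering the rows to stream always form a single slice of the file. With key ordering the same rows would usually be scattered across it, so streaming would need many seeks or a full scan.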

> LCS and time series data
> ------------------------
>
>                 Key: CASSANDRA-7890
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7890
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Dan Hendry
>             Fix For: 3.0
>
>
> Consider the following very typical schema for bucketed time series data:
> {noformat}
> CREATE TABLE user_timeline (
> 	ts_bucket bigint,
> 	username varchar,
> 	ts timeuuid,
> 	data blob,
> 	PRIMARY KEY ((ts_bucket, username), ts))
> {noformat}
> If you have a single Cassandra node (or a cluster where RF = N) and use the ByteOrderedPartitioner,
LCS becomes *ridiculously*, *obscenely* efficient. Under a typical workload where data is
inserted in order, compaction IO could be reduced to *near zero* as sstable ranges don't overlap
(with a trivial change to LCS so sstables with no overlap are not rewritten when being promoted
into the next level). Better yet, we don't _require_ ordered data insertion. Even if insertion
order is completely random, you still get standard LCS performance characteristics, which are
usually acceptable (although I believe there are a few degenerate compaction cases which are
not handled in the current implementation). A quick benchmark using vanilla Cassandra 2.0.10
(i.e. no rewrite optimization) shows a *77% reduction in compaction IO* when switching from
the Murmur3Partitioner to the ByteOrderedPartitioner.
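
The "no overlap, no rewrite" idea above can be sketched as follows. This is a simplified model, not the LCS implementation: `SSTable`, `overlaps`, and `promote` are hypothetical names, each sstable is reduced to its first and last key, and a real compaction would split merged output into fixed-size sstables rather than produce a single one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    first: str  # smallest key in the sstable, in on-disk sort order
    last: str   # largest key in the sstable


def overlaps(a: SSTable, b: SSTable) -> bool:
    """Two closed key ranges overlap if each starts before the other ends."""
    return a.first <= b.last and b.first <= a.last


def promote(candidate: SSTable, next_level: list) -> tuple:
    """Promote candidate into next_level; return (new_level, sstables_rewritten)."""
    overlapping = [t for t in next_level if overlaps(candidate, t)]
    if not overlapping:
        # Nothing to merge with: move the file into the level without rewriting it.
        return sorted(next_level + [candidate], key=lambda t: t.first), 0
    # Otherwise the candidate and everything it overlaps must be rewritten.
    group = overlapping + [candidate]
    merged = SSTable(min(t.first for t in group), max(t.last for t in group))
    remaining = [t for t in next_level if t not in overlapping]
    return sorted(remaining + [merged], key=lambda t: t.first), len(group)


level1 = [SSTable("a", "f"), SSTable("g", "m")]

# Data inserted in key order lands past the end of the level: no rewrite.
in_order_level, in_order_cost = promote(SSTable("n", "z"), level1)

# Data that straddles existing sstables forces a normal compaction.
random_level, random_cost = promote(SSTable("e", "h"), level1)
```

In this model, in-order inserts cost zero rewrites on promotion, while out-of-order inserts fall back to the usual merge.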
> The obvious problem is, of course, that using an order preserving partitioner is a Very
Bad idea when N > RF. Using an OPP for time series data ordered by time is utter lunacy.
> It seems to me that one solution is to split apart the roles of the partitioner so that
data distribution across the cluster and data ordering on disk can be controlled independently.
Ideally, on-disk ordering could be set per CF. I'm curious about the historical choice to order
data on disk by token and not key. Randomized (hashed key ordered) distribution across the
cluster is obviously a good idea, but natural key ordering on disk seems like it would have a
number of advantages:
> * Better read performance and file system page cache efficiency for any workload which
accesses certain ranges of row keys more frequently than others (this applies to _many_ use
cases beyond time series data).
> * I can't think of a realistic workload where CRUD operations would be noticeably less
performant when using natural instead of hash ordering. 
> * Better compression ratios (although probably only for skinny rows).
> * Range based truncation becomes feasible.
> * Ordered range scans might be feasible to implement even with random cluster distribution.
> The only things I can think of which could suffer when using different cluster and disk
ordering are bootstrap and repair. Although I have no evidence, the massive potential performance
gains certainly still seem to be worth it.
> Thoughts? This approach seems to be fundamentally different from other tickets related
to improving time series data (CASSANDRA-6602, CASSANDRA-5561) which focus only on new or
modified compaction strategies. By changing data sort order, existing compaction strategies
can be made significantly more efficient without imposing new, restrictive, and use case specific
limitations on the user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
