cassandra-commits mailing list archives

From "Ariel Weisberg (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)
Date Mon, 21 Dec 2015 18:38:48 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049151#comment-15049151
] 

Ariel Weisberg edited comment on CASSANDRA-8844 at 12/21/15 6:38 PM:
---------------------------------------------------------------------

I don't want to scope creep this ticket. I think this is heading in the right direction
in terms of deferring most of the functionality around consumption of CDC data and getting
a good initial implementation of buffering and writing the data.

I do want to splat somewhere my thoughts on the consumption side. VoltDB had a CDC feature
that went through several iterations over the years as we learned what did and didn't work.

The original implementation was a wire protocol that clients could connect to. The protocol
was a pain: the client had to be a distributed system with consensus in order to load balance
and fail over across multiple client instances, and the implementation we maintained for people
to plug into was a pain because it had to connect to all the nodes to acknowledge consumed
CDC data at replicas. All of this was without the benefit of already being a cluster member
with access to failure information. The clients also had to know far too much about cluster
internals and topology to do it well.

For the rewrite I ended up hosting CDC data processors inside the server. In practice this
is not as scary as it may sound to some. Most of the processors were written by us, and there
wasn't a ton they could do to misbehave without trying really hard, and if they did, that
was on them. It didn't end up being a support or maintenance headache, and I don't think we
had instances of the CDC processing destabilizing things.

You could make the data available over a socket as one of these processors; there was a JDBC
processor to insert into a database via JDBC, a Kafka processor to load data into
Kafka, one to load the data into another VoltDB instance, a processor that wrote the data
to local disk as CSV, etc.

The processor implemented by users didn't have to do anything to deal with failover and load
balancing of consuming data. The database hosting the processor would only pass data for a
given range on the hash ring to one processor at a time. When a processor acknowledged data
as committed downstream, the database transparently sent the acknowledgement to all replicas,
allowing them to release persisted CDC data. VoltDB runs ZooKeeper on top of VoltDB internally,
so this was pretty easy to implement inside VoltDB, but outside it would have been a pain.
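To make that ownership and acknowledgement scheme concrete, here is a rough sketch of the idea, not VoltDB's actual implementation; all names (`ExportCoordinator`, `Replica`) are hypothetical:

```python
# Illustrative sketch of the scheme described above: each hash-ring range is
# consumed by exactly one processor at a time, and an acknowledgement from the
# owning processor is fanned out to all replicas so every copy of the
# persisted CDC data can be released.

class Replica:
    def __init__(self, name):
        self.name = name
        self.retained = {}  # range_id -> list of persisted CDC segments

    def persist(self, range_id, segment):
        self.retained.setdefault(range_id, []).append(segment)

    def release_through(self, range_id, segment):
        # Drop everything up to and including the acknowledged segment.
        kept = self.retained.get(range_id, [])
        if segment in kept:
            self.retained[range_id] = kept[kept.index(segment) + 1:]

class ExportCoordinator:
    def __init__(self, replicas):
        self.replicas = replicas
        self.owner = {}  # range_id -> processor id; one owner at a time

    def assign(self, range_id, processor_id):
        # Only one processor may consume a given range at a time.
        self.owner.setdefault(range_id, processor_id)
        return self.owner[range_id] == processor_id

    def acknowledge(self, range_id, processor_id, segment):
        # Honor acks only from the current owner, then fan them out to
        # every replica so all copies can be reclaimed.
        if self.owner.get(range_id) != processor_id:
            return False
        for r in self.replicas:
            r.release_through(range_id, segment)
        return True
```

The point of the sketch is that the cluster, which already has membership and failure information, does the assignment and the fan-out; the processor only consumes and acks.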

The goal was that CDC data would never hit the filesystem, and that if it did hit the filesystem
it wouldn't hit disk if possible. Heap promotion and survivor copying had to be non-existent
to avoid having an impact on GC pause time. With TPC and buffering mutations before passing
them to the processors we had no problem getting data out at disk or line rate. Reclaiming
space ended up being file deletion, so that was cheap as well.



> Change Data Capture (CDC)
> -------------------------
>
>                 Key: CASSANDRA-8844
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Coordination, Local Write-Read Paths
>            Reporter: Tupshin Harper
>            Assignee: Joshua McKenzie
>            Priority: Critical
>             Fix For: 3.x
>
>
> "In databases, change data capture (CDC) is a set of software design patterns used to
determine (and track) the data that has changed so that action can be taken using the changed
data. Also, Change data capture (CDC) is an approach to data integration that is based on
the identification, capture and delivery of the changes made to enterprise data sources."
> -Wikipedia
> As Cassandra is increasingly being used as the Source of Record (SoR) for mission critical
data in large enterprises, it is increasingly being called upon to act as the central hub
of traffic and data flow to other systems. In order to try to address the general need, we
(cc [~brianmhess]), propose implementing a simple data logging mechanism to enable per-table
CDC patterns.
> h2. The goals:
> # Use CQL as the primary ingestion mechanism, in order to leverage its Consistency Level
semantics, and in order to treat it as the single reliable/durable SoR for the data.
> # To provide a mechanism for implementing good and reliable (deliver-at-least-once with
possible mechanisms for deliver-exactly-once) continuous semi-realtime feeds of mutations
going into a Cassandra cluster.
> # To eliminate the developmental and operational burden of users so that they don't have
to do dual writes to other systems.
> # For users that are currently doing batch export from a Cassandra system, give them
the opportunity to make that realtime with a minimum of coding.
> h2. The mechanism:
> We propose a durable logging mechanism that functions similar to a commitlog, with the
following nuances:
> - Takes place on every node, not just the coordinator, so RF number of copies are logged.
> - Separate log per table.
> - Per-table configuration. Only tables that are specified as CDC_LOG would do any logging.
> - Per DC. We are trying to keep the complexity to a minimum to make this an easy enhancement,
but most likely use cases would prefer to only implement CDC logging in one (or a subset)
of the DCs that are being replicated to.
> - In the critical path of ConsistencyLevel acknowledgment. Just as with the commitlog,
failure to write to the CDC log should fail that node's write. If that means the requested
consistency level was not met, then clients *should* experience UnavailableExceptions.
> - Be written in a Row-centric manner such that it is easy for consumers to reconstitute
rows atomically.
> - Written in a simple format designed to be consumed *directly* by daemons written in
non-JVM languages.
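> That "simple format" goal is what enables non-JVM consumption. Purely as an illustration (this ticket does not specify an on-disk format), a length-prefixed record layout can be read from any language with basic binary I/O; a Python sketch:

```python
import io
import struct

# Hypothetical record layout for illustration only: each record is a 4-byte
# big-endian length followed by that many payload bytes. Any language with
# basic binary I/O can consume such a log without a JVM.

def write_record(out, payload: bytes):
    out.write(struct.pack(">I", len(payload)))
    out.write(payload)

def read_records(stream):
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # clean end of log (or truncated tail; stop either way)
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) < length:
            return  # partial record at the tail; wait for more data
        yield payload
```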
> h2. Nice-to-haves
> I strongly suspect that the following features will be asked for, but I also believe
that they can be deferred for a subsequent release, and to gauge actual interest.
> - Multiple logs per table. This would make it easy to have multiple "subscribers" to
a single table's changes. A workaround would be to create a forking daemon listener, but that's
not a great answer.
> - Log filtering. Being able to apply filters, including UDF-based filters, would make
Cassandra a much more versatile feeder into other systems, and again, reduce complexity that
would otherwise need to be built into the daemons.
> h2. Format and Consumption
> - Cassandra would only write to the CDC log, and never delete from it. 
> - Cleaning up consumed logfiles would be the client daemon's responsibility.
> - Logfile size should probably be configurable.
> - Logfiles should be named with a predictable naming schema, making it trivial to process
them in order.
> - Daemons should be able to checkpoint their work, and resume from where they left off.
This means they would have to leave some file artifact in the CDC log's directory.
> - A sophisticated daemon should be able to be written that could
> -- Catch up, in written order, even when it is multiple logfiles behind in processing
> -- Be able to continuously "tail" the most recent logfile and get low-latency (ms?) access
to the data as it is written.
> h2. Alternate approach
> In order to make consuming a change log easy and efficient to do with low latency, the
following could supplement the approach outlined above
> - Instead of writing to a logfile, by default, Cassandra could expose a socket for a
daemon to connect to, and from which it could pull each row.
> - Cassandra would have a limited buffer for storing rows, should the listener become
backlogged, but it would immediately spill to disk in that case, never incurring large in-memory
costs.
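> That bounded buffer with spill-to-disk can be sketched as a toy illustration, assuming newline-free row payloads (the `SpillingBuffer` name and framing are made up here, not proposed):

```python
import collections
import tempfile

# Toy sketch of the alternate approach above: rows stay in a small in-memory
# buffer while the listener keeps up; once the buffer is full, rows
# immediately spill to disk so the in-memory cost stays bounded.

class SpillingBuffer:
    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.memory = collections.deque()
        self.spill = tempfile.TemporaryFile()
        self.spilled = 0

    def offer(self, row: bytes):
        if self.spilled == 0 and len(self.memory) < self.max_in_memory:
            self.memory.append(row)
        else:
            # Listener is backlogged: spill to disk, newline-delimited.
            # (Assumes rows contain no newlines; a real format would frame
            # records properly.)
            self.spill.write(row + b"\n")
            self.spilled += 1

    def drain(self):
        """Yield buffered rows in arrival order: memory first, then spill."""
        while self.memory:
            yield self.memory.popleft()
        if self.spilled:
            self.spill.seek(0)
            for line in self.spill:
                yield line.rstrip(b"\n")
            self.spill.seek(0)
            self.spill.truncate()
            self.spilled = 0
```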
> h2. Additional consumption possibility
> With all of the above, still relevant:
> - instead of (or in addition to) using the other logging mechanisms, use CQL transport itself
as a logger.
> - Extend the CQL protocol slightly so that rows of data can be returned to a listener
that didn't explicitly make a query, but instead registered itself with Cassandra as a listener
for a particular event type; in this case, the event type would be anything that would
otherwise go to a CDC log.
> - If there is no listener for the event type associated with that log, or if that listener
gets backlogged, the rows will again spill to the persistent storage.
> h2. Possible Syntax
> {code:sql}
> CREATE TABLE ... WITH CDC LOG
> {code}
> Pros: No syntax extensions
> Cons: Doesn't make it easy to capture the various permutations (I'm happy to be proven
wrong) of per-DC logging. Also, the hypothetical multiple logs per table would break this
> {code:sql}
> CREATE CDC_LOG mylog ON mytable WHERE MyUdf(mycol1, mycol2) = 5 with DCs={'dc1','dc3'}
> {code}
> Pros: Expressive and allows for easy DDL management of all aspects of CDC
> Cons: Syntax additions. Added complexity, partly for features that might not be implemented



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
