kudu-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adar Lieber-Dembo <a...@cloudera.com>
Subject Re: Change Data Capture (CDC) with Kudu
Date Fri, 22 Sep 2017 17:43:24 GMT

Thanks for the detailed description of your problem.

I'm afraid there's no such mechanism in Kudu today. Mining the WALs seems
like a path fraught with land mines. Kudu GCs WAL segments aggressively so
I'd be worried about a listening mechanism missing out on some row
operations. Plus the WAL is Raft-specific as it includes both REPLICATE
messages (reflecting a Write RPC from a client) and COMMIT messages
(written out when a majority of replicas have written a REPLICATE); parsing
and making sense of this would be challenging. Perhaps you could build
something using Linux's inotify system for receiving file change
notifications, but again I'd be worried about missing certain updates.

Another option is to replicate the data at the OS level. For example, you
could periodically rsync the entire cluster onto a standby cluster. There's
bound to be data loss in the event of a failover, but I don't think you'll
run into any corruption (though Kudu does take advantage of sparse files
and hole punching, so you should verify that any tool you use supports

Disaster Recovery is an oft-requested feature, but one that Kudu developers
have been unable to prioritize yet. Would you or your someone on your team
be interested in working on this?

On Thu, Sep 21, 2017 at 7:12 PM Franco Venturi <fventuri@comcast.net> wrote:

> We are planning for a 50-100TB Kudu installation (about 200 tables or so).
> One of the requirements that we are working on is to have a secondary copy
> of our data in a Disaster Recovery data center in a different location.
> Since we are going to have inserts, updates, and deletes (for instance in
> the case the primary key is changed), we are trying to devise a process
> that will keep the secondary instance in sync with the primary one. The two
> instances do not have to be identical in real-time (i.e. we are not looking
> for synchronous writes to Kudu), but we would like to have some pretty good
> confidence that the secondary instance contains all the changes that the
> primary has up to say an hour before (or something like that).
> So far we considered a couple of options:
> - refreshing the seconday instance with a full copy of the primary one
> every so often, but that would mean having to transfer say 50TB of data
> between the two locations every time, and our network bandwidth constraints
> would prevent to do that even on a daily basis
> - having a column that contains the most recent time a row was updated,
> however this column couldn't be part of the primary key (because the
> primary key in Kudu is immutable), and therefore finding which rows have
> been changed every time would require a full scan of the table to be
> sync'd. It would also rely on the "last update timestamp" column to be
> always updated by the application (an assumption that we would like to
> avoid), and would need some other process to take into accounts the rows
> that are deleted.
> Since many of today's RDBMS (Oracle, MySQL, etc) allow for some sort of
> 'Change Data Capture' mechanism where only the 'deltas' are captured and
> applied to the secondary instance, we were wondering if there's any way in
> Kudu to achieve something like that (possibly mining the WALs, since my
> understanding is that each change gets applied to the WALs first).
> Thanks,
> Franco Venturi

View raw message