cassandra-user mailing list archives

From Ryan Svihla
Subject Re: Replicating Cassandra data to HDFS
Date Tue, 09 Aug 2016 18:26:57 GMT

You know, I've not actually spent the hour to read the ticket, so I was just guessing it didn't
handle dedup. All the same semantics apply: you'd have to do a read before write and
then allow some window of failure mode. Maybe if you used LWTs for everything, but that sounds really
slow. I'd be curious to hear your thoughts on how to do that well; maybe I'm missing something.

Ryan Svihla

On Aug 9, 2016, 1:13 PM -0500, Jonathan Haddad wrote:
> I'm having a hard time seeing how anyone would be able to work with CDC in its current
implementation of not doing any dedupe. Unless you really want to write all your own logic
for that, including failure handling plus a distributed state machine, I wouldn't count on it as
a solution.
> On Tue, Aug 9, 2016 at 10:49 AM Ryan Svihla wrote:
> > You can follow the monster of a ticket
and see if it looks like the tradeoffs there are headed in the right direction for you.
> >
> > Even CDC, I think, would logically have the same issue as triggers and dual writes of not
deduping for you, due to replication factor and consistency level issues. Otherwise
you'd be stuck doing an all-replica comparison when a late event came in, and when a node was
down, what would you do then? What if one replica got it as well and then came online much
later? Even if you were using a single-source-of-truth style database, you'll find failover
has a way of losing late events anyway (due to async replication), not to mention that once you
go multi-DC it's all a matter of which DC you're in.
> >
> > Anyway, for the cold storage I think a trailing window that is just greater than
your oldest accepted events would do it, i.e. if you choose to only accept events up to 30 days out,
then run cold storage over a trailing 32 days. At some point there is no free lunch, as you point out,
when replicating between two data sources: CDC, triggers, really anything that marks a "new event"
will have the same problem, and you'll have to choose an acceptable level of lateness or check for lateness.
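
A minimal sketch of the trailing-window idea above, assuming daily time buckets and assuming the batch job re-copies every bucket that is still inside the window. The 30 and 32 day figures come from the message; the names `is_accepted` and `trailing_buckets` are hypothetical and only for illustration.

```python
from datetime import date, timedelta

MAX_LATENESS_DAYS = 30     # oldest event age still accepted (from the thread)
TRAILING_WINDOW_DAYS = 32  # window re-copied to cold storage (from the thread)


def is_accepted(event_day: date, today: date,
                max_lateness_days: int = MAX_LATENESS_DAYS) -> bool:
    """Accept an event only if it is at most max_lateness_days old."""
    age_days = (today - event_day).days
    return 0 <= age_days <= max_lateness_days


def trailing_buckets(today: date,
                     window_days: int = TRAILING_WINDOW_DAYS) -> list[date]:
    """Daily buckets the batch job would re-copy, oldest first, ending today."""
    return [today - timedelta(days=offset)
            for offset in range(window_days - 1, -1, -1)]
```

Because the window is slightly wider than the accepted lateness, every bucket that can still receive an accepted event is re-copied at least once after that event lands.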
> >
> > Alternatively, you can just accept duplication and handle it on the cold storage read side
(like an event sourcing pattern; this would be ideal if the lateness is uncommon) or clean it
up over time in cold storage as it's detected (similar to an event sourcing pattern, but snapshotting
data down to a single record when you encounter it on a read).
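
As a rough illustration of handling duplication on the read side, the sketch below collapses duplicate events to a single record per key, keeping the copy with the newest write timestamp. The `Event` fields (`event_id`, `written_at`, `payload`) are hypothetical stand-ins for whatever primary key and timestamp the real data carries, and last-write-wins is an assumed resolution rule, not something the thread prescribes.

```python
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    event_id: str    # hypothetical primary key used for de-duplication
    written_at: int  # hypothetical write timestamp (e.g. epoch millis)
    payload: str


def dedupe_on_read(events: Iterable[Event]) -> list[Event]:
    """Collapse duplicates to one record per event_id, newest write wins."""
    latest: dict[str, Event] = {}
    for event in events:
        current = latest.get(event.event_id)
        # Keep the first copy seen unless a strictly newer one shows up.
        if current is None or event.written_at > current.written_at:
            latest[event.event_id] = event
    return sorted(latest.values(), key=lambda e: e.event_id)
```

Writing the result of `dedupe_on_read` back over the duplicated records would correspond to the "snapshot down to a single record" variant; returning it without writing back is the pure read-side variant.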
> >
> > Best of luck. This is a corner case that requires hard tradeoffs in every technology
I've encountered.
> >
> > Regards,
> > Ryan Svihla
> >
> >
> >
> > On Aug 9, 2016, 12:21 PM -0500, Ben Vogan wrote:
> > > Thanks Ryan. I was hoping there was a change data capture framework. We have
late arriving events, some of which can be very late. We would have to batch collect data
for a large time period every so often to go back and collect those or accept that we are
going to lose a small percentage of events. Neither of which is ideal.
> > >
> > > On Tue, Aug 9, 2016 at 10:30 AM, Ryan Svihla wrote:
> > > > The typical pattern I've seen in the field is Kafka + consumers for each
destination (a variant of dual write, I know); this of course would not work for your goal of
relying on C* for dedup. Triggers would also suffer from the same problem, unfortunately, so you're
really left with a batch job (most likely Spark) to move data from C* into HDFS on a given
interval. If this is really a cold storage use case, that can work quite well, especially assuming
you've modeled your data as a time series or with some sort of time-based bucketing, so you
can quickly get full partitions of data out of C* in a deterministic fashion and not have to
scan your entire data set.
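
To make the time-bucketing point concrete, here is a small sketch of how a batch job might enumerate the buckets for one export interval, so it can request whole partitions by key rather than scanning. Hourly buckets and the `YYYY-MM-DDTHH` label format are assumptions for illustration; the actual bucket size and key layout depend on the data model.

```python
from datetime import datetime, timedelta


def hourly_buckets(start: datetime, end: datetime) -> list[str]:
    """Labels for every hourly bucket that overlaps the interval [start, end)."""
    # Round the start down to the top of its hour so partial hours are included.
    cursor = start.replace(minute=0, second=0, microsecond=0)
    labels = []
    while cursor < end:
        labels.append(cursor.strftime("%Y-%m-%dT%H"))
        cursor += timedelta(hours=1)
    return labels
```

Each label would then be used as (part of) a partition key in the export query, which keeps the set of partitions read per run deterministic for a given interval.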
> > > >
> > > > For similar needs I've also seen Spark Streaming + querying Cassandra
for duplication checks to dedup and then output to another source (a form of dual write but with
dedup); this was really silly and slow. I only bring it up to save you the trouble in case
you end up on the same path chasing something more 'real time'.
> > > >
> > > > Regards,
> > > > Ryan Svihla
> > > >
> > > >
> > > > On Aug 9, 2016, 11:09 AM -0500, Ben Vogan wrote:
> > > > > Hi all,
> > > > >
> > > > > We are investigating using Cassandra in our data platform. We would
like data to go into Cassandra first and to eventually be replicated into our data lake in
HDFS for long-term cold storage. Does anyone know of a good way of doing this? We would rather
not have parallel writes to HDFS and Cassandra, because we were hoping that we could use Cassandra
primary keys to de-duplicate events.
> > > > >
> > > > > Thanks,
> > > > > --
> > > > >
> > > > > BENJAMIN VOGAN | Data Platform Team Lead
> > > > > shopkick
> > > > > The indispensable app that rewards you for shopping.
> > >
> > >
> > > --
> > >
> > > BENJAMIN VOGAN | Data Platform Team Lead
> > > shopkick
> > > The indispensable app that rewards you for shopping.