cassandra-user mailing list archives

From Mark Lewis <m...@lewisworld.org>
Subject Advice for asymmetric reporting cluster architecture
Date Sat, 17 Oct 2015 15:30:21 GMT
I've got an existing C* cluster spread across three data centers, and I'm
wrestling with how to add some support for ad-hoc user reporting against
(ideally) near real-time data.

The reports I want to support basically boil down to allowing the
user to select a single highly-denormalized "Table" from a predefined list,
pick some filters (ideally with arbitrary boolean logic), project out some
columns, and allow for some simple grouping and aggregation.  I've seen
several companies expose reporting this way and it seems like a good way to
avoid the complexity of joins while still providing a good deal of
flexibility.
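
For concreteness, here is a rough Python sketch of the kind of report request
I have in mind: one flat table, a nested boolean filter, and a simple
group-and-sum. Everything in it (the `matches` and `run_report` helpers, the
tuple-based filter syntax, the sample rows) is made up for illustration and
not taken from any existing library.

```python
from collections import defaultdict

def matches(row, expr):
    """Evaluate a nested boolean filter expression against one row (a dict)."""
    op = expr[0]
    if op == "and":
        return all(matches(row, e) for e in expr[1:])
    if op == "or":
        return any(matches(row, e) for e in expr[1:])
    if op == "not":
        return not matches(row, expr[1])
    if op == "eq":
        return row[expr[1]] == expr[2]
    if op == "gt":
        return row[expr[1]] > expr[2]
    raise ValueError("unknown operator: %r" % (op,))

def run_report(rows, where, group_by, sum_column):
    """Filter rows, group on the given columns, and sum one column per group."""
    totals = defaultdict(int)
    for row in rows:
        if matches(row, where):
            key = tuple(row[c] for c in group_by)
            totals[key] += row[sum_column]
    return dict(totals)

# Hypothetical denormalized "table" and one ad-hoc report against it.
rows = [
    {"region": "us", "status": "paid", "amount": 10},
    {"region": "us", "status": "open", "amount": 5},
    {"region": "eu", "status": "paid", "amount": 7},
    {"region": "eu", "status": "paid", "amount": 3},
]
where = ("or", ("eq", "status", "paid"), ("gt", "amount", 100))
result = run_report(rows, where, ["region"], "amount")
```

The point is only that a report is a small declarative spec (table, filter
tree, group-by columns, aggregate), with no joins involved.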

Has anybody done this or have any recommendations?

My current thinking is that I'd like to have the ad-hoc reporting
infrastructure in separate data centers from our active production
OLTP-type stuff, both to isolate any load away from the OLTP infrastructure
and also because I'll likely need other stuff there (Spark?) to support
ad-hoc reporting.

So I basically have two problems:
(1) Get an eventually-consistent view of the data into a data center I can
query against relatively quickly (so no big batch imports)
(2) Be able to run ad-hoc user queries against it

If I just think about query flexibility, I might consider dumping data into
PostgreSQL nodes (practical because the data that any individual user can
query will fit onto a single node).  But then I have the problem of getting
the data there; I looked into an architecture using Kafka to pump data from
the OLTP data centers to PostgreSQL mirrors, but down that road lies the
need to manually deal with the eventual consistency.  Ugh.
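
To illustrate what dealing with the eventual consistency by hand would
involve on the mirror side, here is a minimal Python sketch of last-write-wins
merging for change events that may arrive out of order or more than once. The
event shape (`key`, `ts`, `value`, `deleted`) and the `apply_change` helper
are hypothetical, and an in-memory dict stands in for the mirror table.

```python
def apply_change(mirror, change):
    """Apply one change event, keeping only the newest write per key.

    Returns True if the event was applied, False if it was stale or a
    duplicate. Deletes are kept as tombstones so that a late-arriving
    older write cannot resurrect a deleted row.
    """
    key = change["key"]
    current = mirror.get(key)
    if current is not None and current["ts"] >= change["ts"]:
        return False
    mirror[key] = {
        "ts": change["ts"],
        "value": change.get("value"),
        "deleted": change.get("deleted", False),
    }
    return True

# Hypothetical event stream with a late write, a delete, and a duplicate.
events = [
    {"key": "a", "ts": 2, "value": "new"},
    {"key": "a", "ts": 1, "value": "old"},   # arrives late, ignored
    {"key": "b", "ts": 1, "value": "x"},
    {"key": "b", "ts": 3, "deleted": True},
    {"key": "b", "ts": 2, "value": "y"},     # older than the delete, ignored
    {"key": "a", "ts": 2, "value": "new"},   # duplicate delivery, ignored
]
mirror = {}
applied = [apply_change(mirror, e) for e in events]
visible = {k: v["value"] for k, v in mirror.items() if not v["deleted"]}
```

Even this toy version needs per-row timestamps and tombstones, which is
roughly the bookkeeping C* already does for me.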

If I just run C* nodes in my reporting cluster, that makes the problem of
getting the data into the right place with eventual consistency easy to
solve, and I like that idea quite a lot, but then I need to run reporting
against C*.  I could make the queries I need to run reasonably performant
with enough secondary indexes or materialized views (we're upgrading to 3.0
soon), but I would need a lot of them, and I'd rather not pay to store them
in all of my data centers.  I wish there were a way to define secondary
indexes or materialized views that only exist in one DC of a cluster, but
unless I've missed something it doesn't look possible.

Any advice or case studies in this area would be greatly appreciated.

-- Mark
