cassandra-user mailing list archives

From Jack Krupansky <>
Subject Re: Advice for asymmetric reporting cluster architecture
Date Sun, 18 Oct 2015 02:12:33 GMT
Yes, you can have all your normal data centers with DSE configured for
real-time data access and then have a data center that shares the same data
but has DSE Search (Solr indexing) enabled. Your Cassandra data will get
replicated to the Search data center and then indexed there and only there.
You will need more RAM on the DSE Search nodes for the indexing, and
possibly more nodes as well to ensure decent latency for complex queries.
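To make the topology concrete, replication per data center is set at the keyspace level, so the Search DC simply appears as one more entry. A minimal sketch, assuming a keyspace named my_app and placeholder DC names (dc_east, dc_west, dc_europe, search_dc) with illustrative replication factors:

```sql
-- Hypothetical keyspace spanning three OLTP data centers plus a
-- dedicated DSE Search DC. All names and RFs here are placeholders.
ALTER KEYSPACE my_app WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'dc_east': 3,
  'dc_west': 3,
  'dc_europe': 3,
  'search_dc': 2
};
```

With this layout, OLTP clients should use LOCAL_QUORUM/LOCAL_ONE consistency and a DC-aware load-balancing policy so their reads and writes stay off the Search DC, while Solr indexing happens only on the nodes running the search workload.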

-- Jack Krupansky

On Sat, Oct 17, 2015 at 3:54 PM, Mark Lewis <> wrote:

> I hadn't considered it because I didn't think it could be configured just
> for a single data center; can it?
> On Oct 17, 2015 8:50 AM, "Jack Krupansky" <>
> wrote:
>> Did you consider DSE Search in a DC?
>> -- Jack Krupansky
>> On Sat, Oct 17, 2015 at 11:30 AM, Mark Lewis <> wrote:
>>> I've got an existing C* cluster spread across three data centers, and
>>> I'm wrestling with how to add some support for ad-hoc user reporting
>>> against (ideally) near real-time data.
>>> The type of reports I want to support basically boil down to allowing
>>> the user to select a single highly-denormalized "Table" from a predefined
>>> list, pick some filters (ideally with arbitrary boolean logic), project out
>>> some columns, and allow for some simple grouping and aggregation.  I've
>>> seen several companies expose reporting this way and it seems like a good
>>> way to avoid the complexity of joins while still providing a good deal of
>>> flexibility.
>>> Has anybody done this or have any recommendations?
>>> My current thinking is that I'd like to have the ad-hoc reporting
>>> infrastructure in separate data centers from our active production
>>> OLTP-type stuff, both to isolate any load away from the OLTP infrastructure
>>> and also because I'll likely need other stuff there (Spark?) to support
>>> ad-hoc reporting.
>>> So I basically have two problems:
>>> (1) Get an eventually-consistent view of the data into a data center I
>>> can query against relatively quickly (so no big batch imports)
>>> (2) Be able to run ad-hoc user queries against it
>>> If I just think about query flexibility, I might consider dumping data
>>> into PostgreSQL nodes (practical because the data that any individual user
>>> can query will fit onto a single node).  But then I have the problem of
>>> getting the data there; I looked into an architecture using Kafka to pump
>>> data from the OLTP data centers to PostgreSQL mirrors, but down that road
>>> lies the need to manually deal with the eventual consistency.  Ugh.
>>> If I just run C* nodes in my reporting cluster that makes the problem of
>>> getting the data into the right place with eventual consistency easy to
>>> solve and I like that idea quite a lot, but then I need to run reporting
>>> against C*.  I could make the queries I need to run reasonably performant
>>> with enough secondary-indexes or materialized views (we're upgrading to 3.0
>>> soon), but I would need a lot of secondary-indexes and materialized views,
>>> and I'd rather not pay to store them in all of my data centers.  I wish
>>> there were a way to define secondary-indexes or materialized views to only
>>> exist in one DC of a cluster, but unless I've missed something it doesn't
>>> look possible.
>>> Any advice or case studies in this area would be greatly appreciated.
>>> -- Mark
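One partial answer to the per-DC wish in the quoted message: while secondary indexes and materialized views cannot be scoped to a single DC (they live in the same keyspace as their base table), replication itself is per keyspace, so a reporting-only keyspace can be confined to the reporting DC. A sketch, assuming placeholder names (reporting, reporting_dc):

```sql
-- Sketch: a keyspace replicated only to the reporting DC. Tables,
-- indexes, and materialized views defined here are stored nowhere else.
-- Keyspace and DC names are placeholders.
CREATE KEYSPACE reporting WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'reporting_dc': 3
};
```

The caveat is that this only helps for denormalized tables you populate into the reporting keyspace yourself (e.g. from a Spark job in that DC); it does not let you attach DC-local indexes or views to existing OLTP tables.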
