cassandra-user mailing list archives

From Brian O'Neill <>
Subject Re: Cassandra to Oracle?
Date Sun, 22 Jan 2012 12:42:26 GMT

Thanks for all the ideas...

Since we can't predict all the values, we actually cut to Oracle today via a map/reduce job.
 Oracle is able to support all the ad hoc queries the users want (via Indexes), but the extract
job takes a long time (hours).  The users need more "real-time", which is driving us to look
at other alternatives, or better extract methods. (HDFS -> BulkLoad, JDBC, etc.)

We also have SOLR in place, which is indexing all the information.  That can satisfy more than
40-50% of the queries, especially with the FieldGrouping, and some other features available
in 4.0:

But there are still cases that SOLR can't handle, because it has a flat document structure
and we need to query on multiple dimensions.

Eric, we were just about to head down the path you suggested, when we started seeing how heavy
the client-side code was going to get for inserts. (something we wanted to keep simple)  Also,
as I said, we aren't sure what attributes we'll be storing/querying, so some of the queries
we'll never be able to accommodate. Regardless, based on your comments, I'm going to
take another look at using composite keys and counters.
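
For reference, the write pattern Eric describes below (A[Z][Y] = 1 alongside each write to X)
could be sketched roughly as follows. This is an in-memory Python stand-in, not Cassandra client
code; the class, the method names, and the sample columns ("state", "specialty") are made up
for illustration.

```python
from collections import defaultdict


class DenormalizedStore:
    """In-memory stand-in for two column families: X (source rows) and A (index)."""

    def __init__(self):
        self.x = {}                 # X[row_key] -> {column: value}
        self.a = defaultdict(dict)  # A[Z value][Y value] -> 1

    def write(self, row_key, row, y_col, z_col):
        """Write the row to X and, in the same call, the marker A[Z][Y] = 1."""
        self.x[row_key] = dict(row)
        if y_col in row and z_col in row:
            # Writing the same (Z, Y) pair twice just overwrites the same column,
            # so each row of A ends up holding the distinct Y values for that Z.
            self.a[row[z_col]][row[y_col]] = 1

    def count_distinct_y_by_z(self):
        """Client-side answer to: select count(distinct Y) from X group by Z."""
        return {z: len(columns) for z, columns in self.a.items()}


def demo_counts():
    store = DenormalizedStore()
    store.write("r1", {"state": "PA", "specialty": "cardiology"}, "specialty", "state")
    store.write("r2", {"state": "PA", "specialty": "cardiology"}, "specialty", "state")
    store.write("r3", {"state": "PA", "specialty": "oncology"}, "specialty", "state")
    store.write("r4", {"state": "NJ", "specialty": "cardiology"}, "specialty", "state")
    return store.count_distinct_y_by_z()
```

Note the sketch only works for (Y, Z) pairs chosen at write time, which is exactly the limitation
above, and it leaves stale markers in A if a row in X is later overwritten with different values.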

Another approach may be REAL-TIME data replication...

We started looking at a "real-time" solution that would keep Oracle up to date with Cassandra
using Triggers.  Effectively we would use Cassandra as our transactional system (OLTP) and
leave Oracle in place for OLAP.  Looks like others have looked at exactly this model:

And there's been lots of discussion...

And mention that the crew was going to start working on it after 1.0:

But I didn't see anything in trunk, and I didn't get any response from the dev list.

Alas, we may pick it up this week and implement it. (maybe as part of Virgil)

If we use a column family to keep a distributed commit log of mutations, it should be a fairly
easy thing to get triggers in place.  Really, the only question is where we code it.  We could
implement it in the Cassandra code as a patch, or implement it on top.  I think we might
be able to do it using AOP, which would allow anyone to get the functionality just by dropping
another jar onto the classpath.
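
A rough sketch of the idea, assuming every mutation is also appended to a log that a separate
replicator drains and applies to the downstream system. Everything here is hypothetical: the
list stands in for the log column family, the in-process sequence number stands in for whatever
ordering a real cluster would need (probably a time-based UUID), and the sink callback stands
in for a JDBC write to Oracle.

```python
import itertools


class LoggedStore:
    """In-memory stand-in for a client wrapper that logs every mutation it performs."""

    def __init__(self):
        self.data = {}        # data[cf][row_key] -> {column: value}
        self.commit_log = []  # stand-in for the mutation-log column family
        self._seq = itertools.count()

    def insert(self, cf, row_key, columns):
        # Apply the mutation, then record it so a trigger/replicator can replay it.
        self.data.setdefault(cf, {}).setdefault(row_key, {}).update(columns)
        self.commit_log.append(
            {"seq": next(self._seq), "cf": cf, "key": row_key, "columns": dict(columns)}
        )


class Replicator:
    """Drains logged mutations, in order, into a sink (e.g. a relational database)."""

    def __init__(self, store, sink):
        self.store = store
        self.sink = sink
        self.next_seq = 0  # first log entry not yet applied downstream

    def drain(self):
        pending = [m for m in self.store.commit_log if m["seq"] >= self.next_seq]
        for mutation in pending:
            self.sink(mutation)
            self.next_seq = mutation["seq"] + 1
        return len(pending)


def demo_replication():
    store = LoggedStore()
    applied = []
    replicator = Replicator(store, applied.append)
    store.insert("X", "r1", {"state": "PA"})
    store.insert("X", "r2", {"state": "NJ"})
    first = replicator.drain()
    store.insert("X", "r1", {"specialty": "oncology"})
    second = replicator.drain()
    return first, second, [m["key"] for m in applied]
```

Wrapping insert() this way is the "on top" option; an AOP aspect or a server-side patch would
hook the same point without changing client code.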

I'll see what we can come up with.

thanks again,

On Jan 21, 2012, at 8:35 AM, Eric Czech wrote:

> Hi Brian,
> We're trying to do the exact same thing and I find myself asking very similar questions.
> Our solution though has been to find what kind of queries we need to satisfy on a preemptive
> basis and leverage cassandra's built-in indexing features to build those result sets beforehand.
> The whole point here then is that our gain in cost efficiency comes from the fact that disk
> space is really cheap and serving up result sets from disk is fast provided that those result
> sets are pre-calculated and reasonable in size (even if we don't know all the values upfront).
> For example, when you're writing to your CF "X", you could also make writes to column family
> "A" like this:
> - write A[Z][Y] = 1
> where A = CF, Z = key, Y = column
> Answering the question "select count(distinct Y) from X group by Z" then is as simple
> as getting a list of rows for CF A and counting the distinct values of Y and grouping them
> by Z on the client side.
> Alternatively, there are much better ways to do this with composite keys/columns and
> distributed counters but it's hard for me to tell what makes the most sense without knowing
> more about your data / product requirements.
> Either way, I feel your pain in getting things like this to work with Cassandra when
> the domain of values for a particular key or column is unknown and secondary indexing doesn't
> apply, but I'm positive there's a much cheaper way to make it work than paying for Oracle
> if you have at least a decent idea about what kinds of queries you need to satisfy (which
> it sounds like you do).  To Maxim's "death by index" point, you could certainly go overboard
> with this concept and cross a pricing threshold with some other database technology, but I
> can't imagine you're even close to being in that boat given how concise your query needs seem
> to be.
> If you're interested, I'd be happy to share how we do these things to save lots of money
> over commercial databases and try to relate that to your use case, but if not, then I hope
> at least some of this is useful for you.
> Good luck either way!
> On Fri, Jan 20, 2012 at 9:27 PM, Maxim Potekhin <> wrote:
> I certainly agree with "difficult to predict". There is a Danish
> proverb, which goes "it's difficult to make predictions, especially
> about the future".
> My point was that it's equally difficult with noSQL and RDBMS.
> The latter requires indexing to operate well, and that's a potential
> performance problem.
> On 1/20/2012 7:55 PM, Mohit Anchlia wrote:
> I think the problem arises when you have data in a column that you need
> to run ad hoc queries on and that is not denormalized. In most cases it's
> difficult to predict the type of query that would be required.
> Another way of solving this could be to index the fields in a search engine.
> On Fri, Jan 20, 2012 at 7:37 PM, Maxim Potekhin<>  wrote:
> What makes you think that RDBMS will give you acceptable performance?
> I guess you will try to index it to death (because otherwise the "ad hoc"
> queries won't work well if at all), and at this point you may be hit with a
> performance penalty.
> It may be a good idea to interview users and build denormalized views in
> Cassandra, maybe on a separate "look-up" cluster. A few percent of users
> will be unhappy, but you'll find it hard to do better. I'm talking from my
> experience with an industrial strength RDBMS which doesn't scale very well
> for what you call "ad-hoc" queries.
> Regards,
> Maxim
> On 1/20/2012 9:28 AM, Brian O'Neill wrote:
> I can't remember if I asked this question before, but....
> We're using Cassandra as our transactional system, and building up quite a
> library of map/reduce jobs that perform data quality analysis, statistics,
> etc.
> (>  100 jobs now)
> But... we are still struggling to provide an "ad-hoc" query mechanism for
> our users.
> To fill that gap, I believe we still need to materialize our data in an
> Anyone have any ideas?  Better ways to support ad-hoc queries?
> Effectively, our users want to be able to select count(distinct Y) from X
> group by Z.
> Where Y and Z are arbitrary columns of rows in X.
> We believe we can create column families with different key structures
> (using Y and Z as row keys), but some column names we don't know / can't
> predict ahead of time.
> Are people doing bulk exports?
> Anyone trying to keep an RDBMS in synch in real-time?
> -brian
> --
> Brian ONeill
> Lead Architect, Health Market Science (
> mobile:215.588.6024
> blog:
> blog:

Brian ONeill
Lead Architect, Health Market Science (
