incubator-drill-user mailing list archives

From Jacques Nadeau <jacques.dr...@gmail.com>
Subject Re: Getting plugged in... (Cassandra and Drill?)
Date Mon, 21 Jan 2013 19:23:28 GMT
Hey Brian,

Welcome to the list!

Here are some thoughts:

On Sun, Jan 20, 2013 at 8:37 PM, Brian O'Neill <bone@alumni.brown.edu> wrote:

> Last week, Brad Anderson came up and presented at the PhillyDB meetup.
> http://www.slideshare.net/boorad/phillydb-talk-beyond-batch
>
> He gave us an overview of Drill, and I'm curious...
>
> Presently, we heavily use Storm + Cassandra.
>
> http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html
>
> We treat CRUD operations as events. Then within Storm we calculate
> aggregate counts of entities flowing through the system by various
> dimensions.   That works well, but we still need an ad hoc reporting
> capability, and a way to report on data in the system that is not
> active (historical).
>
> Seems like a great use case for Drill.


> Would it be possible to use the Drill engine against a Cassandra backend?
> If so, what does that mean?   (implementing some API?)
>

Yes.  One of our goals is to have a defined storage engine API with
required and optional features to add new data sources.  In fact, we have
DRILL-16 which is dependent on DRILL-13 which specifically outlines this
goal.  DRILL-13 is the base API and DRILL-16 is the Cassandra
implementation.  Depending on your level of interest and time, we would
love to have some help on DRILL-13.
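
To make the "required and optional features" idea concrete, here is a minimal sketch in Python. It is not the DRILL-13 API, which has yet to be defined; every name in it (StorageEngine, scan, supports_projection, scan_columns, InMemoryEngine, read) is hypothetical and chosen only for illustration. The one required feature is a full scan, and the optional feature is column projection, which the caller falls back from when an engine does not offer it.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List

Row = Dict[str, object]


class StorageEngine(ABC):
    """Hypothetical storage engine contract (an assumption, not DRILL-13)."""

    @abstractmethod
    def scan(self, table: str) -> Iterator[Row]:
        """Required feature: yield every row of a table."""

    # Optional feature: the defaults report "unsupported" so the caller
    # can fall back to doing the projection itself.
    def supports_projection(self) -> bool:
        return False

    def scan_columns(self, table: str, columns: List[str]) -> Iterator[Row]:
        raise NotImplementedError("projection is not supported")


class InMemoryEngine(StorageEngine):
    """Toy engine that implements both the required and optional feature."""

    def __init__(self, tables: Dict[str, List[Row]]):
        self._tables = tables

    def scan(self, table: str) -> Iterator[Row]:
        return iter(self._tables[table])

    def supports_projection(self) -> bool:
        return True

    def scan_columns(self, table: str, columns: List[str]) -> Iterator[Row]:
        for row in self._tables[table]:
            yield {c: row.get(c) for c in columns}


def read(engine: StorageEngine, table: str, columns: List[str]) -> List[Row]:
    """Use the optional feature when present, else project after a full scan."""
    if engine.supports_projection():
        return list(engine.scan_columns(table, columns))
    return [{c: row.get(c) for c in columns} for row in engine.scan(table)]
```

A Cassandra implementation would be one more subclass; which optional features it could offer depends on the schema and the available indexes.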

>
> I assume that performance would be terrible unless somehow the data is
> stored using the columnar data format from the Dremel paper.  Is that
> accurate?  Does anyone know if anyone has attempted a translation of
> that format to Cassandra?
>
One of the visions behind Dremel and Drill is that full table scans are
okay.  Part of the reason is the compact format of the data and the fact
that you only read the columns a query needs.  I'd expect that for many
schema designs, in-situ querying of Cassandra could be pretty effective.
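
As a rough illustration of reading only the needed columns, here is a small Python sketch. The data and the names (columns, count_by) are made up for the example; it does not reflect the Dremel encoding or any Drill code, only the basic idea of a column-oriented layout.

```python
from collections import Counter

# Column-oriented layout: one list per column, aligned by position.
# The sample data is invented for illustration.
columns = {
    "id": [1, 2, 3],
    "state": ["PA", "NJ", "PA"],
    "document": ['{"name": "a"}', '{"name": "b"}', '{"name": "c"}'],
}


def count_by(columns, key):
    """Aggregate over a single column; the other columns are never read."""
    return Counter(columns[key])


counts = count_by(columns, "state")
```

Here the aggregate touches only the "state" list, so the (potentially large) "document" column costs nothing for this query.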

One of the things we've talked about is supporting caching transformations.
E.g. the first time you query a source, it may be automatically
reorganized in a more efficient format.  This works really well with HDFS's
write-once scheme.  It's harder with something like Cassandra, depending on
how you're using it.
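
A caching transformation could be sketched roughly as follows, in Python. This is only an assumption about one possible shape, not a Drill design: to_columns and ColumnarCache are hypothetical names, and the "version" argument stands in for whatever change marker a source can provide. With a write-once source the version never changes, so the conversion happens once; with a mutable source every change forces a new conversion, which is why Cassandra is the harder case.

```python
def to_columns(rows):
    """Reorganize a list of row dicts into one list per column."""
    names = sorted({key for row in rows for key in row})
    return {name: [row.get(name) for row in rows] for name in names}


class ColumnarCache:
    """Hypothetical cache of converted sources, keyed by source name."""

    def __init__(self):
        self._cache = {}      # source name -> (version, columns)
        self.conversions = 0  # how many times a source was reorganized

    def columns_for(self, name, version, rows):
        cached = self._cache.get(name)
        if cached is not None and cached[0] == version:
            return cached[1]  # source unchanged: reuse the converted form
        columns = to_columns(rows)
        self._cache[name] = (version, columns)
        self.conversions += 1
        return columns
```

For example, calling columns_for twice with the same version converts once, while a new version triggers a second conversion.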



> Regardless, I'm very interested in getting involved and no stranger to
> getting my hands dirty.
> Let me know if you can provide any direction. (our entities are
> currently stored in JSON in Cassandra)
>
>
As mentioned above, if you wanted to start a discussion and work on
DRILL-13, that would be very helpful.  Since we're still very much in alpha
development right now, another helpful item would be to document your rough
schema, available secondary indexes and example queries/needs on the wiki.
You could then translate those into Drill Logical plan syntax.  We could
use these as early test cases to ensure the system will support them
effectively.


Welcome,

Jacques



> -brian
>
>
> --
> Brian ONeill
> Lead Architect, Health Market Science (http://healthmarketscience.com)
> mobile:215.588.6024
> blog: http://brianoneill.blogspot.com/
> twitter: @boneill42
>
