incubator-cassandra-user mailing list archives

From Aaron Morton <aa...@thelastpickle.com>
Subject Re: data schema for hourly runnning analytics
Date Fri, 27 Sep 2013 01:15:23 GMT
> CREATE TABLE requests (
>     request_id UUID PRIMARY KEY,
>     partition_number INT,
>     payload ASCII
>  );
> 
> CREATE INDEX ON requests(partition_number);
If reading all the requests in an hour is something you do frequently, then I strongly recommend
modelling that with another table.

e.g. 

CREATE TABLE requests_by_hour (
    hour bigint, -- YYYYMMDDHH
    request_id UUID,
    partition_number INT,
    payload ASCII,
    PRIMARY KEY (hour, request_id)
 );

Check how much data you have per hour and split the partition further if it is more than a few tens of MB.
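
e.g. a minimal sketch of splitting each hour into a fixed number of buckets; the bucket column
and the modulo scheme are hypothetical, the client would compute the bucket at write time:

CREATE TABLE requests_by_hour_bucketed (
    hour bigint,   -- YYYYMMDDHH
    bucket int,    -- e.g. hash(request_id) % 4, computed by the client
    request_id UUID,
    partition_number INT,
    payload ASCII,
    PRIMARY KEY ((hour, bucket), request_id)
 );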

> Currently having several million rows I observe very poor performance of such queries,
and realize that secondary index on the field with high cardinality is a bad idea. However,
I don't see good alternatives so far.
Above. 
There may also be an issue here with how Hadoop allocates jobs to nodes, as it tries to be smart
about this based on the token assignments for the nodes.

> I was considering creating a temp column family every hour, write data there, process
it with Hadoop next hour and throw it away, however there is a limitation - we need to store
the raw incoming data, as in the future we'll have to provide new types of analytic reports.
This is where we say "denormalise to support reads". 
Store the raw requests as you were so there is a "database of record" sitting there. 
Denormalise / copy into a table that is tuned for the per-hour read case. If needed, use Deflate
compression on that CF to reduce size; note that this will be slower.
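
e.g. a sketch of enabling Deflate on the denormalised table, using the CQL compression
options of this Cassandra generation:

ALTER TABLE requests_by_hour
  WITH compression = { 'sstable_compression' : 'DeflateCompressor' };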
 
> 1.  Is the approach with hourly running Hadoop jobs solid for near-realtime analytics (when
results should be available within 1 hour), or is it better to take a look at Storm or something
like that?
Working on the data per hour (or some other measure) is something people often do. But you
need to support the process in the data model as above. 
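
e.g. the hourly job can then read a whole hour as a single partition from the requests_by_hour
table above (the hour value below is just an illustration):

SELECT request_id, partition_number, payload
  FROM requests_by_hour
 WHERE hour = 2013092700;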

People have also been using Storm with Cassandra.


> 2.  What's the recommended  data schema to store events "sharded" by hour, with further
possibility to quickly retrieve them by hour? (assuming the hourly amount of data is big enough
to fit in one wide row.)
Above and http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra

Cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 27/09/2013, at 5:13 AM, Renat Gilfanov <grennat@mail.ru> wrote:

> Hello,
> 
> We have a column family which stores incoming requests, and we would like to perform
some analytics  on that data using Hadoop. The analytic results should be available pretty
soon, not realtime, but within an hour or so. 
> So we store the current hour number (calculated from timestamp) as a "partition number"
field with secondary index.
> 
> Currently it looks like this (I skipped a few columns to avoid unnecessary details):
> 
> CREATE TABLE requests (
>     request_id UUID PRIMARY KEY,
>     partition_number INT,
>     payload ASCII
>  );
> 
> CREATE INDEX ON requests(partition_number);
> 
> Every hour we launch Hadoop jobs to process data for previous hour, so Hadoop performs
query over the indexed "partition_number" column.
> Currently having several million rows I observe very poor performance of such queries,
and realize that secondary index on the field with high cardinality is a bad idea. However,
I don't see good alternatives so far.
> I was considering creating a temp column family every hour, write data there, process
it with Hadoop next hour and throw it away, however there is a limitation - we need to store
the raw incoming data, as in the future we'll have to provide new types of analytic reports.
> 
> So my questions are the following:
> 
> 1.  Is the approach with hourly running Hadoop jobs solid for near-realtime analytics (when
results should be available within 1 hour), or is it better to take a look at Storm or something
like that?
> 2.  What's the recommended  data schema to store events "sharded" by hour, with further
possibility to quickly retrieve them by hour? (assuming the hourly amount of data is big enough
to fit in one wide row.)
> 
> 
> Thank you.
> 
> 
> 

