incubator-cassandra-user mailing list archives

From Renat Gilfanov <gren...@mail.ru>
Subject data schema for hourly running analytics
Date Thu, 26 Sep 2013 17:13:38 GMT
Hello,

We have a column family that stores incoming requests, and we would like to perform some
analytics on that data using Hadoop. The results don't need to be realtime, but they should
be available fairly soon, within an hour or so.
So we store the current hour number (calculated from the request timestamp) in a
"partition_number" field with a secondary index on it.

Currently it looks like this (I've omitted a few columns to avoid unnecessary detail):

CREATE TABLE requests (
    request_id UUID PRIMARY KEY,
    partition_number INT,   -- hour number derived from the request timestamp
    payload ASCII
);

CREATE INDEX ON requests(partition_number);

Every hour we launch Hadoop jobs to process the data for the previous hour, so Hadoop
queries against the indexed "partition_number" column.
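Concretely, each job boils down to something like this (a sketch; the hour value is just an
example):

SELECT request_id, payload
FROM requests
WHERE partition_number = 383393;  -- example hour bucket (epoch seconds / 3600)
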
With several million rows I'm already seeing very poor performance on these queries, and I
realize that a secondary index on a high-cardinality field is a bad idea. However, I don't
see a good alternative so far.
I was considering creating a temporary column family every hour, writing the data there,
processing it with Hadoop the next hour, and then throwing it away. However, there is a
limitation: we need to keep the raw incoming data, since in the future we'll have to provide
new types of analytic reports.
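A sketch of that idea, with a hypothetical per-hour naming scheme:

-- one short-lived table per hour; Hadoop scans the whole table,
-- and the table is dropped once that hour has been processed
CREATE TABLE requests_2013_09_26_17 (
    request_id UUID PRIMARY KEY,
    payload ASCII
);
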

So my questions are the following:

1.  Is the approach of hourly Hadoop jobs solid for near-realtime analytics (where results
should be available within 1 hour), or is it better to look at Storm or something like that?
2.  What's the recommended data schema for storing events "sharded" by hour, so that they
can later be retrieved quickly by hour? (Assuming the hourly amount of data is too big to
fit in one wide row; see the sketch below for the kind of schema I mean.)
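For illustration, the kind of hour-sharded schema I mean would look something like this
(table, column, and shard-count choices are hypothetical):

-- the composite partition key spreads each hour across N shards
-- instead of one huge wide row
CREATE TABLE requests_by_hour (
    partition_number INT,   -- hour bucket derived from the timestamp
    shard INT,              -- e.g. request_id hashed into 0..N-1
    request_id UUID,
    payload ASCII,
    PRIMARY KEY ((partition_number, shard), request_id)
);

Retrieving one hour would then mean querying each shard in turn:

SELECT request_id, payload
FROM requests_by_hour
WHERE partition_number = 383393 AND shard = 0;  -- example values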


Thank you.


