We have a column family which stores incoming requests, and we would like to perform some analytics on that data using Hadoop. The analytic results should be available pretty soon, not realtime, but within an hour or so.
So we store the current hour number (calculated from timestamp) as a "partition number" field with secondary index.
Currently it looks like this (I skipped a few columns to avoid unnecessary details):
CREATE TABLE requests (
request_id UUID PRIMARY KEY,
CREATE INDEX ON requests(partition_number);
Every hour we launch Hadoop jobs to process data for previous hour, so Hadoop performs query over the indexed "partition_number" column.
Currently having several million rows I observe very poor performance of such queries, and realize that secondary index on the field with high cardinality is a bad idea. However, I don't see good alternatives so far.
I was considering creating a temp column family every hour, write data there, process it with Hadoop next hour and throw it away, however there is a limitation - we need to store the raw incoming data, as in the future we'll have to provide new types of analytic reports.
So my questions are the following:
1. Does the approach with hourly running Hadoop jobs is solid for the near-realtime analytics (when results should be available within 1 hour), or it's better to take a look at Storm and something like that?
2. What's the recommended data schema to store events "sharded" by hour, with further possibility to quickly retrieve them by hour? (assuming the hourly amount of data is big enough to fit in one wide row.)