CREATE TABLE requests (
    request_id UUID PRIMARY KEY,
    partition_number INT,
    payload ASCII
 );

CREATE INDEX ON requests(partition_number);
If reading all the requests in an hour is something you do frequently, then I strongly recommend modelling that with another table.

e.g. 

CREATE TABLE requests_by_hour (
    hour bigint, -- YYYYMMDDHH
    request_id UUID,
    partition_number INT,
    payload ASCII,
    PRIMARY KEY (hour, request_id)
);

Check how much data you have per hour and split it further if it is more than a few tens of MB.
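
One way to split it is to add a bucket to the partition key, as a variant of the requests_by_hour table above. This is only a sketch: the bucket column, its range, and how you derive it (e.g. a hash of the request_id modulo the number of buckets, computed at write time) are assumptions, not something in your current schema.

CREATE TABLE requests_by_hour (
    hour bigint,  -- YYYYMMDDHH
    bucket INT,   -- e.g. 0..9, derived from the request_id at write time
    request_id UUID,
    partition_number INT,
    payload ASCII,
    PRIMARY KEY ((hour, bucket), request_id)
);

The hourly job then reads all the buckets for an hour instead of one very wide row.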

Currently, with several million rows, I observe very poor performance for such queries, and realize that a secondary index on a field with high cardinality is a bad idea. However, I don't see good alternatives so far.
Above. 
There may also be an issue here with how Hadoop allocates jobs to nodes, as it tries to be smart about this based on the token assignments for the nodes.

I was considering creating a temp column family every hour, writing data there, processing it with Hadoop the next hour and throwing it away; however, there is a limitation - we need to store the raw incoming data, as in the future we'll have to provide new types of analytic reports.
This is where we say "denormalise to support reads". 
Store the raw requests as you were so there is a "database of record" sitting there. 
Denormalise / copy into a table that is tuned for the per-hour read case. If needed, use Deflate compression on that CF to reduce size; note that this will be slower.
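
A minimal sketch of what that can look like in CQL3 (the compression option name is the Cassandra 1.2-era syntax, and the UUID / hour values below are just placeholders):

-- enable Deflate compression on the per-hour copy
ALTER TABLE requests_by_hour
    WITH compression = { 'sstable_compression' : 'DeflateCompressor' };

-- on the write path, put each incoming request in both tables
BEGIN BATCH
    INSERT INTO requests (request_id, partition_number, payload)
        VALUES (11111111-2222-3333-4444-555555555555, 2013092709, 'raw request payload');
    INSERT INTO requests_by_hour (hour, request_id, partition_number, payload)
        VALUES (2013092709, 11111111-2222-3333-4444-555555555555, 2013092709, 'raw request payload');
APPLY BATCH;

The raw requests table stays as the database of record; the per-hour table is just a read-optimised copy.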
 
1. Is the approach with hourly Hadoop jobs solid for near-realtime analytics (when results should be available within 1 hour), or is it better to take a look at Storm or something like that?
Working on the data per hour (or some other measure) is something people often do. But you need to support the process in the data model as above. 
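
With the requests_by_hour table above, for example, the hourly job reads the whole hour with a single partition query instead of going through the secondary index (the hour value is just an example):

-- fetch everything for the 09:00 hour of 27 Sep 2013
SELECT request_id, partition_number, payload
FROM requests_by_hour
WHERE hour = 2013092709;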

People have also been using Storm with Cassandra.


2. What's the recommended data schema to store events "sharded" by hour, with the further possibility of quickly retrieving them by hour? (Assuming the hourly amount of data is big enough to fit in one wide row.)
Above and http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra

Cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 27/09/2013, at 5:13 AM, Renat Gilfanov <grennat@mail.ru> wrote:

Hello,

We have a column family which stores incoming requests, and we would like to perform some analytics on that data using Hadoop. The analytic results should be available pretty soon - not realtime, but within an hour or so.
So we store the current hour number (calculated from timestamp) as a "partition number" field with secondary index.

Currently it looks like this (I skipped a few columns to avoid unnecessary details):

CREATE TABLE requests (
    request_id UUID PRIMARY KEY,
    partition_number INT,
    payload ASCII
 );

CREATE INDEX ON requests(partition_number);

Every hour we launch Hadoop jobs to process the data for the previous hour, so Hadoop performs a query over the indexed "partition_number" column.
Currently, with several million rows, I observe very poor performance for such queries, and realize that a secondary index on a field with high cardinality is a bad idea. However, I don't see good alternatives so far.
I was considering creating a temp column family every hour, writing data there, processing it with Hadoop the next hour and throwing it away; however, there is a limitation - we need to store the raw incoming data, as in the future we'll have to provide new types of analytic reports.

So my questions are the following:

1. Is the approach with hourly Hadoop jobs solid for near-realtime analytics (when results should be available within 1 hour), or is it better to take a look at Storm or something like that?
2. What's the recommended data schema to store events "sharded" by hour, with the further possibility of quickly retrieving them by hour? (Assuming the hourly amount of data is big enough to fit in one wide row.)


Thank you.