cassandra-user mailing list archives

From Peter Harrison <cheetah...@gmail.com>
Subject Re: anyone using Cassandra as an analytics/data warehouse?
Date Wed, 05 Jan 2011 02:38:36 GMT
Okay, here are two ways to handle this. They are quite different from each
other.


A)

This approach does not depend on counters. You simply have a Column Family
whose row key is the Unix time in seconds divided by 3600 (60 x 60), i.e. the
number of the hour since the epoch, and whose column keys are... pretty much
anything unique, one column per request. Then have another process look at
the current row every hour to compile the numbers and store the count in the
same Column Family. This will solve the first and third use cases, as it is
just a matter of looking at the right rows. The second case will require a
similar index, but one whose row key has a country code appended.
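
A rough sketch of approach A, in Python, with a plain dict standing in for
the Column Family so it runs without a cluster. The names (record_view,
compile_hour, views_between), the "bucket:country" key format and the
reserved "count" column are all assumptions made for illustration, not
anything Cassandra defines.

```python
from collections import defaultdict
import uuid

SECONDS_PER_HOUR = 60 * 60

# In-memory stand-in for a Column Family: row key -> {column name: value}.
column_family = defaultdict(dict)

def hour_bucket(unix_time):
    """Unix time in seconds divided by 60x60: hours since the epoch."""
    return int(unix_time) // SECONDS_PER_HOUR

def row_key(bucket, country=None):
    """Plain hour key, or the hour key with a country code appended."""
    return str(bucket) if country is None else f"{bucket}:{country}"

def record_view(unix_time, url, country):
    """Store one uniquely named column per request, in both indexes."""
    bucket = hour_bucket(unix_time)
    column = str(uuid.uuid4())
    column_family[row_key(bucket)][column] = url
    column_family[row_key(bucket, country)][column] = url

def compile_hour(key):
    """Hourly job: count the detail columns and store the total in the row."""
    row = column_family[key]
    count = sum(1 for name in row if name != "count")
    row["count"] = count
    return count

def views_between(start, end, country=None, hour_of_day=None):
    """Add up the compiled counts for every hour row in [start, end]."""
    total = 0
    for bucket in range(hour_bucket(start), hour_bucket(end) + 1):
        # The epoch starts at 00:00 GMT, so bucket % 24 is the GMT hour.
        if hour_of_day is not None and bucket % 24 != hour_of_day:
            continue
        row = column_family.get(row_key(bucket, country), {})
        total += row.get("count", 0)
    return total
```

Each of the three queries then becomes a loop over the hour rows in the date
range, optionally restricted to one country or one hour of the day. Only
hours that the compile job has already processed contribute to the total.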

The downside here is that you are storing lots of data on individual
requests and retaining it. If you don't want the detailed data you might add
a second process to purge the detail every hour.

B)

There is a "counter" feature added to the latest versions of Cassandra. I
have not used it, but it should be able to achieve the same effect without
a second process cleaning up every hour. It also makes this more of a
real-time system, so you can see how many requests have arrived in the hour
you are currently in.
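
To illustrate the idea only: the sketch below uses Python's
collections.Counter as an in-memory stand-in and does not call Cassandra's
counter API, which I have not used. The function names and the "views"
column name are made up for the example.

```python
from collections import Counter
import time

SECONDS_PER_HOUR = 60 * 60

# In-memory stand-in for a counter Column Family: (row key, column) -> count.
counters = Counter()

def count_view(unix_time, country):
    """Increment the hourly counters at write time; no detail is stored."""
    bucket = int(unix_time) // SECONDS_PER_HOUR
    counters[(str(bucket), "views")] += 1
    counters[(f"{bucket}:{country}", "views")] += 1

def views_in_current_hour(now=None):
    """Read the running total for the hour we are currently in."""
    now = time.time() if now is None else now
    bucket = int(now) // SECONDS_PER_HOUR
    return counters[(str(bucket), "views")]
```

Because the increment happens on every request, there is nothing to compile
or purge afterwards, and the current hour's total is readable at any time.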



Basically you have to design your approach around the queries you will be
running. Don't get too hung up on traditional data structures and queries, as
they have little relationship to a Cassandra approach.


On Wed, Jan 5, 2011 at 2:34 PM, Dave Viner <daveviner@gmail.com> wrote:

> Does anyone use Cassandra to power an analytics or data warehouse
> implementation?
>
> As a concrete example, one could imagine Cassandra storing data for
> something that reports on page-views on a website.  The basic notions might
> be simple (url as row-key and columns as timeuuids of viewers).  But, how
> would one store things like ip-geolocation to set of pages viewed?  Or
> hour-of-day to pages viewed?
>
> Also, how would one do a query like
> - "tell me how many page views occurred between 12/01/2010 and 12/31/2010"?
> - "tell me how many page views occurred between 12/01/2010 and 12/31/2010
> from the US"?
> - "tell me how many page views occurred between 12/01/2010 and 12/31/2010
> from the US in the 9th hour of the day (in gmt)"?
>
> Time slicing and dimension slicing seems like it might be very challenging
> (especially since the windows of time would not be known in advance).
>
> Thanks
> Dave Viner
>
