cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eric Stevens <>
Subject Re: Getting the counters with the highest values
Date Mon, 24 Nov 2014 16:40:50 GMT
You're right that there's no way to use the counter data type to
materialize a view ordered by the counter.  Computing this post hoc is the
way to go if your needs allow for it (if not, something like Summingbird or
vanilla Storm may be necessary).

I might suggest that you make your primary key for your running totals by
day table be ((day), doc_id) because it will make it easy to compute the
materialized ordered view (SELECT doc_id, count FROM running_totals WHERE
day=?) unless you expect to have a really large number of documents getting
counts each day.

For your materialized ordering, I'd suggest a primary key of ((day), count)
as then for a given day you'll be able to select top by count (SELECT
count, doc_id FROM doc_counts WHERE day=? ORDER BY count DESC).

One more thing to consider if your users are not all in a single timezone
is having your time bucket be hour instead of day so that you can answer by
day goal posted by local midnight (except the handful of locations that use
half hour timezone offsets) instead of a single global midnight.  You can
then either include either just each hour in each row (and aggregate at
read time), or you can make each row a rolling 24 hours (aggregating at
write time), depending on which use case fits your needs better.

On Sun Nov 23 2014 at 8:42:11 AM Robert Wille <> wrote:

> I’m working on moving a bunch of counters out of our relational database
> to Cassandra. For the most part, Cassandra is a very nice fit, except for
> one feature on our website. We manage a time series of view counts for each
> document, and display a list of the most popular documents in the last
> seven days. This seems like a pretty strong anti-pattern for Cassandra, but
> also seems like something a lot of people would want to do. If you’re
> keeping counters, its pretty likely that you’d want to know which ones have
> the highest counts.
> Here’s what I came up with to implement this feature. Create a counter
> table with primary key (doc_id, day) and a single counter. Whenever a
> document is viewed, increment the counter for the document for today and
> the previous six days. Sometime after midnight each day, compile the
> counters into a table with primary key (day, count, doc_id) and no
> additional columns. For each partition in the counter table, I would sum up
> the counters, delete any counters that are over a week old, and put the sum
> into the second table with day = today. When I query the table, i would ask
> for data where day = yesterday. During the compilation process, I would
> delete old partitions. In theory I’d only need two partitions. One that is
> being built, and one for querying.
> I’d be interested to hear critiques on this strategy, as well as hearing
> how other people have implemented a "most-popular" feature using Cassandra
> counters.
> Robert

View raw message