incubator-cassandra-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ryan King <r...@twitter.com>
Subject Distributed Counters Use Cases
Date Wed, 06 Oct 2010 22:30:08 GMT
In the spirit of making sure we have clear communication about our
work, I'd like to outline the use cases Twitter has for distributed
counters. I expect that many of you using Cassandra currently or in
the future will have similar use cases.

The first use case is pretty simple: high scale counters. Our Tweet
button [1] is powered by #1072 counters. We could every mention of
every url that comes through a public tweet. As you would expect,
there are a lot of urls and a lot of traffic to this widget (its on
many high traffic sites, though it is highly cached).

The second is a bit more complex: time series data. We have built
infrastructure that can process logs (in real time from scribe) or
other events  and convert them into a series of keys to increment,
buffer the data for 1 minute and increment those keys. For logs, each
aggregator would do its on increment (so per thing you're tracking you
get an increment for each aggregator), but for events it'll be one
increment per event. We plan to open source all of this soon.

We're hoping to soon start replacing our ganglia clusters with this.
For the ganglia use-case we end up with a large number or increments
for every read. For monitoring data, even a reasonably sized fleet
with a moderate number of metrics can generate a huge amount of data.
Imagine you have 500 machines (not how many we have) and measure 300
(a reasonable estimate based on our experience) metrics per machine.
Suppose you want to measure these things every minute and roll the
values up every hour, day, month and for all time. Suppose also that
you were tracking sum, count, min, max, and sum of squares (so that
you can do standard deviation).  You also want to track these metrics
across groups like web hosts, databases, datacenters, etc.

These basic assumptions would mean this kind of traffic:

(500      +  100  ) *   300   *      5             *     4
3,600,000 increments/minute
(machines   groups)    metrics  time granularities   aggregates

Read traffic, being employee-only would be negligible compared to this.

One other use case is that for many of the metrics we track, we want
to track the usage across several facets.

For example [2] to build our local trends feature, you could store a
time series of terms per city. In this case supercolumns would be a
natural fit because the set of facets is unknown and open:

Imagine a CF that has data like this:

city0 => hour0 => { term1 => 2, term2 => 1000, term3 => 1}, hour1 => {
term5 => 2, term2 => 10}
city1 => hour0 => { term12 => 3, term0 => 500, term3 => 1}, hour1 => {
term5 => 2, term2 => 10}

Of course, there are some other ways to model this data– you could
collapse the subcolumn names into the column names and re-do how you
slice (you have to slice anyway). You have to have fixed width terms
then, though:

city0 => { hour0 + term1 => 2, hour0 + term2 => 1000, hour0 + term3 =>
1}, hour1 => { hour1 + term5 => 2, hour1 + term2 => 10}
city1 => { hour0 + term12 => 3, hour0 + term0 => 500, hour0 + term3 =>
1}, hour1 => { hour1 + term5 => 2, hour1 + term2 => 10}

This is doable, but could be rough.

The other option is to have a separate row for each facet (with a
compound key of [city, term]), and build a custom comparator that only
looks at the first part for generating the token, they we have to do
range slices to get all the facets. Again, doable, but not pretty.


-ryan


1. http://twitter.com/goodies/tweetbutton
2. this is not how we actually do this, but it would be a reasonable approach.

Mime
View raw message