Ed,
I could be completely wrong about this working--I haven't specifically
looked at how the counts are executed, but I think this makes sense.
You could potentially shard each time period across several rows,
using the time period combined with a hash of the username (modulo
the number of shards) as the row key. Run a count on each row and
then add the results up. If your cluster is large enough, this could
spread the computation across enough nodes to make each count query
a bit faster.
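Here is a rough, untested-against-Cassandra sketch of the idea in
Python. An in-memory dict of sets stands in for the column family
(each set models a wide row whose column names are user ids), and
len() stands in for get_count. The names NUM_SHARDS, shard_row_key,
record_visit and unique_visitors are made up for illustration and are
not part of any Cassandra client API.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 16  # hypothetical value; would be tuned to the cluster size


def shard_row_key(period, user_id, num_shards=NUM_SHARDS):
    """Build a row key from the time period and a stable hash of the user id."""
    # md5 is used only because it is stable across processes,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % num_shards
    return f"{period}:{shard}"


def record_visit(store, period, user_id, num_shards=NUM_SHARDS):
    """Insert the user id as a 'column' in its shard row (idempotent)."""
    store[shard_row_key(period, user_id, num_shards)].add(user_id)


def unique_visitors(store, period, num_shards=NUM_SHARDS):
    """Count each shard row separately and add the per-row counts up."""
    return sum(
        len(store.get(f"{period}:{shard}", ()))  # stand-in for get_count
        for shard in range(num_shards)
    )


store = defaultdict(set)
for user in ["ed", "zach", "ed", "alice"]:
    record_visit(store, "2011-10-31", user)
total = unique_visitors(store, "2011-10-31")
```

Because a given user always hashes to the same shard, no user can
show up in two rows for the same period, so the sum of the per-row
counts is the exact unique count.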
Depending on how often this query would be hit, I would still
recommend caching the total, but with the faster counts you could
recompute the real number a little more often.
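For the caching part, a minimal sketch assuming a simple time-based
expiry is good enough. CachedCount and its parameters are
hypothetical names, and compute would be whatever function runs the
real count:

```python
import time


class CachedCount:
    """Cache the result of an expensive count for ttl_seconds."""

    def __init__(self, compute, ttl_seconds, clock=time.monotonic):
        self._compute = compute      # callable that runs the real count
        self._ttl = ttl_seconds      # how long a cached value stays fresh
        self._clock = clock          # injectable so expiry can be tested
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        # Recompute on first use or once the cached value has expired.
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._compute()
            self._expires_at = now + self._ttl
        return self._value
```

A shorter ttl_seconds means the cached number tracks the real count
more closely, at the cost of running the count more often.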
Zach
On Mon, Oct 31, 2011 at 12:22 PM, Ed Anuff <ed@anuff.com> wrote:
> I'm looking at the scenario of how to keep track of the number of
> unique visitors within a given time period. Inserting user ids into a
> wide row would allow me to have a list of every user within the time
> period that the row represented. My experience in the past was that
> using get_count on a row to get the column count got slow pretty quick
> but that might still be the easiest way to get the count of unique
> users with some sort of caching of the count so that it's not
> expensive subsequently. Using Hadoop is overkill for this scenario.
> Any other approaches?
>
> Ed
>