Hi Kasper,

I'd suggest taking a look at Spark, Storm, or Samza (all Apache projects) as possible approaches. Depending on your needs and your existing infrastructure, one of them may work better for you than the others.


On Tue, Apr 1, 2014 at 2:51 AM, Kasper Petersen <kasper@sybogames.com> wrote:

I have a large number (can be >100 million) of (id uuid, score int) entries in Cassandra. I need to, at regular intervals of let's say 30-60 minutes, find the cut-off points for the score needed to be in the top 0.1%, 33% and 66% of all scores.

What would a good approach be to this problem?

All the data won't fit into memory, so regular sorting on the application side won't be possible (unless I do it using a merge sort algorithm with files, which feels like a bad solution).

Iterating over the data once and building a histogram would cut down the required memory usage quite significantly, but I'm afraid this could still end up being "too big". Are there any easier ways to do these computations?
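
For what it's worth, the histogram idea can be sketched in a few lines of Python. This is only an illustration under some assumptions: `scores` stands for any iterable that streams the score column (the Cassandra read itself is not shown), and `build_histogram` and `top_cutoffs` are hypothetical helper names, not part of any library. Memory use grows with the number of distinct score values rather than the number of rows, so if the score range is very wide the scores would need to be grouped into buckets first, at the cost of exactness.

```python
import math
from collections import Counter


def build_histogram(scores):
    """Count how many entries have each score, in one streaming pass."""
    hist = Counter()
    for score in scores:
        hist[score] += 1
    return hist


def top_cutoffs(hist, fractions):
    """Return {fraction: cut-off score} for each requested top fraction.

    The cut-off for a fraction f is the score of the entry at rank
    ceil(f * total) when all entries are ordered from highest to lowest.
    With tied scores, slightly more than f of the entries may reach it.
    """
    total = sum(hist.values())
    if total == 0:
        raise ValueError("histogram is empty")

    # Rank (1-based, from the top) that each fraction corresponds to.
    targets = sorted((max(1, math.ceil(f * total)), f) for f in fractions)

    cutoffs = {}
    seen = 0      # entries counted so far, walking down from the top score
    pending = 0   # index of the next target rank still to be reached
    for score in sorted(hist, reverse=True):
        seen += hist[score]
        while pending < len(targets) and seen >= targets[pending][0]:
            cutoffs[targets[pending][1]] = score
            pending += 1
        if pending == len(targets):
            break
    return cutoffs
```

With this shape, the periodic job would call something like `top_cutoffs(build_histogram(scores), [0.001, 0.33, 0.66])` and store the resulting dictionary wherever the application servers can read it. Only the distinct scores are sorted, never the full data set.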

Lastly, I've thought about using analytics tools to compute these things for me. Would setting up Hadoop and/or Pig help me do this in a manner that could make the results accessible to the application servers once done? I've had a hard time finding any guides on how to set it up and what exactly I'd be able to do with it afterwards. Any pointers would be much appreciated.

Best regards,

Steve Robenalt
Software Architect
HighWire | Stanford University 
425 Broadway St, Redwood City, CA 94063