cassandra-user mailing list archives

From Steven A Robenalt <>
Subject Re: Finding cut-off points
Date Tue, 01 Apr 2014 15:54:22 GMT
Hi Kasper,

I'd suggest taking a look at Spark, Storm, or Samza (all are Apache
projects) for a possible approach. Depending on your needs and your
existing infrastructure, one of those may work better than others for you.


On Tue, Apr 1, 2014 at 2:51 AM, Kasper Petersen <> wrote:

> Hi,
> I have a large number (can be >100 million) of (id uuid, score int)
> entries in Cassandra. I need to, at regular intervals of, let's say, 30-60
> minutes, find the cut-off points for the score needed to be in the top
> 0.1%, 33% and 66% of all scores.
> What would a good approach be to this problem?
> All the data won't fit into memory, so regular sorting on the
> application side won't be possible (unless I do it using a merge sort
> algorithm with files, which feels like a bad solution).
> Iterating over the data once and building a histogram would cut down the
> required memory usage quite significantly, but I'm afraid this could still
> end up being "too big". Are there any easier ways to do these computations?
> Lastly, I've thought about the possibility of using analytics tools to
> compute these things for me - would setting up Hadoop and/or Pig help me do
> this in a manner that could make the results accessible to the application
> servers once done? I've had a hard time finding any guides on how to set it
> up and what exactly I'd be able to do with it afterwards. Any pointers
> would be much appreciated.
> Best regards,
> Kasper
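
For reference, the single-pass histogram idea from the quoted message can be sketched roughly as below. This is only an illustration under some assumptions, not a tested solution for the setup described: the function name top_cutoffs is made up, scores is assumed to be any iterable of ints (for example, rows streamed from a paged table scan, which is not shown here), and memory use grows with the number of distinct score values rather than the number of rows. If the score range is very wide, the scores would need to be bucketed first, which makes the cut-offs approximate.

```python
from collections import Counter
from typing import Dict, Iterable, List


def top_cutoffs(scores: Iterable[int], fractions: List[float]) -> Dict[float, int]:
    """For each fraction f, return the highest score c such that at least
    f of all entries have a score >= c (the cut-off for the "top f").

    Hypothetical helper: one pass over the data builds an exact histogram,
    then the cut-offs are read off by walking the histogram from the top.
    """
    # Pass 1: count how many entries have each score value.
    histogram: Counter = Counter()
    for score in scores:
        histogram[score] += 1

    total = sum(histogram.values())
    if total == 0:
        raise ValueError("no scores supplied")

    # Pass 2 (over distinct values only): accumulate counts from the
    # highest score downward and record where each threshold is crossed.
    pending = sorted(fractions)  # smallest fraction is reached first
    cutoffs: Dict[float, int] = {}
    seen = 0
    i = 0
    for score in sorted(histogram, reverse=True):
        seen += histogram[score]
        while i < len(pending) and seen >= pending[i] * total:
            cutoffs[pending[i]] = score
            i += 1
        if i == len(pending):
            break
    return cutoffs


# Usage with the fractions from the question, on a small stand-in data set.
example = top_cutoffs(range(1, 100001), [0.001, 0.33, 0.66])
print(example)
```

Because ties at the cut-off score all count as being inside the top group, the group can end up slightly larger than the requested fraction. The resulting dictionary is small, so it could be written back to a table or cache for the application servers to read.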

Steve Robenalt
Software Architect
HighWire | Stanford University
425 Broadway St, Redwood City, CA 94063
