cassandra-user mailing list archives

From Kasper Petersen
Subject Finding cut-off points
Date Tue, 01 Apr 2014 09:51:11 GMT

I have a large number (possibly more than 100 million) of (id uuid, score int)
entries in Cassandra. At regular intervals of, let's say, 30-60 minutes, I need
to find the cut-off scores required to be in the top 0.1%, 33% and 66% of all
scores.

What would a good approach be to this problem?

All the data won't fit into memory, so regular sorting on the application
side isn't possible (unless I use an external merge sort with files, which
feels like a bad solution).

Iterating over the data once and building a histogram would cut down the
required memory usage quite significantly, but I'm afraid this could still
end up being "too big". Are there any easier ways to do these computations?
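
To make the histogram idea concrete, here is a rough sketch in Python. It
assumes the scores can be streamed as a plain iterable of ints; how the rows
are actually read from Cassandra is left out, and the function name
score_cutoffs and its parameters are made up for illustration. Memory use
grows with the number of distinct score values rather than the number of
rows, so whether it is "too big" depends on how spread out the scores are.

```python
from collections import Counter


def score_cutoffs(scores, top_fractions=(0.001, 0.33, 0.66)):
    """Return {fraction: lowest score that still falls in that top fraction}.

    `scores` is any iterable of ints (hypothetical: e.g. a paged table scan).
    Only one counter per distinct score value is kept in memory.
    """
    histogram = Counter()
    total = 0
    for score in scores:  # single pass over the data
        histogram[score] += 1
        total += 1
    if total == 0:
        return {}

    pending = sorted(set(top_fractions))  # smallest fraction is reached first
    cutoffs = {}
    seen = 0  # number of entries with a score >= the current one
    idx = 0
    for score in sorted(histogram, reverse=True):  # walk from highest score down
        seen += histogram[score]
        while idx < len(pending) and seen >= pending[idx] * total:
            cutoffs[pending[idx]] = score
            idx += 1
        if idx == len(pending):
            break
    return cutoffs
```

Calling score_cutoffs(stream_of_scores) would return a dict keyed by 0.001,
0.33 and 0.66. If the number of distinct scores is itself too large, the same
loop works on bucketed scores (for example score // 10), at the cost of
cut-offs that are only accurate to the bucket width.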

Lastly, I've thought about using analytics tools to compute these things for
me. Would setting up Hadoop and/or Pig help me do this in a manner that makes
the results accessible to the application servers once done? I've had a hard
time finding any guides on how to set it up and what exactly I'd be able to
do with it afterwards. Any pointers would be much appreciated.

Best regards,
