Hi Aaron and Martin,

Sorry about my previous reply, I thought you wanted to process only the row keys in the CF.

I have a similar issue to Martin's, because I find myself forced to hit more than a million rows with a query (I only fetch a few columns from every row). Aaron, we've talked about this in another thread: basically I am constrained to ship a window of data out of my online cluster to an offline cluster. For that I need to read, for example, a 5-minute window of all the data I have. This simply touches too many rows, and I am hitting the I/O limit on the nodes. As I understand it, every row read costs about two random disk seeks (I have no caches enabled).
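
For reference, my export loop today is roughly the sketch below (raw Thrift API; the node address, keyspace, column family, and column names are placeholders for what I actually use):

import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.KeyRange;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class WindowDump {
    public static void main(String[] args) throws Exception {
        TFramedTransport transport = new TFramedTransport(new TSocket("10.0.0.1", 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();
        client.set_keyspace("MyKeyspace");

        // Only the few columns I actually ship, not the whole row.
        SlicePredicate pred = new SlicePredicate().setColumn_names(
                Arrays.asList(ByteBufferUtil.bytes("colA"), ByteBufferUtil.bytes("colB")));

        // Page over the whole key range, 1000 rows at a time.
        ByteBuffer start = ByteBufferUtil.EMPTY_BYTE_BUFFER;
        while (true) {
            KeyRange range = new KeyRange(1000)
                    .setStart_key(start)
                    .setEnd_key(ByteBufferUtil.EMPTY_BYTE_BUFFER);
            List<KeySlice> page = client.get_range_slices(
                    new ColumnParent("MyCF"), pred, range, ConsistencyLevel.ONE);
            for (KeySlice row : page) {
                // keep only rows inside the 5-minute window and ship their columns
            }
            if (page.size() < 1000) break;
            // the start key is inclusive, so the last row repeats on the next page
            start = page.get(page.size() - 1).key;
        }
        transport.close();
    }
}

Every row in that loop is an individual read, which I believe is where the seek cost comes from.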

My question is: what can I do to improve the performance of shipping these windows of data out in their entirety?

Martin, did you use Hadoop as Aaron suggested? How did that work out with Cassandra? I don't understand how accessing a million rows through MapReduce jobs would be any faster.
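
For concreteness, this is roughly how I picture the Hadoop side, modelled on the word_count example that ships with Cassandra (the address, keyspace, and column names below are made up, so please correct me if this is not what you meant):

import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WindowExport {

    // Each map() call receives one Cassandra row: its key plus whatever
    // columns the slice predicate below selects.
    public static class RowMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {
        @Override
        public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context ctx)
                throws java.io.IOException, InterruptedException {
            // per-row calculation / export would go here
            ctx.write(new Text(ByteBufferUtil.bytesToHex(key)), new LongWritable(columns.size()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "window-export");
        job.setJarByClass(WindowExport.class);
        job.setMapperClass(RowMapper.class);
        job.setNumReduceTasks(0);                 // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/window-export"));

        Configuration conf = job.getConfiguration();
        ConfigHelper.setInputRpcPort(conf, "9160");
        ConfigHelper.setInputInitialAddress(conf, "10.0.0.1");   // any node in the cluster
        ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "MyCF");

        // Again, only pull the few columns needed rather than whole rows.
        SlicePredicate pred = new SlicePredicate().setColumn_names(
                Arrays.asList(ByteBufferUtil.bytes("colA"), ByteBufferUtil.bytes("colB")));
        ConfigHelper.setInputSlicePredicate(conf, pred);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

My understanding is that ColumnFamilyInputFormat turns the ring into input splits by token range and runs the mappers in parallel across the cluster (ideally on the nodes that own the data), so the speedup, if any, would come from scanning everything in parallel rather than from cheaper individual reads. I'd love to hear how it behaved in practice.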


On Tue, Feb 14, 2012 at 10:00 AM, aaron morton <aaron@thelastpickle.com> wrote:
If you want to process 1 million rows, use Hadoop with Hive or Pig. If you use Hadoop, you are not doing things in real time. 

You may need to rephrase the problem. 


Aaron Morton
Freelance Developer

On 14/02/2012, at 11:00 AM, Martin Arrowsmith wrote:

Hi Experts,

My program is such that it queries all keys on Cassandra. I want to do this as quickly as possible, in order to get as close to real time as possible.

One solution I heard of was to use the sstable2json tool and read the data in as JSON. I understand that reading each row from Cassandra individually might take longer.

Are there any other ideas for doing this? Or can you confirm that sstable2json is the way to go?

Querying 100 rows in Cassandra the normal way is fast enough. I'd like to query a million rows, do some calculations on them, and spit out the result like it's real time.

Thanks for any help you can give,