Hi Aaron and Martin,

Sorry about my previous reply; I thought you wanted to process only the row keys in the CF.

I have a similar issue to Martin's, because I find myself forced to hit more than a million rows with a query (I only fetch a few columns from each row). Aaron, we've talked about this in another thread: basically, I am constrained to ship a window of data from my online cluster to an offline cluster. For this I need to read, for example, a 5-minute window of all the data I have. That simply touches too many rows, and I am hitting the I/O limit on the nodes. As I understand it, every row read costs about 2 random disk seeks (I have no caches).
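
For concreteness, this is roughly the read pattern I mean. It is only a minimal sketch using Hector; the cluster, keyspace, CF, column names and batch size are placeholders, and how I work out which keys fall in the window is elided:

import java.util.List;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.Row;
import me.prettyprint.hector.api.beans.Rows;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.MultigetSliceQuery;

public class WindowExport {

    // keysInWindow: however I derive the row keys that fall in the 5-minute
    // window (elided); it easily ends up being more than a million keys.
    public static void shipWindow(Keyspace ks, List<String> keysInWindow) {
        StringSerializer ss = StringSerializer.get();
        int batch = 1000;
        for (int i = 0; i < keysInWindow.size(); i += batch) {
            List<String> page = keysInWindow.subList(i, Math.min(i + batch, keysInWindow.size()));
            MultigetSliceQuery<String, String, String> query =
                    HFactory.createMultigetSliceQuery(ks, ss, ss, ss)
                            .setColumnFamily("MyCF")
                            .setColumnNames("col1", "col2", "col3") // only the few columns I need
                            .setKeys(page.toArray(new String[0]));
            Rows<String, String, String> rows = query.execute().get();
            for (Row<String, String, String> row : rows) {
                // write the row to the offline cluster (elided)
            }
        }
    }

    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("OnlineCluster", "node1:9160");
        Keyspace ks = HFactory.createKeyspace("MyKeyspace", cluster);
        // shipWindow(ks, ...);  // the keys for the window come from elsewhere
    }
}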

My question is: what can I do to improve the performance of shipping entire windows of data out?

Martin, did you use Hadoop as Aaron suggested? How did that work out with Cassandra? I don't understand how accessing a million rows through map/reduce jobs would be any faster?
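
Looking at the word_count example that ships with Cassandra, I assume a job would look roughly like the sketch below (the ConfigHelper method names vary a bit between Cassandra versions, and the keyspace, CF and column names are placeholders). Is the idea simply that the input splits follow the token ranges, so each node's mappers read their own rows in parallel instead of everything going through one client?

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class WindowExportJob {

    // One map() call per row: the input format hands us the row key plus the
    // columns selected by the slice predicate configured below.
    public static class ExportMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {
        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
                throws IOException, InterruptedException {
            // per-row filtering / calculation goes here
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJobName("window-export");
        job.setJarByClass(WindowExportJob.class);
        job.setMapperClass(ExportMapper.class);
        job.setNumReduceTasks(0);                          // map-only job
        job.setOutputFormatClass(NullOutputFormat.class);  // results are written elsewhere
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Point the Cassandra input format at the cluster and the CF to scan.
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        ConfigHelper.setInitialAddress(job.getConfiguration(), "node1");
        ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "MyKeyspace", "MyCF");

        // Only pull the few columns we actually need from each row.
        SlicePredicate predicate = new SlicePredicate().setColumn_names(
                Arrays.asList(ByteBufferUtil.bytes("col1"), ByteBufferUtil.bytes("col2")));
        ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}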

Cheers,
Alexandru
 

On Tue, Feb 14, 2012 at 10:00 AM, aaron morton <aaron@thelastpickle.com> wrote:
If you want to process 1 million rows, use Hadoop with Hive or Pig. If you use Hadoop you are not doing things in real time. 

You may need to rephrase the problem. 

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton

On 14/02/2012, at 11:00 AM, Martin Arrowsmith wrote:

Hi Experts,

My program is such that it queries all keys on Cassandra. I want to do this as quickly as possible, in order to get as close to real time as possible.

One solution I heard of was to use the sstable2json tool and read the data in as JSON. I understand that reading each row through Cassandra might take longer.

Are there any other ideas for doing this? Or can you confirm that sstable2json is the way to go?

Querying 100 rows in Cassandra the normal way is fast enough. I'd like to query a million rows, do some calculations on them, and spit out the result like it's real time.

Thanks for any help you can give,

Martin