Hi Aaron and Martin,
Sorry about my previous reply; I thought you only wanted to process all the row keys in the CF.
I have a similar issue to Martin's: I am forced to hit more than a million rows with a single query (I only get a few columns from each row). Aaron, we've talked about this in another thread. Basically, I am constrained to ship a window of data from my online cluster to an offline cluster, and for this I need to read, for example, a 5-minute window of all the data I have. This simply accesses too many rows, and I am hitting the I/O limit on the nodes. As I understand it, every row costs 2 random disk seeks (I have no caches).
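To put a rough number on that seek cost, here is a back-of-the-envelope sketch. Only the 2 seeks per row and the row count come from this thread; the 8 ms average seek time, the node counts, and the assumption that rows are spread evenly with one seek-bound disk per node are illustrative guesses, not measurements from my cluster.

```python
def estimate_read_seconds(rows, seeks_per_row=2, seek_ms=8.0, nodes=1):
    """Estimate seconds needed to read `rows` rows when every read is seek-bound.

    Assumptions (hypothetical, for illustration only):
      - each row costs `seeks_per_row` random disk seeks (no caches)
      - one seek takes `seek_ms` milliseconds on average
      - rows are spread evenly over `nodes` nodes, one disk each,
        and the nodes serve their share in parallel
    """
    total_seeks = rows * seeks_per_row
    return total_seeks * (seek_ms / 1000.0) / nodes


if __name__ == "__main__":
    for nodes in (1, 4, 16):
        secs = estimate_read_seconds(1_000_000, nodes=nodes)
        print(f"{nodes:>2} node(s): ~{secs / 60:.1f} min")
```

Under these assumptions the read time only drops by spreading the seeks over more disks or by avoiding the seeks altogether, which is why I am asking about alternatives.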
My question is: what can I do to improve the performance of shipping entire windows of data out?
Martin, did you use Hadoop as Aaron suggested? How did that work with Cassandra? I don't understand how accessing 1 million rows through MapReduce jobs would be any faster.
Cheers,
Alexandru
If you want to process 1 million rows use Hadoop with Hive or Pig. If you use Hadoop you are not doing things in real time.
You may need to rephrase the problem.
Cheers

On 14/02/2012, at 11:00 AM, Martin Arrowsmith wrote:
Hi Experts,
My program queries all keys in Cassandra. I want to do this as quickly as possible, in order to get as close to real time as possible.
One solution I heard of was to use the sstable2json tool and read the data in as JSON. I understand that reading each row from Cassandra might take longer.
Are there any other ideas for doing this? Or can you confirm that sstable2json is the way to go?
Querying 100 rows in Cassandra the normal way is fast enough. I'd like to query a million rows, do some calculations on them, and spit out the result as if it were real time.
Thanks for any help you can give,
Martin