You might also want to check whether it's due to disk seeking.

You can verify this by increasing your memory/heap size, or by writing your files to a RAM disk (tmpfs).
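
For example (a rough sketch, assuming a Linux host; the mount point and size are illustrative):

    sudo mkdir -p /mnt/cassandra-tmpfs
    sudo mount -t tmpfs -o size=4g tmpfs /mnt/cassandra-tmpfs

Then point data_file_directories in cassandra.yaml at the mount, restart the node, and re-run the query. If latency drops sharply, disk seeks were the bottleneck.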



On Wed, Aug 31, 2011 at 4:57 PM, Dan Kuebrich <dan.kuebrich@gmail.com> wrote:
There might be some tuning you can do (key cache, etc.), though I can't speak to your particular case, and with 50 column families you'd probably run into pretty bad memory limits.

However, having found myself in a similar situation in the past, I'd suggest experimentally trying different batch sizes for the number of rows (e.g. 1 request for all 900 keys vs. 9 requests of 100 each, etc.), as in the sketch below.  This has helped me solve timeout problems when retrieving "large" numbers of rows and reduced overall retrieval time.  I know that at least the pycassa client supports this type of batched multiget out of the box.
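
For example, with pycassa (a rough sketch only; the keyspace, column family, column names, and batch size are illustrative placeholders, not taken from this thread):

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace', server_list=['localhost:9160'])
    cf = pycassa.ColumnFamily(pool, 'MyColumnFamily')

    keys = ['row%d' % i for i in range(900)]  # stand-ins for the real 900 row keys
    columns = ['col_a', 'col_b', 'col_c']     # the 3 columns being sliced

    # multiget() splits the keys into chunks of buffer_size and issues one
    # underlying multiget_slice call per chunk, so varying buffer_size lets
    # you compare 1x900 vs 9x100, etc., without restructuring your code.
    rows = cf.multiget(keys, columns=columns, buffer_size=100)

Timing that call across a few buffer_size values should show fairly quickly which batch size your cluster is happiest with.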

On Wed, Aug 31, 2011 at 5:13 AM, Renato Bacelar da Silveira <renatods@indabamobile.co.za> wrote:
Hi All

I am running a query against a node with about 50 Column Families.

At present, one of the column families has 2,502,000 rows, and each row
contains 100 columns.

I am searching for 3 specific columns, and am doing so with Thrift's
multiget_slice(). I prepare a request with about 900 row keys, each
asking for a slice of the same 3 columns.

The average time taken for multiget_slice() to return is about 4
seconds. I performed a comparable query in MySQL, and the results
were returned in 0.75 seconds on average.

Is 4 seconds way too much time for Cassandra? I am sure this could
be under 1 second, like MySQL.

I have resized the Thrift transport size to just 1MB so as not to encounter
any timeouts, which I've read can happen if you push too many queries
through. Is this a correct assumption?

So is it too much to push 900 keys into a multiget_slice() at once? I read
that it does a concurrent fetch. I can understand threads racing for
cycles and causing waits, but somehow I think I am wrong somewhere.

Regards to ALL!



Renato da Silveira
Senior Developer
www.indabamobile.co.za


