I'm trying to optimize moving data from Cassandra to HDFS using either
a Ruby or a Python client. Right now, I'm experimenting on my staging
server, an 8 GB single-node machine. My data in Cassandra (1.0.8)
consists of 2 rows (for now) with ~150k super columns each (I know, I
know - super columns are bad). Every super column has ~25 columns
totaling ~800 bytes per super column.
I should also mention that currently the database is static - there are no writes/updates, only reads.
Anyway, in my Python/Ruby scripts, I'm reading slices of 5000
super columns from a single row. It takes 13 seconds with Ruby and 8
seconds with pycassa to get a single slice. In other words, at ~800
bytes per super column (~4 MB per slice), it's currently reading at
no more than about 500 kB per second. The time seems to scale linearly
with the slice length (e.g. 6 seconds for 2500 super columns with
Ruby). If I run nodetool cfstats while my script is running, it
reports a read latency of ~300 ms on the column family.
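For reference, here is a rough sketch of the arithmetic behind those numbers. It uses only the approximate figures quoted above (~800 bytes per super column, 5000 super columns per slice, ~150k super columns per row); the function and variable names are just for illustration, and the per-row estimate assumes the linear scaling I observed holds for the whole row.

```python
# Back-of-the-envelope throughput from the figures in this post.
# All inputs are approximate; names here are illustrative only.

def throughput_kb_per_s(n_super_columns, bytes_per_super_column, seconds):
    """Return the read rate in kB/s (1 kB = 1000 bytes)."""
    total_bytes = n_super_columns * bytes_per_super_column
    return total_bytes / float(seconds) / 1000.0

SLICE_SIZE = 5000        # super columns per slice
BYTES_PER_SC = 800       # approximate size of one super column
ROW_SIZE = 150000        # approximate super columns per row

pycassa_rate = throughput_kb_per_s(SLICE_SIZE, BYTES_PER_SC, 8)
ruby_rate = throughput_kb_per_s(SLICE_SIZE, BYTES_PER_SC, 13)

# Estimated time to read one full row, assuming time scales
# linearly with the number of super columns read.
slices_per_row = ROW_SIZE // SLICE_SIZE
pycassa_row_seconds = slices_per_row * 8
ruby_row_seconds = slices_per_row * 13

print("pycassa: %.0f kB/s, ~%d s per row" % (pycassa_rate, pycassa_row_seconds))
print("ruby:    %.0f kB/s, ~%d s per row" % (ruby_rate, ruby_row_seconds))
```

So even the faster client needs on the order of minutes per row at the current rate.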
I assume that this is not normal and thus was wondering what parameters I could tweak to improve the performance.