On Sep 17, 2012, at 3:04 AM, aaron morton <aaron@thelastpickle.com> wrote:

I have a schema that represents a filesystem and one example of a Super CF is:
This may help with some ideas

In general we advise avoiding Super Columns if possible. They are often slower, and the sub columns are not indexed, meaning all the sub columns of a super column have to be read into memory.

So if I set column_count = 10000, as I have now, but fetch 1000 dirs (rows) and each one happens to have 10000 files (columns) the dataset is 1000x10000.
This is the way the query works internally. Multiget is simply a collection of independent gets, so column_count applies to each row rather than to the result as a whole.
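
To put numbers on the example above, here is a minimal Python sketch of the worst case, assuming column_count is applied to each row independently. The function names and the avg_column_bytes figure are made up for illustration; the byte size is a placeholder, not a measured value.

```python
def worst_case_columns(row_count, column_count):
    """Upper bound on columns returned when the limit is applied per row."""
    return row_count * column_count


def worst_case_bytes(row_count, column_count, avg_column_bytes):
    """Rough upper bound on payload size; avg_column_bytes is an assumption."""
    return worst_case_columns(row_count, column_count) * avg_column_bytes


# 1000 dirs (rows), each capped at 10000 files (columns)
print(worst_case_columns(1000, 10000))  # 10000000

# Assuming ~100 bytes per column (hypothetical figure), convert to MB
approx_mb = worst_case_bytes(1000, 10000, avg_column_bytes=100) / (1024.0 * 1024.0)
print("about %.0f MB in the worst case" % approx_mb)
```

Even with a small per-column size, the total is far beyond a few MB, which is why the number of rows per request matters as much as column_count.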

The multiget() is more efficient, but I'm having trouble trying to limit the size of the data returned in order to not crash the cassandra node.
Often less is more. I would only ask for a few tens of rows at a time, or try to limit the size of the returned result to a few MB. Otherwise a lot of data gets dragged through Cassandra, the network, and finally Python.
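
One way to keep each request small is to split the keys into batches on the client side. The following is only a sketch: multiget_in_batches and chunked are hypothetical helpers, and fetch stands for whatever multiget call your client exposes (assumed here to take a list of keys plus a column_count keyword and to return a dict of row key to columns). The batch size and column_count defaults are arbitrary starting points to tune.

```python
from itertools import islice


def chunked(keys, size):
    """Yield successive lists of at most `size` keys."""
    it = iter(keys)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def multiget_in_batches(fetch, keys, batch_size=20, column_count=1000):
    """Call `fetch` once per batch and yield (key, columns) pairs.

    `fetch` is an assumption: any callable taking (keys, column_count=...)
    and returning a dict that maps row key -> columns.
    """
    for batch in chunked(keys, batch_size):
        result = fetch(batch, column_count=column_count)
        for key, columns in result.items():
            yield key, columns


# Stand-in for a real client call, so the sketch runs without a cluster.
def fake_fetch(batch, column_count):
    return dict((key, {"file.txt": "data"}) for key in batch)


dirs = ["/dir%d" % i for i in range(45)]
rows = list(multiget_in_batches(fake_fetch, dirs, batch_size=20))
print(len(rows))  # 45
```

Because the results are yielded one batch at a time, only one batch needs to be held in memory on the Python side if the caller processes rows as they arrive.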

You may want to consider a CF like the inode CF in the article above, where the parent dir is a column with a secondary index.

Thanks Aaron! I will take your points into consideration.

Best regards,