incubator-cassandra-user mailing list archives

From: aaron morton <aa...@thelastpickle.com>
Subject: Re: Query advice to prevent node overload
Date: Mon, 17 Sep 2012 02:04:57 GMT
> I have a schema that represents a filesystem and one example of a Super CF is:
This may help with some ideas:
http://www.datastax.com/dev/blog/cassandra-file-system-design

In general we advise avoiding Super Columns if possible. They are often slower, and the sub
columns are not indexed, meaning all of the sub columns have to be read into memory. 
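
If it helps to see why, here is a rough pycassa sketch against the FilesPerDir CF from your mail below (untested; the keyspace, pool and keys are made up):

    import pycassa

    # Hypothetical connection details; only the CF name comes from this thread.
    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    files_per_dir = pycassa.ColumnFamily(pool, 'FilesPerDir')

    # Even asking for one attribute of one file makes Cassandra deserialise the
    # whole (FILENAME -> {attributes}) super column on the server, because the
    # sub columns carry no index of their own.
    attrs = files_per_dir.get('/photos/2012',
                              super_column='IMG_0001.jpg',
                              columns=['attribute1'])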


> So if I set column_count = 10000, as I have now, but fetch 1000 dirs (rows) and each one happens to have 10000 files (columns) the dataset is 1000x10000.
This is the way the query works internally. Multiget is simply a collection of independent
gets. 
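
In pycassa terms, the column_count you pass is applied to each key on its own. With the CF handle from the sketch above, and dir_keys standing in for your directory row keys, this:

    result = files_per_dir.multiget(dir_keys, column_count=10000)

behaves like a collection of independent gets, with a worst case of len(dir_keys) * 10000 columns in one result:

    result = dict((k, files_per_dir.get(k, column_count=10000)) for k in dir_keys)

(multiget just leaves out keys that do not exist, where get() would raise NotFoundException.)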

 
> The multiget() is more efficient, but I'm having trouble trying to limit the size of the data returned in order to not crash the cassandra node.
Often less is more. I would only ask for a few tens of rows at a time, or try to limit the
size of the returned query to a few MB. Otherwise a lot of data gets dragged through Cassandra,
the network, and finally Python. 
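
As a rough sketch (the batch sizes and the helper are my own, untested), something like this keeps each request bounded:

    BATCH_SIZE = 20        # a few tens of rows per request
    COLUMN_COUNT = 1000    # keep each row's slice modest as well

    def fetch_dirs(cf, dir_keys):
        """Walk the directories in small multiget batches."""
        for i in range(0, len(dir_keys), BATCH_SIZE):
            batch = dir_keys[i:i + BATCH_SIZE]
            # Worst case per request is now BATCH_SIZE * COLUMN_COUNT columns.
            for dirname, files in cf.multiget(batch, column_count=COLUMN_COUNT).items():
                yield dirname, files
            # A dir with more than COLUMN_COUNT files would need a second pass,
            # paging within the row using column_start from the last file seen.

    # e.g. for dirname, files in fetch_dirs(files_per_dir, dirs_to_move): ...

As far as I remember, the buffer_size you mention only controls how many keys pycassa asks for per Thrift round trip; it does not cap the total amount of data that ends up back in Python.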

You may want to consider a CF like the inode CF in the article above, where the parent dir
is a column with a secondary index. 
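
With that layout, listing a directory would look roughly like this in pycassa, reusing the pool from the sketch above (the CF name, the column name and the existence of a secondary index on 'parent' are all assumptions on my part):

    from pycassa.index import create_index_clause, create_index_expression

    # One row per file, with an indexed 'parent' column naming its directory.
    inode_cf = pycassa.ColumnFamily(pool, 'FileInodes')

    clause = create_index_clause(
        [create_index_expression('parent', '/photos/2012')],  # dir to list
        count=1000)                                           # rows per query

    files = dict(inode_cf.get_indexed_slices(clause))

Listing a directory is then a single indexed query instead of a slice over one very wide row.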

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 14/09/2012, at 10:56 PM, André Cruz <andre.cruz@co.sapo.pt> wrote:

> Hello.
> 
> I have a schema that represents a filesystem and one example of a Super CF is:
> 
> CF FilesPerDir: (DIRNAME -> (FILENAME -> (attribute1: value1, attribute2: value2)))
> 
> And in cases of directory moves, I have to fetch all files of that directory and subdirectories. This implies one cassandra query per dir, or a multiget for all needed dirs. The multiget() is more efficient, but I'm having trouble trying to limit the size of the data returned in order to not crash the cassandra node.
> 
> I'm using the pycassa client lib, and until now I have been doing per-directory get()s specifying a column_count. This effectively limits the size of the dataset, but I would like to perform a multiget() to fetch the contents of multiple dirs at a time. The problem is that it seems that the column_count is per-key, and not global per dataset. So if I set column_count = 10000, as I have now, but fetch 1000 dirs (rows) and each one happens to have 10000 files (columns) the dataset is 1000x10000. Is there a better way to query for this data or does multiget deal with this through the "buffer_size"?
> 
> Thanks,
> André

