cassandra-user mailing list archives

From André Cruz <>
Subject Query advice to prevent node overload
Date Fri, 14 Sep 2012 10:56:13 GMT

I have a schema that represents a filesystem and one example of a Super CF is:

CF FilesPerDir: (DIRNAME -> (FILENAME -> (attribute1: value1, attribute2: value2)))

When a directory is moved, I have to fetch all the files of that directory and its subdirectories.
This implies one Cassandra query per dir, or a multiget() for all needed dirs. The multiget()
is more efficient, but I'm having trouble limiting the size of the data returned so that it
does not crash the Cassandra node.

I'm using the pycassa client lib, and so far I have been doing per-directory get()s, specifying
a column_count. This effectively limits the size of the dataset, but I would like to perform
a multiget() to fetch the contents of multiple dirs at a time. The problem is that the
column_count seems to apply per key, not globally to the whole dataset. So if I set
column_count = 10000, as I have now, but fetch 1000 dirs (rows) and each one happens to have
10000 files (columns), the dataset is 1000 x 10000 = 10,000,000 columns. Is there a better way
to query for this data, or does multiget() deal with this through the "buffer_size"?
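
To make the concern concrete, here is a minimal sketch of one possible workaround, not a
confirmed pycassa feature: split the directory keys into batches so that
(keys per call) x (column_count) never exceeds a chosen budget. The names chunked,
fetch_dirs_bounded, per_dir_limit and max_columns_per_call are hypothetical, and FakeCF is
only an in-memory stand-in so the sketch runs without a cluster. The one assumption about
the real client is that multiget() accepts a list of keys and a column_count keyword, as
described above.

```python
def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_dirs_bounded(cf, dirnames, per_dir_limit=10000,
                       max_columns_per_call=100000):
    """Yield (dirname, files) pairs while capping the size of each multiget().

    Hypothetical helper: `cf` is anything with a
    multiget(keys, column_count=...) method that returns a dict-like
    mapping of key -> columns.
    """
    dirnames = list(dirnames)
    # Worst case per call is keys_per_call * per_dir_limit columns.
    keys_per_call = max(1, max_columns_per_call // per_dir_limit)
    for batch in chunked(dirnames, keys_per_call):
        rows = cf.multiget(batch, column_count=per_dir_limit)
        for dirname in batch:
            if dirname in rows:  # keys with no data are simply skipped
                yield dirname, rows[dirname]


class FakeCF(object):
    """In-memory stand-in for a column family (an assumption, for the demo only)."""

    def __init__(self, rows):
        self.rows = rows
        self.batch_sizes = []  # records how many keys each call asked for

    def multiget(self, keys, column_count=100):
        self.batch_sizes.append(len(keys))
        result = {}
        for key in keys:
            if key in self.rows:
                names = sorted(self.rows[key])[:column_count]
                result[key] = dict((n, self.rows[key][n]) for n in names)
        return result


# Demo: 7 directories with 5 files each, at most 3 files per directory
# and at most 6 columns per call, so 2 keys per call.
demo_rows = dict(
    ("dir%d" % d, dict(("file%d" % f, {"size": f}) for f in range(5)))
    for d in range(7)
)
demo_cf = FakeCF(demo_rows)
demo_result = dict(fetch_dirs_bounded(demo_cf, sorted(demo_rows),
                                      per_dir_limit=3,
                                      max_columns_per_call=6))
```

With the numbers from this message (column_count = 10000) and a budget of, say, 100000
columns, this would send 10 dirs per call instead of all 1000 at once. A dir that comes back
with exactly per_dir_limit entries may have been truncated and would still need follow-up
get()s starting from the last file name returned.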
