Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of bra@fsn.hu designates
 195.228.252.137 as permitted sender)
Message-ID: <50E99C57.1000006@fsn.hu>
Date: Sun, 06 Jan 2013 16:46:31 +0100
From: Attila Nagy <bra@fsn.hu>
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US;
 rv:1.8.1.23) Gecko/20090817 Thunderbird/2.0.0.23 Mnenhy/0.7.6.0
MIME-Version: 1.0
To: user@cassandra.apache.org
Subject: Schema recommendation
Content-Type: multipart/alternative;
 boundary="------------040302000101070601080309"

This is a multi-part message in MIME format.
--------------040302000101070601080309
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Hi,

I'm new to Cassandra, so I wonder what would be the most efficient(*) 
schema for specific operations on the following dataset:
- basically the task is to create a distributed file system with only a 
few allowed operations
- it should handle split brain conditions well (two or three DCs, it's 
possible to get user requests while the inter-DC connections are down)
- each data file is a blob (pretty often, but not always, 8-bit text, 
sometimes with an unknown encoding), ranging from a few hundred bytes to 
around 20-30 MB. Average size is 350 kB.
- it compresses well (gzip -1 gives a 1.63x compression ratio on plain 
files)
- files are mostly immutable
- the few that are not are append-only
- each file has a unique file name (70-100 bytes long), which is 
preserved for the file's lifetime (if it changes, a rewrite is OK).
- the files have metadata attached (various, 1.2's sets and maps are a 
good fit here, but even simple columns should do)
- the files are organized into directories (multi-level; sometimes 
there can be up to a few million files in a dir, but more typically it's 
in the range of zero to a few hundred or a few thousand (up to 10k))
- directories also have metadata (most notably an mtime, which changes 
when the directory contents change and can be used to cache directory 
listings)
- each directory (and the files therein) belongs to a user(name)
- Cassandra 1.2 is fine

Given that designing a schema for Cassandra begins with listing the 
operations, here they are:
1. get the contents of a directory (input: directory (owner) name, 
output:file names)
2. get a file (in: file (dir) name, out:file metadata and contents)
3. put a file (in: file (dir, owner) name, metadata, data)
4. append to a file, chunk size maximum is 4-8 kB (in: file (dir, owner) 
name, data)
5. move a file to a different directory (in: file, dir (owner) name, 
target dir)
6. remove a file (in: file (dir, owner) name)
7. remove all stuff for a user (input: user name), but it's "rare" 
(compared to the above), so walking through the dirlist on the client 
side is OK, it's not performance critical

All (minus the last) should be close to ACID.

I've tried to do the homework (I took a look at Cassandra back in the 
0.6-0.7 days, so now I'm re-learning the new (1.2, CQL 3) way) and still 
couldn't find the best way.

I thought the best would be to start with the docs, without any 
preliminary performance testing.

This brought me the following schema:
CREATE TABLE file (
   name varchar PRIMARY KEY,
   owner varchar,
   dir varchar,
   flags set<ascii>,
   fstat map<ascii, int>,
   data list<blob>
);
CREATE INDEX file_dir ON file (dir);
CREATE INDEX file_owner ON file (owner);

This gives me the following for the operations:
1. something like SELECT name(,etc) FROM file WHERE dir='dirname'; which 
can be LIMIT-ed. Problems: SLOOOW (and maybe, despite the LIMIT, the 
result is materialized in the coordinator's memory, I don't know); it 
also doesn't scale, because all nodes must inspect their index CFs.
2. a simple SELECT data(,etc) FROM file WHERE name='filename'; Problems: 
Cassandra is said not to be good at storing such amounts of data: it has 
to read everything into memory (on the coordinator and the replica 
node), and the client has to hold it in memory as well. But this seems 
acceptable to some degree, because all of the data is needed, so 
fetching it in one go is an optimization. The limiting factors here seem 
to be network speed (nodes pass the data along like a hot potato; the 
slower the network, the longer it has to be kept in memory) and CPU 
speed.
3. a simple INSERT INTO or UPDATE
4. 1.2's lists make appends easy
5. most important. It's a matter of UPDATE file SET dir = 'newdir' 
WHERE name='filename'; This either happens or it doesn't; there is no 
situation where the file is in multiple directories, or nowhere. Even if 
a site (node) doesn't get the update, it still sees the file at the old 
location, and if there is a move there, eventually everything will get 
into the right shape, without the client having to worry about it.
6. a simple DELETE
7. SELECT and DELETEs
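
Spelled out against the table above, I mean roughly these statements (a 
sketch only: all literal values are placeholders, the LIMIT is arbitrary, 
and the 0x blob literal form may depend on the exact 1.2.x version):

```cql
-- 1. list a directory (served by the file_dir secondary index)
SELECT name FROM file WHERE dir = 'dirname' LIMIT 1000;

-- 2. get a file with its metadata
SELECT owner, dir, flags, fstat, data FROM file WHERE name = 'filename';

-- 3. put a file (the blob is the single first chunk)
INSERT INTO file (name, owner, dir, flags, fstat, data)
VALUES ('filename', 'username', 'dirname',
        {'seen'}, {'size': 5}, [0x68656c6c6f]);

-- 4. append one chunk to the end of the list
UPDATE file SET data = data + [0x0a] WHERE name = 'filename';

-- 5. move: a single-column update on a single row
UPDATE file SET dir = 'newdir' WHERE name = 'filename';

-- 6. remove a file
DELETE FROM file WHERE name = 'filename';

-- 7. remove a user: list by owner (file_owner index),
--    then issue one DELETE per returned name from the client
SELECT name FROM file WHERE owner = 'username';
```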

For most operations, it seems fine (but your insightful recommendations 
are welcome :), the biggest pain seems to be listing directories.
It takes from seconds to minutes (or times out).
I could add more CF(s) for example with composite columns (each 
directory being a row), but that would destroy the benefit of the above 
schema for moving files in one (atomic and idempotent) operation.
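
To make that concrete, this is the kind of extra CF I have in mind 
(untested; dir_entry is just a name I made up, and the values are 
placeholders):

```cql
CREATE TABLE dir_entry (
   dir varchar,
   name varchar,
   owner varchar,
   PRIMARY KEY (dir, name)
);

-- listing a directory now reads a single row (partition)
SELECT name FROM dir_entry WHERE dir = 'dirname';

-- but a move now has to touch more than one row
BEGIN BATCH
  DELETE FROM dir_entry WHERE dir = 'olddir' AND name = 'filename';
  INSERT INTO dir_entry (dir, name, owner)
  VALUES ('newdir', 'filename', 'username');
  UPDATE file SET dir = 'newdir' WHERE name = 'filename';
APPLY BATCH;
```

As far as I understand the docs, a 1.2 batch is applied atomically but 
not in isolation, so a reader could briefly see the file in both 
directories or in neither, which is exactly what I'd like to avoid.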

Is there a schema that works well for all of these operations and still 
maintains ACID-like properties?

Thanks,

*: in terms of query efficiency, closeness to ACID

--------------040302000101070601080309--