incubator-cassandra-dev mailing list archives

From Curt Micol
Subject Re: Cassandra data model misconceptions, and their sources
Date Tue, 18 Aug 2009 05:33:24 GMT
I've been thinking about this for a number of days, and again, while I am not a
developer I thought I might toss in a proposal if that's okay.

Since putting together a schema diagram and having a number of people review
it, I think a change is warranted. Too many people are coming from the RDBMS
world and the terms used by Cassandra are conflicting with those terms they
are already familiar with.

The TLDR version is as follows:

Object (Column)
ObjectFamily (ColumnFamily)
Directory (Row)
ObjectContainer (SuperColumn)
Namespace (Keyspace)

The long version...

Object (Column)
As Evan has stated repeatedly, 'column' is a bit misleading, especially when
compared to other types of database systems.  I think this is probably the
most important change to the data model names, and exactly where I started
since this is the 'core' of Cassandra.  Object gives the impression that this
is a piece of data, it's relatively structured but the name gives no
impression how strict that structure is. 'Objects' have names that have values
and timestamps. Simple and to the point. 'Object' doesn't come with the
preconceived notions that 'column' comes with and leaves room for Cassandra to
define what an 'object' is without any conflict to preexisting data

By changing this, we can move up the ladder to other data types and
easily rename them to something that 'contains objects' or 'accesses objects'.
This allows us to describe the data model in the name structure without
having to get too deep into the definition.
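
To make 'names that have values and timestamps' concrete, here is a minimal
sketch in Python. It is not Cassandra's actual code; the class name follows
the proposal, and the field types (str, bytes, int) are assumptions made
only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Object:
    """Proposed name for what Cassandra currently calls a Column."""
    name: str        # how the object is looked up
    value: bytes     # the stored data (type is an assumption)
    timestamp: int   # when the value was written (unit is an assumption)


# Illustrative values only.
email = Object(name="email", value=b"user@example.org", timestamp=1250573604)
print(email.name, email.value, email.timestamp)
```

Nothing in the sketch implies a fixed set of names per record, which is the
tabular assumption that 'column' tends to bring with it.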

Directory (Row)
'row' is currently unnamed, but still a structure that exists in the model.
It's not specifically data itself, but more of a mapping of how to get to
objects (using keys). 'Directory' fills this void quite well. It is easily
explained as a path to get to data and not data itself.

ObjectFamily (ColumnFamily)
There's no argument that the one direct link to the BigTable paper is 'column
families'. It's perhaps the only structure that is virtually the same in both
pieces of software.  Considering this, I think we need to avoid too drastic a
change.  With that said, I think a change is necessary due to the differences
in columns between the two databases. 'object family' is descriptive of the
relation between objects and removes any reference to tabular structures while
keeping a loose relationship to 'column family' in the BigTable paper.

ObjectContainer (SuperColumn)
I could see this being shortened to 'container' in everyday conversation.
However, 'objectcontainer' fits nicely with the rest of the data model names
and is descriptive of its purpose and use. Ultimately a 'supercolumn' is
nothing more than a named container of columns (and I've seen on at least 3
different occasions the word container used to describe supercolumns).
'supercolumn' had no real connection to what exactly it was defining, but with
'object container' we have a clear understanding that we are naming the
structure that holds objects. Or as I explained it to a friend, we are naming
the 'jar' and not the 'honey'. :)
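
The 'jar' and 'honey' distinction can be sketched as nested structures. This
is a hypothetical Python illustration using the proposed names, not
Cassandra's implementation; the dictionary layout, the put() helper and the
sample keys ("user1", "address", "city") are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Object:
    """The 'honey': a name with a value and a timestamp (Column)."""
    name: str
    value: bytes
    timestamp: int


@dataclass
class ObjectContainer:
    """The 'jar': a named holder of Objects (SuperColumn)."""
    name: str
    objects: Dict[str, Object] = field(default_factory=dict)

    def put(self, obj: Object) -> None:
        # Objects are stored under their own name inside the container.
        self.objects[obj.name] = obj


# A Directory (Row) is a path to data, keyed by container name.
Directory = Dict[str, ObjectContainer]

address = ObjectContainer("address")
address.put(Object("city", b"Springfield", 1))
address.put(Object("zip", b"00000", 1))

# An ObjectFamily (ColumnFamily) maps each key to its Directory.
object_family: Dict[str, Directory] = {"user1": {address.name: address}}

# Key -> Directory -> ObjectContainer -> Object
city = object_family["user1"]["address"].objects["city"]
print(city.value)
```

Read from the outside in, each name says what it holds: the family holds
directories, a directory leads to containers, and a container holds objects.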

Namespace (Keyspace)
This one I go back and forth on. I know it's been changed from 'Table' to
'keyspace' and Evan proposed 'database', but I think that 'namespace' is
really what it is we are talking about. Wikipedia has this as the first line
to describe 'namespace':

A namespace is an abstract container or environment created to hold a
logical grouping of unique identifiers or symbols (i.e., names).

Originally I thought 'objectspace' would fit better, but I think 'namespace'
comes with a better history and is clearer about what this structure really is.
Especially when you relate the name namespace to how it is used in Ruby, Python
and Java. Ultimately though, I think I prefer 'keyspace' over 'table'
or 'database'.

The only issue I see with all of these names is the potential conflict with
programming languages and their objects. I know next to nothing about Java so
I don't know if there would be a conflict here. I ran the following Google
search 'reserved words in *' where '*' is Ruby, Python, Java and C++ and
received no mention of 'object' being a reserved word in any of those languages.
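
For Python at least, the reserved-word question can be checked mechanically
rather than by searching. The sketch below uses the standard library's
keyword and builtins modules; the list of names is taken from the proposal,
and checking lowercase variants as well is my own assumption about how the
names might be spelled in client code. It says nothing about Ruby, Java or
C++.

```python
import builtins
import keyword

proposed = ["Object", "ObjectFamily", "Directory", "ObjectContainer", "Namespace"]

# Check both the proposed spelling and an all-lowercase variant.
candidates = proposed + [name.lower() for name in proposed]

# Names that are reserved words in Python and could not be used as identifiers.
conflicts = [name for name in candidates if keyword.iskeyword(name)]

# Names that are legal identifiers but would shadow a Python builtin.
shadowed = [name for name in candidates if hasattr(builtins, name)]

print("reserved:", conflicts)
print("shadows a builtin:", shadowed)
```

None of the names is reserved in Python, but lowercase 'object' is the name
of a builtin type, so a client library would probably want to avoid that
exact spelling.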

I also grep'd through current source code and there don't seem to be any
real conflicts that couldn't be named something else so as not to conflict
with this naming structure.

In the end, I think it's a good idea to look at this and work out a solution.
Documentation and tutorials are going to help, but I think people are so
entrenched in the RDBMS world that there is somewhat of a barrier to
understanding Cassandra's data model.

Thanks for your time,

# Curt Micol
