incubator-cassandra-dev mailing list archives

From Evan Weaver <ewea...@gmail.com>
Subject Fixing the data model names
Date Tue, 11 Aug 2009 17:37:29 GMT
Dear Cassandra Developers,

In my experience, the naming of the data model has been a huge barrier
to entry for users of Cassandra. This goes both for people familiar
with SQL, and for people familiar with BigTable. I would like to
change this before 0.4, since the 0.3 to 0.4 transition is the Great
API Breakening.

I (that is, all of us at Twitter) am willing to write all the patches
and update the wiki, if I get the necessary community buy-in. I hoped
that I could do one patch per external interface change, and then,
after those are complete, one patch per internal interface change
as a phase 2.

So technically this is not a bikeshed, because I'm happy to do all the
work. I'll even submit a patch for Digg's Python client. Since there
are no production deployments of ASF Cassandra, and only a couple of
well-maintained clients, now is the time to break the world. A few
hours of work now will pay off richly in terms of community
involvement and reduced noob-explanation-time.

In general, I think the data model names should have the following goals:

 * Use existing, widely understood terms.
 * Do not use terms that have conflicting meanings.
 * Express analogies in the data model, where useful.
 * Be unambiguous.

Are these goals valid? Clearly I think they are, because I wrote you a
very long email about it. Also, I don't think the current names meet
these goals. Currently, we have:

  Cluster: contains keyspaces.

  This is fine.

  Keyspace: contains column families.

There was some discussion of this change on the list a while back.
Keyspace beats Table by a mile, due to the "conflicting existing
usage" rule, but I think we can do better.

  Column family: containing a name, keys, column type, column sort,
and sub column sort.

  This name is from BigTable, and not in wide usage. It does not
express the hierarchy of storage, rather referring to a side effect of
the storage hierarchy by talking about the most granular data objects.
Confusing.

  Key: associated with columns.

  Since there's no word for the entire
key-and-columns-in-a-column-family thing ("row"), it's hard to talk
about this level of the data model clearly.

  Column: containing a name, value, and timestamp.

  This is from BigTable. In most cases, except when contained within a
super column, the data is row-oriented. There is nothing inherently
columnar about the storage. Furthermore, column is widely understood
from SQL to mean a table-enforced, strongly typed slot. Since
Cassandra does not have a tabular model, this is straight-up wrong.
Timestamps are an additional unexpected innovation in the normal use
of "column".

  Super column: containing a name and columns.

  This is a container of columns. However, the name expresses some
kind of priority order, but nothing about the container nature, even
though that's the most important property. This is not in any other
usage anywhere, and will always require explanation. Despite being a
type of column, it cannot be updated or overwritten like a standard
column, and does not have a timestamp.
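
For readers who think better in code, the current hierarchy can be
pictured as nested maps. This is only an illustrative sketch in
Python, not Cassandra's API: the names (cluster, Keyspace1, Users,
Timelines, get_column) and the sample values are all hypothetical.

```python
# Illustrative only: the current data model pictured as nested dicts.
# A column is stored under its name as a (value, timestamp) pair.
cluster = {
    "Keyspace1": {                      # keyspace
        "Users": {                      # standard column family
            "evan": {                   # key
                "email": ("evan@example.com", 1250012249),  # column
            },
        },
        "Timelines": {                  # super column family
            "evan": {                   # key
                "friends": {            # super column: holds columns,
                                        # has no timestamp of its own
                    "alice": ("1", 1250012249),  # column
                },
            },
        },
    },
}


def get_column(cluster, keyspace, column_family, key, name):
    """Walk the hierarchy down to one column in a standard column family."""
    value, timestamp = cluster[keyspace][column_family][key][name]
    return value, timestamp
```

Note that nothing in the structure above has a name for the
key-plus-columns pair, which is the gap described under "Key".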

Try to approach the naming with the mind of a beginner. For what it's
worth, it took me at least 6 weeks to become comfortable with the
current Cassandra terminology, and I had many false assumptions based
on the names. I remember it took far less than that when starting out
with SQL. At least there you can defer the confusing parts until
later; Cassandra hits you with the confusion all up front. Just
because we are comfortable now, doesn't mean that the current names
are a good thing.

So, on to the new proposed naming. In Cassandra's implementation, each
level of the data model contains the totality of the lower levels.
I've tried to express that in the new names.

  Cluster.

  No change.

  Database (formerly keyspace formerly table).

  Since this is quite literally the same as a database in an RDBMS,
there's no reason to change the term. It's a namespace with a specific
set of storage flags flipped. Its usage is analogous to the same usage
in an RDBMS.

  Record collection (formerly column family).

  This expresses the container nature--an ordered set. The word
"collection" is used in document databases to mean the same thing.

  Record (formerly a-thing-without-a-name)

  This is the row itself. It has a key, and attributes, but the thing
itself is not a key. It is not a "document" because it does not
arbitrarily nest, and it's not "row" because that might imply the
tabular nature of an RDBMS. Record has a history in databases which is
reasonable in this context. It does not imply that a record
necessarily corresponds to a complete object in the application, but
it doesn't rule it out. Since this is the only thing that has a key,
it's still valid to refer to a "key" in isolation, when convenient.

 Attribute (formerly column).

 It has a name, value, and a timestamp. It does not imply anything
about the storage. It does not imply a tabular model. It's more
specific than "tuple", but easier to talk about than "timestamped
key/value pair". It's the same as attributes in any object system.

 Attribute collection (formerly super column).

 This is clearly a container of attributes. That is all it implies,
and that is what it is. It is analogous to record collection.

In short:

  Cluster
  Database
  Record collection
  Record
  Attribute collection
  Attribute
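
To make the containment explicit, here is a rough sketch of the
proposed names as Python types. It is only an illustration of the
naming, not a proposed API: the class and field names (members,
records, record_collections, RENAMES) are my own assumptions, and the
cluster level is left out since its name does not change.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Attribute:                 # formerly "column"
    name: str
    value: bytes
    timestamp: int


@dataclass
class AttributeCollection:       # formerly "super column"
    name: str
    attributes: Dict[str, Attribute] = field(default_factory=dict)


@dataclass
class Record:                    # formerly unnamed; the only level with a key
    key: str
    members: Dict[str, Union[Attribute, AttributeCollection]] = field(
        default_factory=dict
    )


@dataclass
class RecordCollection:          # formerly "column family"
    name: str
    records: Dict[str, Record] = field(default_factory=dict)


@dataclass
class Database:                  # formerly "keyspace", formerly "table"
    name: str
    record_collections: Dict[str, RecordCollection] = field(
        default_factory=dict
    )


# Old term -> proposed term, as listed in this email.
RENAMES = {
    "keyspace": "database",
    "column family": "record collection",
    "column": "attribute",
    "super column": "attribute collection",
}
```

A record would then be built as, say,
Record("evan", {"email": Attribute("email", b"evan@example.com", 1)}),
and each level reads as a container of the level below it.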

We could call the cluster "database collection", but even I'm not
going to go that far. I realize that each level is merely a collection
of the collections under it, but an "attribute collection collection
collection collection" is no help to day-to-day usage. ;-)

As a heuristic, do the current names help, or get in the way? I'm not
married to the new proposal, but I want us to move in the right
direction, and not act like the current unusual naming is a badge of
honor, or forget our own difficulties in getting started.

Keep in mind that BigTable, as an internal Google project, did not
have API clarity as a primary goal; witness the colon-string-API that
got copied by Cassandra originally.

Comments please!

Thanks,

Evan

-- 
Evan Weaver
