incubator-cassandra-dev mailing list archives

From Eric Evans <eev...@rackspace.com>
Subject Re: Fixing the data model names
Date Wed, 12 Aug 2009 04:38:57 GMT
On Tue, 2009-08-11 at 10:37 -0700, Evan Weaver wrote:
> In my experience, the naming of the data model has been a huge barrier
> to entry for users of Cassandra. This goes both for people familiar
> with SQL, and for people familiar with BigTable. I would like to
> change this before 0.4, since the 0.3 to 0.4 transition is the Great
> API Breakening.
> 
> I (that is, all of us at Twitter) are willing to write all the patches
> and update the wiki, if I get the necessary community buy-in. I hoped
> that I could do one patch per each external interface change, and then
> after those are complete, a patch for each internal interface change
> as a phase 2.
> 
> So technically this is not a bikeshed, because I'm happy to do all the
> work. I'll even submit a patch for Digg's Python client. Since there
> are no production deployments of ASF, and only a couple
> well-maintained clients, now is the time to break the world. A few
> hours of work now will pay off richly in terms of community
> involvement and reduced noob-explanation-time.
> 
> In general, I think the data model names should have the following goals:
> 
>  * Use existing, widely understood terms.
>  * Do not use terms that have conflicting meanings.
>  * Express analogies in the data model, where useful.
>  * Be unambiguous.
> 
> Are these goals valid? Clearly I think they are, because I wrote you a
> very long email about it. Also, I don't think the current names meet
> these goals. Currently, we have:
> 
>   Cluster, contains keyspaces:
> 
>   This is fine.
> 
>   Keyspace: contains column families.
> 
> There was some discussion of this change on the list a while back.
> Keyspace beats Table by a mile, due to the "conflicting existing
> usage" rule, but I think we can do better.
> 
>   Column family: containing a name, keys, column type, column sort,
> and sub column sort.
> 
>   This name is from BigTable, and not in wide usage. It does not
> express the hierarchy of storage, rather referring to a side effect of
> the storage hierarchy by talking about the most granular data objects.
> Confusing.

I disagree. 

We have a lot of people coming to us that have read the BigTable paper
(most? all?), and who are already familiar with the term "Column
family". If we change this, people will forever be mapping it from what
we call it, to "Column family", and that is not good.

To put it another way: a widely recognized publication has already
established the terminology for this.

It's also descriptive since the thing we call "Column family" is in fact
a grouping, or family, of columns.

>   Key: associated with columns.
> 
>   Since there's no word for the entire
> key-and-columns-in-a-column-family thing ("row"), it's hard to talk
> about this level of the data model clearly.

Actually, I think "row" works just fine, and without being enshrined in
the interface.

>   Column: containing a name, value, and timestamp.
> 
>   This is from BigTable. In most cases, except when contained within a
> super column, the data is row-oriented. There is nothing inherently
> columnar about the storage. Furthermore, column is widely understood
> from SQL to mean a table-enforced, strongly typed slot. Since
> Cassandra does not have a tabular model, this is straight-up wrong.
> Timestamps are an additional unexpected innovation in the normal use
> of "column".

Another word for column in object relational parlance is "attribute".

>   Super column, containing a name and columns.
> 
>   This is a container of columns. However, the name expresses some
> kind of priority order, but nothing about the container nature, even
> though that's the most important property. This is not in any other
> usage anywhere, and will always require explanation. Despite being a
> type of column, it cannot be updated or overwritten like a standard
> column, and does not have a timestamp.
> 
> Try to approach the naming with the mind of a beginner. For what it's
> worth, it took me at least 6 weeks to become comfortable with the
> current Cassandra terminology, and I had many false assumptions based
> on the names. I remember it took far less than that when starting out
> with SQL. At least there you can defer the confusing parts until
> later; Cassandra hits you with the confusion all up front. Just
> because we are comfortable now, doesn't mean that the current names
> are a good thing.

> So, on to the new proposed naming. In Cassandra's implementation, each
> level of the data model contains the totality of the lower levels.
> I've tried to express that in the new names.
> 
>   Cluster.
> 
>   No change.
> 
>   Database (formerly keyspace formerly table).
> 
>   Since this is quite literally the same as a database in an RDBMS,
> there's no reason to change the term. It's a namespace with a specific
> set of storage flags flipped. Its usage is analogous to the same usage
> in an RDBMS.
> 
>   Record collection (formerly column family).

If a record is analogous to a row, then "record collection" seems to
be a very confusing way of describing a column family (or attributes, if
you will).

>   This expresses the container nature--an ordered set. The word
> "collection" is used in document databases to mean the same thing.
> 
>   Record (formerly a-thing-without-a-name)
> 
>   This is the row itself. It has a key, and attributes, but the thing
> itself is not a key. It is not a "document" because it does not
> arbitrarily nest, and it's not "row" because that might imply the
> tabular nature of an RDBMS. Record has a history in databases which is
> reasonable in this context. It does not imply that a record
> necessarily corresponds to a complete object in the application, but
> it doesn't rule it out. Since this is the only thing that has a key,
> it's still valid to refer to a "key" in isolation, when convenient.

Like "row" above, I think you can use the term "record" when describing
the unit of storage without enshrining it in the interface.

>  Attribute (formerly column).
> 
>  It has a name, value, and a timestamp. It does not imply anything
> about the storage. It does not imply a tabular model. It's more
> specific than "tuple", but easier to talk about than "timestamped
> key/value pair". It's the same as attributes in any object system.
> 
>  Attribute collection (formerly super column).
> 
>  This is clearly a container of attributes. That is all it implies,
> and that is what it is. It is analogous to record collection.

As noted earlier, "attribute" is another way of referring to a column
when talking about a relational database. IMO, if column is confusing,
attribute is worse.
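
For reference, the hierarchy under discussion can be sketched as nested
types. This is only an illustration in Python, not Cassandra's actual
API: the class names follow the current terminology, the field names
and sample values are assumptions based on the descriptions quoted
above, and PROPOSED_NAMES simply restates the proposed renaming.

```python
from dataclasses import dataclass, field
from typing import Dict

# Illustrative only: these classes are hypothetical, not Cassandra's API.


@dataclass
class Column:
    # A column has a name, a value, and a timestamp.
    name: str
    value: bytes
    timestamp: int


@dataclass
class SuperColumn:
    # A named container of columns; it has no timestamp of its own.
    name: str
    columns: Dict[str, Column] = field(default_factory=dict)


@dataclass
class ColumnFamily:
    name: str
    # Row key -> columns stored under that key (informally, a "row").
    rows: Dict[str, Dict[str, Column]] = field(default_factory=dict)


@dataclass
class Keyspace:
    name: str
    column_families: Dict[str, ColumnFamily] = field(default_factory=dict)


# Current term -> proposed term, as listed in the proposal.
PROPOSED_NAMES = {
    "Cluster": "Cluster",
    "Keyspace": "Database",
    "Column family": "Record collection",
    "(unnamed row)": "Record",
    "Super column": "Attribute collection",
    "Column": "Attribute",
}

# Usage: one keyspace, one column family, one row with a single column.
keyspace = Keyspace("ExampleKeyspace")
family = ColumnFamily("Statuses")
keyspace.column_families[family.name] = family
family.rows["user1"] = {"text": Column("text", b"hello", 1250051937)}
```

Whatever the levels end up being called, the nesting itself is the same
under both sets of names.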

> In short:
> 
>   Cluster
>   Database
>   Record collection
>   Record
>   Attribute collection
>   Attribute
> 
> We could call the cluster "database collection", but even I'm not
> going to go that far. I realize that each level is merely a collection
> of the collections under it, but an "attribute collection collection
> collection collection" is no help to day-to-day usage. ;-)
> 
> As a heuristic, do the current names help, or get in the way? I'm not
> married to the new proposal, but I want us to move in the right
> direction, and not act like the current unusual naming is a badge of
> honor, or forget our own difficulties in getting started.
> 
> Keep in mind that BigTable, as an internal Google project, did not
> have API clarity as a primary goal; witness the colon-string-API that
> got copied by Cassandra originally.
> 
> Comments please!

You're proposing some pretty disruptive changes, and as such the benefit
needs to be clear and obvious. IMO, it's not.

The timing is also pretty bad considering we're nearing the end of the
0.4 roadmap, and this wasn't on the list.

-- 
Eric Evans
eevans@rackspace.com

