incubator-cassandra-dev mailing list archives

From Evan Weaver <>
Subject Re: Fixing the data model names
Date Thu, 13 Aug 2009 17:24:20 GMT

What do you see as the benefit of ColumnFamily? As you mentioned on
the -users list, it's not the common case in Cassandra for a document
to span multiple column families. But that seemed to be the motivation
for naming it that in Bigtable--a "row" usually spanned multiple files
(to get the semi-column-oriented thing going on), so you had to have
some name for the individual groups of columns across the distributed
row. This is also the reason that the API parameter order did not
simply follow storage order.

When anyone asks me (and they always ask) I just tell them it's like a
table. Though I am warming up to the subcolumn thing, I still don't
think it makes sense to talk about columns in a non-tabular
multidimensional space. If "attributes" mean columns in
relational-theory land, then attributes are wrong too.
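To make the "non-tabular multidimensional space" concrete, here is a minimal sketch that treats the model as nested maps, using the timeline insert quoted later in this thread. It is only an in-memory illustration, not the Thrift API: the helper names `make_keyspace` and `insert`, and the sample keys "user-5" and "tweet-42", are hypothetical.

```python
from collections import defaultdict
import uuid


def make_keyspace():
    """Hypothetical stand-in for a keyspace, built from nested maps:
    column family -> row key -> supercolumn -> column name -> value."""
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))


def insert(keyspace, column_family, row_key, super_column, column_name, value):
    """Store a value at the four-part path (illustrative helper only)."""
    keyspace[column_family][row_key][super_column][column_name] = value


keyspace = make_keyspace()
now = uuid.uuid1()  # time-based UUID used as the column name

# "user-5" and "tweet-42" are made-up keys for illustration.
insert(keyspace, "UserAssociations", "user-5", "user_timeline", now, "tweet-42")

timeline = keyspace["UserAssociations"]["user-5"]["user_timeline"]
```

Read this way, a "column" is just the innermost map entry, and a supercolumn is one more level of nesting above it, which is where the tabular vocabulary starts to strain.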

It's true that in Cassandra you can talk to any keyspace. On the other
hand, you told me to design my client API as if you couldn't, and to
declare the keyspace at client instantiation. From a usability
perspective I agree that that is correct, but I'm confused about the
intention. So it seems like the clients should not treat the Thrift
API as a design foundation.

I'm going to make a list of data model APIs for other databases and
see if anything falls out.


On Wed, Aug 12, 2009 at 9:47 PM, Jonathan Ellis<> wrote:
> I agree with the proposition that the SuperColumn name is weak.
> (Although not, as I mentioned, Column or ColumnFamily.)  And I could
> go with schema over keyspace.
> One option to deal with SC would be to excise the term SC (and SCF
> from the config) and instead just have Columns, which may or may not
> have SubColumns.  You would define this as
> <ColumnFamily withSubColumns="true" .../>
> "Insert a subcolumn named A into the Column named B" fits pretty well
> with how I think of things working.  And now you just have Rows and
> Columns!  Just like a RDB! :P
> -Jonathan
> On Wed, Aug 12, 2009 at 8:34 PM, Evan Weaver<> wrote:
>> Points taken, and I agree, except in my experience the current names
>> are not Pretty Good but rather Pretty Weird; the primary issues being
>> column family and super column.
>> If we go by the shorter-is-better principle, we might get:
>> Cluster
>> Schema
>> Row set
>> Row w/key
>> Field set
>> Field
>> "You take the user's key, and use that to insert into the Row Set
>> 'user_associations' at Field Set 'user_timeline,' a field named with a
>> time-based UUID representing now, and with a value of the new tweet's
>> key."
>> But let me study for a while and come up with a more researched proposal.
>> Evan
>> On Wed, Aug 12, 2009 at 9:21 PM, Jonathan Ellis<> wrote:
>>> On Wed, Aug 12, 2009 at 7:52 PM, Michael Koziarski<> wrote:
>>>> However I think it's worth considering this from a strategic
>>>> perspective, looking at how we want the project to grow and change,
>>>> rather than just as it is right now.  The key to successful adoption
>>>> is having a successful elevator pitch: you can start using a database
>>>> without understanding relational algebra because 'table' and 'column'
>>>> are such simple ways to reason about the tool.  As it stands,
>>>> cassandra takes a whiteboard and 15 minutes before people get what
>>>> you're talking about.
>>> If you want to explain it as "sort of like a relational db" then
>>> table -> CF
>>> column -> column
>>> key -> key
>>> row -> row
>>> That's the simple case, then all you have is "supercolumns can contain
>>> a list of simple columns."
>>> That really doesn't seem so hard to me.  I have explained this to *managers*.
>>>> Assuming the project gets anything like the adoption it deserves, the
>>>> users we have today will be a *tiny minority* of the users we have in
>>>> the future.  So imposing costs on the current userbase which will give
>>>> huge benefits to future users, should be something we're willing to
>>>> do.  In fact it's something that has been done repeatedly over the
>>>> last few weeks.
>>> I agree.  But as I said before I just don't see this as being an improvement.
>>>> Given those changes went in without debate, I'm not sure what the
>>>> reluctance is for making changes to the nomenclature for the project.
>>> As above.
>>>> Speaking as someone who's only been doing this a month, the naming is
>>>> *still* confusing, and when I talk with people who wonder what
>>>> cassandra is all about I get blank looks when telling them what things
>>>> are called.  If you step back and want to tell someone how you'd
>>>> insert a tweet into someone's timeline using evan's weblog post:
>>>>  "You just take the user's key, and use that to insert into the
>>>> SuperColumnFamily 'UserAssociations' at SubColumn 'user_timeline', a
>>>> ColumnName of a time based uuid representing now, and a value of the
>>>> new tweet's key"
>>>> Column is in the name of 3 of the 5 concepts expressed, and in each
>>>> case it's different.
>>> When you're inserting something nested 3 levels deep a certain amount
>>> of verbosity is unavoidable.  With Evan's nomenclature,
>>> "You take the user's record ID, and use that to insert into the Record
>>> Collection 'user associations' at Attribute Collection
>>> 'user_timeline,' an Attribute named with a time based uuid
>>> representing now, and with a value of the new tweet's key."
>>> I think that is a negative improvement.  Yay, now we are talking about
>>> Attribute Collections and Attributes instead of SuperColumns and
>>> Columns.  The same objections ("one object's name contains the
>>> other's!") apply, plus the new one of sounding so generic that it could
>>> apply to practically any system.
>>> -Jonathan
>> --
>> Evan Weaver

Evan Weaver
