incubator-cassandra-dev mailing list archives

From Ben Standefer <benstande...@gmail.com>
Subject Re: Fixing the data model names
Date Wed, 12 Aug 2009 18:32:00 GMT
View from a converting user (i.e., non-committing lurker): I have spent 2-3
hours having Cassandra's data model explained to me in person at the
hackathon, and the newly proposed language makes a lot more sense to me
right off the bat.  I strongly agree that specifically the naming and
verbiage of the data model poses a high barrier to entry.  The newly
proposed naming scheme conveys the concepts of Cassandra much more clearly.
Converting the column family -> thing with no name -> super column -> column
hierarchy to record collection -> record -> attribute collection ->
attribute removes incorrect connotations and analogies to tables, making it
easier for n00bs to understand that Cassandra is a structured key-value
store with a data model somewhere between memcached/BerkeleyDB and a folder
structure, rather than a table-based storage engine.
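
To make the "folder structure" analogy concrete, here is a minimal sketch of
the proposed hierarchy as plain nested Python dicts. This is only an
illustration, not Cassandra's API: the names (users, ben, emails,
get_attribute) are invented, and it leaves out ordering, timestamps, and
distribution.

```python
# Hypothetical illustration only: plain nested dicts, not Cassandra's API.
# Every name here ("users", "ben", "emails", ...) is invented for the sketch.

# record collection (currently a "column family" that holds super columns)
users = {
    # record (the level that currently has no official name), found by key
    "ben": {
        # attribute collection (currently a "super column")
        "emails": {
            # attribute (currently a "column"): a name mapped to a value
            "home": "ben@example.com",
            "work": "ben@example.org",
        },
    },
}


def get_attribute(collection, key, attribute_collection, name):
    """Walk the hierarchy one level at a time, like a path in a folder tree."""
    return collection[key][attribute_collection][name]


print(get_attribute(users, "ben", "emails", "home"))
```

Each lookup step is a key into the next level down, which is closer to a
key-value store or a directory path than to a row-and-column table.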

I really think the costs of renaming the data model (which Evan has
volunteered to bear the brunt of) should be weighed carefully against the
benefits gained from ease of adoption and increased interest.  If every new
Cassandra user has to power through 4 hours of in-person question-asking
with Cassandra experts to get the data model down, it could easily gain a
reputation for being overly complex to understand and use, when it's really
not too bad.

-Ben Standefer


On Wed, Aug 12, 2009 at 10:58 AM, Evan Weaver <eweaver@gmail.com> wrote:

> It seems so far we have Eric strongly against, and a few others as
> tentatively in favor, with caveats.
>
> Before I address the points specifically, I'd like to refer you to
> this API design manual from the Qt team:
> http://chaos.troll.no/~shausman/api-design/api-design.pdf
> Specifically, a quote: "It is better to have a system omit certain
> anomalous features and improvements, but to reflect one set of design
> ideas, than to have one that contains many good but independent and
> uncoordinated ideas." Right now we have the second, which is
> understandable, historically.
>
> Ok, onward.
>
> Re. Bill, I said cluster contains keyspaces/tables/databases, because
> multiple keyspaces can be defined within a cluster, as per the
> storage-conf.xml. That is all. I also mean it to refer to a physical
> collection of networked machines performing the same work.
>
> Re. Mark, I think collection is a mouthful too. Sets in math are not
> ordered, though, which makes me reluctant to support the use of the
> word "set".
>
> Re. Evans, it is true that Cassandra was influenced by Dynamo and
> BigTable. However, it is not merely a merge of those two. When I was
> getting started, everyone would say "Cassandra uses the BigTable
> model", even though this was not actually the case. Super columns,
> local storage, and no column versioning are all significant and
> confusing divergences. Hypertable and HBase cargo-cult^H^H^H^Hfollow
> that model strictly, so it makes more sense for them to keep the
> terminology.
>
> Database developers have read the BigTable and Dynamo papers. Database
> users have not. They will not, unless they are confused, and if they
> are confused, it will lead them further astray, because Cassandra's
> implementation has diverged.
>
> I disagree that the change would have a huge cost. A couple blog posts
> will be out of date. The Cassandra contributors (all 10 of them) will
> have to do a straightforward mental translation of terms for a few
> days before the new ones become comfortable. In my (statistically
> unsound) polls, the users, who don't even have a full grasp on the
> *current* terminology, will rejoice.
>
> BigTable's innovation was the data model, not the API. The source of
> our API problem is that in the BigTable paper, the API is directed
> towards a specific use case: a semi-column-oriented index store.
> However the data model itself is actually general, and that's what is
> interesting to our project. Things in the BigTable API that cause us
> significant problems:
>  * String-concatenated colon API (we fixed this).
>  * "Table", which prioritizes the column-oriented use, in direct
> opposition to the current use of the terminology (we fixed this, in
> some places).
>  * Being called a "column store", again prioritizing the specific use
> case, which is falsely analogous to relational column stores (this was
> never really enshrined in Cassandra).
>  * Column "families", again prioritizing the specific use case
> (because it assumes that a document is spread across multiple
> families, and that a key, in isolation, refers to a globally unique
> document). Also a phrase used nowhere else in CS.
>  * Having "columns" which are neither tabular columns, nor attributes
> stored in column-major order, but attributes stored in (surprise!)
> row-major order.
>
> Maybe "attribute" is interchangeable with "column" in the relational
> world, but it's used in the (even more widely known) object-oriented
> world too, to mean exactly what we need it to mean. In regards to
> "column family", maybe "attribute family" would be a suitable
> compromise, and be familiar to BigTable people. It's also a grouping
> of keys, and a grouping of records, so I don't know why "column
> family" makes more sense than "key family" or "record family", except
> for historical reasons. If we went with "attribute family", then we
> would have Cluster, Database, Attribute family, Record, Attribute, and
> Attribute collection. What's the difference between "Attribute family"
> and "Attribute collection"? We'd have to revert to the meaningless
> "super" to avoid a conflict, and it breaks the downward hierarchy of
> terms.
>
> For the things which do not have official names, "row", "record",
> etc., I don't think saying "you can call it what you want" is
> workable. I run across this currently at my job, trying to talk about
> things to other people. We settled on "row" but feel weird about it,
> because you can never quite be sure if someone else means the same
> thing you do. So it always requires an explanation.
>
> In regards to the better examples, I did the best possible job I could
> at
> http://blog.evanweaver.com/articles/2009/07/06/up-and-running-with-cassandra/
> to give multiple clear examples. The post is very long, in a large
> part because the terminology is so foreign. Specifically for column
> and super column, I have to quite literally say "column: this is a
> tuple" and "super column: this is a named list". We should call them
> something that actually means that. More examples would help, but we
> need something else, because good examples are already there. This
> post got lots of readership, and is linked on the wiki, yet we still
> have confused users.
>
> I've been happily attacking other messed-up things in the Thrift API,
> with Jonathan's help. Those generate less controversy, so he's already
> committed tons of improvements. No more colon-concatenated API, no
> more modeling choices enshrined in the RPC names, abstracting and
> normalizing column references, etc. There's no reason this won't
> continue.
>
> Evan
>
> PS. It doesn't matter which version it goes into as long as it's before
> 1.0.
>
> On Wed, Aug 12, 2009 at 7:27 AM, Eric Evans<eevans@rackspace.com> wrote:
> > On Tue, 2009-08-11 at 22:34 -0700, Arin Sarkissian wrote:
> >> But realistically how much of this confusion could be avoided with a
> >> legit example? Once you see a good example you start getting it. A lot
> >> of people have been pointed towards the ThriftInterface page on the
> >> wiki, which clears up next to nothing:
> >> http://wiki.apache.org/cassandra/ThriftInterface . There's stuff like
> >> "edges", "base_attributes" etc. It's next door to nonsensical.
> >>
> >> What if we had a real example that people could relate to... a model of a
> >> blog or something along those lines & update the
> >> http://wiki.apache.org/cassandra/ThriftInterface page to show how each
> >> of the API methods would be used to accomplish basic tasks... ex: get
> >> all comments for a blog entry, list entries in time order, list
> >> entries tagged "bar", find all entries with "foo" in the body (kinda
> >> like the Facebook mail search example).
> >
> > Full ACK.
> >
> > Renaming everything carries a huge cost for (IMO) dubious benefit.
> > However, the cost-to-benefit ratio for better documentation and samples
> > seems excellent.
> >
> > --
> > Eric Evans
> > eevans@rackspace.com
> >
> >
>
>
>
> --
> Evan Weaver
>
