From: Evan Weaver
Date: Wed, 12 Aug 2009 10:58:57 -0700
Subject: Re: Fixing the data model names
To: cassandra-dev@incubator.apache.org

It seems so far we have Eric strongly against, and a few others tentatively in favor, with caveats.

Before I address the points specifically, I'd like to refer you to this API design manual from the Qt team: http://chaos.troll.no/~shausman/api-design/api-design.pdf. Specifically, a quote: "It is better to have a system omit certain anomalous features and improvements, but to reflect one set of design ideas, than to have one that contains many good but independent and uncoordinated ideas." Right now we have the second, which is understandable, historically.

Ok, onward.

Re. Bill, I said cluster contains keyspaces/tables/databases, because multiple keyspaces can be defined within a cluster, as per the storage-conf.xml. That is all. I also mean it to refer to a physical collection of networked machines performing the same work.

Re. Mark, I think collection is a mouthful too. Sets in math are not ordered, though, which makes me reluctant to support the use of the word "set".

Re. Evans, it is true that Cassandra was influenced by Dynamo and BigTable.
However, it is not merely a merge of those two. When I was getting started, everyone would say "Cassandra uses the BigTable model", even though this was not actually the case. Super columns, local storage, and no column versioning are all significant and confusing divergences. Hypertable and HBase cargo-cult^H^H^H^Hfollow that model strictly, so it makes more sense for them to keep the terminology.

Database developers have read the BigTable and Dynamo papers. Database users have not. They will not, unless they are confused, and if they are confused, it will lead them further astray, because Cassandra's implementation has diverged.

I disagree that the change would have a huge cost. A couple of blog posts will be out of date. The Cassandra contributors (all 10 of them) will have to do a straightforward mental translation of terms for a few days before the new ones become comfortable. Judging by my (statistically unsound) polls, the users, who don't even have a full grasp of the *current* terminology, will rejoice.

BigTable's innovation was the data model, not the API. The source of our API problem is that in the BigTable paper, the API is directed towards a specific use case: a semi-column-oriented index store. However, the data model itself is actually general, and that's what is interesting to our project.

Things in the BigTable API that cause us significant problems:

* String-concatenated colon API (we fixed this).
* "Table", which prioritizes the column-oriented use, in direct opposition to the current use of the terminology (we fixed this, in some places).
* Being called a "column store", again prioritizing the specific use case, which is falsely analogous to relational column stores (this was never really enshrined in Cassandra).
* Column "families", again prioritizing the specific use case (because it assumes that a document is spread across multiple families, and that a key, in isolation, refers to a globally unique document). Also a phrase used nowhere else in CS.
* Having "columns" which are neither tabular columns, nor attributes stored in column-major order, but attributes stored in (surprise!) row-major order.

Maybe "attribute" is interchangeable with "column" in the relational world, but it's used in the (even more widely known) object-oriented world too, to mean exactly what we need it to mean.

In regard to "column family", maybe "attribute family" would be a suitable compromise, and be familiar to BigTable people. It's also a grouping of keys, and a grouping of records, so I don't know why "column family" makes more sense than "key family" or "record family", except for historical reasons.

If we went with "attribute family", then we would have Cluster, Database, Attribute family, Record, Attribute, and Attribute collection. What's the difference between "Attribute family" and "Attribute collection"? We'd have to revert to the meaningless "super" to avoid a conflict, and it breaks the downward hierarchy of terms.

For the things which do not have official names, "row", "record", etc., I don't think saying "you can call it what you want" is workable. I run across this currently at my job, trying to talk about things to other people. We settled on "row" but feel weird about it, because you can never quite be sure if someone else means the same thing you do. So it always requires an explanation.

In regard to the better examples, I did the best job I could at http://blog.evanweaver.com/articles/2009/07/06/up-and-running-with-cassandra/ to give multiple clear examples. The post is very long, in large part because the terminology is so foreign. Specifically for column and super column, I have to quite literally say "column: this is a tuple" and "super column: this is a named list". We should call them something that actually means that.

More examples would help, but we need something else, because good examples are already there.
This post got lots of readership, and is linked on the wiki, yet we still have confused users.

I've been happily attacking other messed-up things in the Thrift API, with Jonathan's help. Those generate less controversy, so he's already committed tons of improvements. No more colon-concatenated API, no more modeling choices enshrined in the RPC names, abstracting and normalizing column references, etc. There's no reason this won't continue.

Evan

PS. It doesn't matter which version it goes into as long as it's before 1.0.

On Wed, Aug 12, 2009 at 7:27 AM, Eric Evans wrote:
> On Tue, 2009-08-11 at 22:34 -0700, Arin Sarkissian wrote:
>> But realistically how much of this confusion could be avoided with a
>> legit example? Once you see a good example you start getting it. A lot
>> of people have been pointed towards the ThriftInterface page on the
>> wiki which clears up next to nothing:
>> http://wiki.apache.org/cassandra/ThriftInterface . There's stuff like
>> "edges", "base_attributes" etc. It's next door to nonsensical..
>>
>> What if we had a real example that people could relate to... a model of a
>> blog or something along those lines & update the
>> http://wiki.apache.org/cassandra/ThriftInterface page to show how each
>> of the API methods would be used to accomplish basic tasks... ex: get
>> all comments for a blog entry, list entries in time order, list
>> entries tagged "bar", find all entries with "foo" in the body (kinda
>> like the Facebook mail search example).
>
> Full ACK.
>
> Renaming everything carries a huge cost for (IMO) dubious benefit.
> However, the cost-to-benefit ratio for better documentation and samples
> seems excellent.
>
> --
> Eric Evans
> eevans@rackspace.com

--
Evan Weaver