MIME-Version: 1.0
In-Reply-To: <1250087250.23994.198.camel@achilles>
References: <b6f68fc60908111037ofdc0d6csa39543857e3583a2@mail.gmail.com>
	<1250051937.23994.179.camel@achilles>
 <8d9c091a0908112209s11703850td69973809de7cc74@mail.gmail.com>
	<991dde790908112234g5ecafaa5kec3f4387ff11d321@mail.gmail.com>
	<1250087250.23994.198.camel@achilles>
From: Evan Weaver <eweaver@gmail.com>
Date: Wed, 12 Aug 2009 10:58:57 -0700
Message-ID: <b6f68fc60908121058u197d7571ye77f0810b13bc4e9@mail.gmail.com>
Subject: Re: Fixing the data model names
To: cassandra-dev@incubator.apache.org
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

It seems so far we have Eric strongly against, and a few others as
tentatively in favor, with caveats.

Before I address the points specifically, I'd like to refer you to
this API design manual from the Qt team:
http://chaos.troll.no/~shausman/api-design/api-design.pdf.
Specifically, a quote: "It is better to have a system omit certain
anomalous features and improvements, but to reflect one set of design
ideas, than to have one that contains many good but independent and
uncoordinated ideas." Right now we have the second, which is
understandable, historically.

Ok, onward.

Re. Bill, I said cluster contains keyspaces/tables/databases, because
multiple keyspaces can be defined within a cluster, as per the
storage-conf.xml. That is all. I also mean it to refer to a physical
collection of networked machines performing the same work.

Re. Mark, I think collection is a mouthful too. Sets in math are not
ordered, though, which makes me reluctant to support the use of the
word "set".

Re. Evans, it is true that Cassandra was influenced by Dynamo and
BigTable. However, it is not merely a merge of those two. When I was
getting started, everyone would say "Cassandra uses the BigTable
model", even though this was not actually the case. Super columns,
local storage, and no column versioning are all significant and
confusing divergences. Hypertable and HBase cargo-cult^H^H^H^Hfollow
that model strictly, so it makes more sense for them to keep the
terminology.

Database developers have read the BigTable and Dynamo papers. Database
users have not. They will not, unless they are confused, and if they
are confused, it will lead them further astray, because Cassandra's
implementation has diverged.

I disagree that the change would have a huge cost. A couple blog posts
will be out of date. The Cassandra contributors (all 10 of them) will
have to do a straightforward mental translation of terms for a few
days before the new ones become comfortable. In my (statistically
unsound) polls, the users, who don't even have a full grasp on the
*current* terminology, will rejoice.

BigTable's innovation was the data model, not the API. The source of
our API problem is that in the BigTable paper, the API is directed
towards a specific use case: a semi-column-oriented index store.
However the data model itself is actually general, and that's what is
interesting to our project. Things in the BigTable API that cause us
significant problems:
  * String-concatenated colon API (we fixed this).
  * "Table", which prioritizes the column-oriented use, in direct
opposition to the current use of the terminology (we fixed this,
in some places).
  * Being called a "column store", again prioritizing the specific use
case, which is falsely analogous to relational column stores (this was
never really enshrined in Cassandra).
  * Column "families", again prioritizing the specific use case
(because it assumes that a document is spread across multiple
families, and that a key, in isolation, refers to a globally unique
document). Also a phrase used nowhere else in CS.
  * Having "columns" which are neither tabular columns nor attributes
stored in column-major order, but attributes stored in (surprise!)
row-major order.
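
To make the row-major vs. column-major distinction in the last bullet
concrete, here is a minimal sketch using plain Python dicts. It is not
Cassandra's storage format; the record keys, attribute names, and helper
functions are hypothetical and exist only for illustration.

```python
# Illustrative only: plain dicts, not Cassandra's on-disk layout.
# Record keys and attribute names below are made up.

# Row-major: each record key holds all of its own attributes together.
row_major = {
    "user:1": {"name": "alice", "city": "SF"},
    "user:2": {"name": "bob"},  # sparse: this record has no "city"
}

# Column-major: each attribute holds its values across all records.
column_major = {
    "name": {"user:1": "alice", "user:2": "bob"},
    "city": {"user:1": "SF"},
}


def read_record(store, key):
    """Row-major read: one lookup returns every attribute of the record."""
    return store[key]


def read_record_column_major(store, key):
    """Column-major read: probe every attribute to rebuild the record."""
    return {attr: values[key] for attr, values in store.items() if key in values}
```

Both layouts hold the same data; the difference is which lookup is the
cheap one. Fetching a whole record by key is a single lookup only in the
row-major layout, which is the access pattern described above.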

Maybe "attribute" is interchangeable with "column" in the relational
world, but it's used in the (even more widely known) object-oriented
world too, to mean exactly what we need it to mean. With regard to
"column family", maybe "attribute family" would be a suitable
compromise, and be familiar to BigTable people. It's also a grouping
of keys, and a grouping of records, so I don't know why "column
family" makes more sense than "key family" or "record family", except
for historical reasons. If we went with "attribute family", then we
would have Cluster, Database, Attribute family, Record, Attribute, and
Attribute collection. What's the difference between "Attribute family"
and "Attribute collection"? We'd have to revert to the meaningless
"super" to avoid a conflict, and it breaks the downward hierarchy of
terms.

For the things which do not have official names, "row", "record",
etc., I don't think saying "you can call it what you want" is
workable. I run across this currently at my job, trying to talk about
things to other people. We settled on "row" but feel weird about it,
because you can never quite be sure if someone else means the same
thing you do. So it always requires an explanation.

As for better examples, I did the best job I could at
http://blog.evanweaver.com/articles/2009/07/06/up-and-running-with-cassandra/
to give multiple clear examples. The post is very long, in large
part because the terminology is so foreign. Specifically for column
and super column, I have to quite literally say "column: this is a
tuple" and "super column: this is a named list". We should call them
something that actually means that. More examples would help, but we
need something else, because good examples are already there. This
post got lots of readership, and is linked on the wiki, yet we still
have confused users.
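
The "column: this is a tuple" and "super column: this is a named list"
reading above can be sketched roughly as follows. This is a hedged
illustration in Python, not the Thrift API: the field names assume a
column is a (name, value, timestamp) triple, and the helper
column_names and the sample data are hypothetical.

```python
from typing import List, NamedTuple


class Column(NamedTuple):
    """A column read as a tuple. Field names are an assumption."""
    name: str
    value: str
    timestamp: int


class SuperColumn(NamedTuple):
    """A super column read as a named list of columns."""
    name: str
    columns: List[Column]


def column_names(super_column: SuperColumn) -> List[str]:
    """Return the names of the columns held by a super column, in order."""
    return [column.name for column in super_column.columns]


# Hypothetical example data: one blog comment stored as a super column.
comment = SuperColumn(
    name="comment-1",
    columns=[
        Column(name="author", value="evan", timestamp=1),
        Column(name="body", value="hello", timestamp=2),
    ],
)
```

Seen this way, the existing names describe a pair of very ordinary
structures, which is the point being argued: a name closer to "tuple"
and "named list" would need less explanation.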

I've been happily attacking other messed-up things in the Thrift API,
with Jonathan's help. Those generate less controversy, so he's already
committed tons of improvements. No more colon-concatenated API, no
more modeling choices enshrined in the RPC names, abstracting and
normalizing column references, etc. There's no reason this won't
continue.

Evan

PS. It doesn't matter which version it goes into as long as it's before 1.0.

On Wed, Aug 12, 2009 at 7:27 AM, Eric Evans<eevans@rackspace.com> wrote:
> On Tue, 2009-08-11 at 22:34 -0700, Arin Sarkissian wrote:
>> But realistically how much of this confusion could be avoided with a
>> legit example? Once you see a good example you start getting it. A lot
>> of people have been pointed towards the ThriftInterface page on the
>> wiki which clears up next to nothing:
>> http://wiki.apache.org/cassandra/ThriftInterface . There's stuff like
>> "edges", "base_attributes" etc. It's next door to nonsensical..
>>
>> What if we had a real example that people could relate to... a model of a
>> blog or something along those lines & update the
>> http://wiki.apache.org/cassandra/ThriftInterface page to show how each
>> of the API methods would be used to accomplish basic tasks... ex: get
>> all comments for a blog entry, list entries in time order, list
>> entries tagged "bar", find all entries with "foo" in the body (kinda
>> like the Facebook mail search example).
>
> Full ACK.
>
> Renaming everything carries a huge cost for (IMO) dubious benefit.
> However, the cost-to-benefit ratio for better documentation and samples
> seems excellent.
>
> --
> Eric Evans
> eevans@rackspace.com
>
>


-- 
Evan Weaver