incubator-cassandra-dev mailing list archives

From Michael Greene <michael.gre...@gmail.com>
Subject Re: Fixing the data model names
Date Thu, 13 Aug 2009 06:33:20 GMT
The internals of Thrift are not scary.  They're lex/yacc, so they're a
little opaque at first, but once you understand the model, it
applies to many parser generators.  Really, try hacking something
together for the Ruby generator if there's something you see missing.

With regard to recursive structures, unfortunately *that* would be
difficult in Thrift because of a decision made for the C++ library.
You could get parser support for it, and implement it for many of the
other language libraries, but it cannot be done with the current C++
library, for reasons best found by searching the Thrift mailing list
archives or by asking someone who knows more about it.
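
To make the shape under discussion concrete, here is a minimal sketch in
Python rather than Thrift IDL, assuming the recursive Column proposed
downthread (a name, plus either subcolumns or a timestamp and value).  The
class, the validation rule, and the make_timeline_entry helper are
illustrative assumptions, not Cassandra or Thrift APIs.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Column:
    """Hypothetical recursive column: subcolumns, or timestamp plus value."""
    name: bytes
    subcolumns: Optional[List["Column"]] = None
    timestamp: Optional[int] = None
    value: Optional[bytes] = None

    def __post_init__(self) -> None:
        has_subcolumns = self.subcolumns is not None
        has_leaf_data = self.timestamp is not None and self.value is not None
        # Exactly one of timestamp/value being set is always invalid.
        has_partial = (self.timestamp is not None) != (self.value is not None)
        # Valid only when exactly one of the two forms is present.
        if has_partial or has_subcolumns == has_leaf_data:
            raise ValueError("need either subcolumns, or both timestamp and value")


def make_timeline_entry(tweet_key: bytes, now_micros: int) -> Column:
    """Build a 'user_timeline' column holding one time-UUID-named subcolumn."""
    entry = Column(name=uuid.uuid1().bytes, timestamp=now_micros, value=tweet_key)
    return Column(name=b"user_timeline", subcolumns=[entry])
```

A leaf carries a timestamp and a value; a parent carries only subcolumns,
which is the "either/or" rule from the proposal.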

Avro is not nearly there.  Good work is being done on it, but only
C++, Java, and Python implementations have any reasonable progress,
and it is still being hashed out.  It could fit Cassandra well for
longer-held connections, once it's mature.

Michael

On Wed, Aug 12, 2009 at 11:39 PM, Evan Weaver<eweaver@gmail.com> wrote:
> PS. How's Avro these days? Or could we patch Thrift? Haven't looked at
> the internals but assume they're scary.
>
> On Thu, Aug 13, 2009 at 12:23 AM, Evan Weaver<eweaver@gmail.com> wrote:
>> Incidentally, is there any specific reason the collation has to be
>> pre-defined at the CF? What if any column could be an optional
>> supercolumn with a collation set at runtime? Then all CFs would be the
>> same.
>>
>> Evan
>>
>> On Wed, Aug 12, 2009 at 10:02 PM, Jonathan Ellis<jbellis@gmail.com> wrote:
>>> If thrift were sane it would look something like
>>>
>>> struct Column {
>>>  byte[] name,
>>>  optional list<Column> subcolumns,
>>>  optional int64 timestamp,
>>>  optional byte[] value
>>> }
>>>
>>> "you can either have the subcolumns, or the timestamp and value" seems
>>> reasonable to me.
>>>
>>> of course in the real world, thrift can't do recursive structures, so
>>> we'd have to go with Column/SubColumn like SuperColumn/Column today.
>>> So... maybe not really an improvement after all. :)
>>>
>>> (Why am I not surprised to find out that protocol buffers does support
>>> this?  Sigh.)
>>>
>>> On Wed, Aug 12, 2009 at 8:51 PM, Evan Weaver<eweaver@gmail.com> wrote:
>>>> Hmm, my Ruby client internally refers to columns and subcolumns,
>>>> rather than supercolumns and columns...mainly because the subcolumn
>>>> position is optional, but the column_or_supercolumn position is not.
>>>> So there is something we agree on.
>>>>
>>>> Do you think the lack of a timestamp in the supercolumn is confusing?
>>>> It's still not exactly a kind of column.
>>>>
>>>> Evan
>>>>
>>>> On Wed, Aug 12, 2009 at 9:47 PM, Jonathan Ellis<jbellis@gmail.com> wrote:
>>>>> I agree with the proposition that the SuperColumn name is weak.
>>>>> (Although not, as I mentioned, Column or ColumnFamily.)  And I could
>>>>> go with schema over keyspace.
>>>>>
>>>>> One option to deal with SC would be to excise the term SC (and SCF
>>>>> from the config) and instead just have Columns, which may or may not
>>>>> have SubColumns.  You would define this as
>>>>>
>>>>> <ColumnFamily withSubColumns="true" .../>
>>>>>
>>>>> "Insert a subcolumn named A into the Column named B" fits pretty well
>>>>> with how I think of things working.  And now you just have Rows and
>>>>> Columns!  Just like a RDB! :P
>>>>>
>>>>> -Jonathan
>>>>>
>>>>> On Wed, Aug 12, 2009 at 8:34 PM, Evan Weaver<eweaver@gmail.com> wrote:
>>>>>> Points taken, and I agree, except in my experience the current names
>>>>>> are not Pretty Good but rather Pretty Weird; the primary issues being
>>>>>> column family and super column.
>>>>>>
>>>>>> If we go by the shorter-is-better principle, we might get:
>>>>>>
>>>>>> Cluster
>>>>>> Schema
>>>>>> Row set
>>>>>> Row w/key
>>>>>> Field set
>>>>>> Field
>>>>>>
>>>>>> "You take the user's key, and use that to insert into the Row Set
>>>>>> 'user_associations' at Field Set 'user_timeline,' a field named with a
>>>>>> time-based UUID representing now, and with a value of the new tweet's
>>>>>> key."
>>>>>>
>>>>>> But let me study for a while and come up with a more researched proposal.
>>>>>>
>>>>>> Evan
>>>>>>
>>>>>>> On Wed, Aug 12, 2009 at 9:21 PM, Jonathan Ellis<jbellis@gmail.com> wrote:
>>>>>>> On Wed, Aug 12, 2009 at 7:52 PM, Michael Koziarski<michael@koziarski.com> wrote:
>>>>>>>> However I think it's worth considering this from a strategic
>>>>>>>> perspective, looking at how we want the project to grow and change,
>>>>>>>> rather than just as it is right now.  The key to successful adoption
>>>>>>>> is having a successful elevator pitch: you can start using a database
>>>>>>>> without understanding relational algebra because 'table' and 'column'
>>>>>>>> are such simple ways to reason about the tool.  As it stands,
>>>>>>>> Cassandra's takes a whiteboard and 15 minutes before people get what
>>>>>>>> you're talking about.
>>>>>>>
>>>>>>> If you want to explain it as "sort of like a relational db" then
>>>>>>>
>>>>>>> table -> CF
>>>>>>> column -> column
>>>>>>> key -> key
>>>>>>> row -> row
>>>>>>>
>>>>>>> That's the simple case, then all you have is "supercolumns can contain
>>>>>>> a list of simple columns."
>>>>>>>
>>>>>>> That really doesn't seem so hard to me.  I have explained this to *managers*.
>>>>>>>
>>>>>>>> Assuming the project gets anything like the adoption it deserves, the
>>>>>>>> users we have today will be a *tiny minority* of the users we have in
>>>>>>>> the future.  So imposing costs on the current userbase which will give
>>>>>>>> huge benefits to future users, should be something we're willing to
>>>>>>>> do.  In fact it's something that has been done repeatedly over the
>>>>>>>> last few weeks.
>>>>>>>
>>>>>>> I agree.  But as I said before I just don't see this as being an improvement.
>>>>>>>
>>>>>>>> Given those changes went in without debate, I'm not sure what the
>>>>>>>> reluctance is for making changes to the nomenclature for the project.
>>>>>>>
>>>>>>> As above.
>>>>>>>
>>>>>>>> Speaking as someone who's only been doing this a month, the naming is
>>>>>>>> *still* confusing, and when I talk with people who wonder what
>>>>>>>> Cassandra is all about I get blank looks when telling them what things
>>>>>>>> are called.  If you step back and want to tell someone how you'd
>>>>>>>> insert a tweet into someone's timeline using Evan's weblog post:
>>>>>>>>
>>>>>>>>  "You just take the user's key, and use that to insert into the
>>>>>>>> SuperColumnFamily 'UserAssociations' at SubColumn 'user_timeline', a
>>>>>>>> ColumnName of a time based uuid representing now, and a value of the
>>>>>>>> new tweet's key"
>>>>>>>>
>>>>>>>> Column is in the name of 3 of the 5 concepts expressed, and in each
>>>>>>>> case it's different.
>>>>>>>
>>>>>>> When you're inserting something nested 3 levels deep a certain amount
>>>>>>> of verbosity is unavoidable.  With Evan's nomenclature,
>>>>>>>
>>>>>>> "You take the user's record ID, and use that to insert into the Record
>>>>>>> Collection 'user associations' at Attribute Collection
>>>>>>> 'user_timeline,' an Attribute named with a time based uuid
>>>>>>> representing now, and with a value of the new tweet's key."
>>>>>>>
>>>>>>> I think that is a negative improvement.  Yay, now we are talking about
>>>>>>> Attribute Collections and Attributes instead of SuperColumns and
>>>>>>> Columns.  The same objections ("one object's name contains the
>>>>>>> other's!") apply, plus the new one of sounding so generic that it could
>>>>>>> apply to practically any system.
>>>>>>>
>>>>>>> -Jonathan
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Evan Weaver
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Evan Weaver
>>>>
>>>
>>
>>
>>
>> --
>> Evan Weaver
>>
>
>
>
> --
> Evan Weaver
>
