cassandra-user mailing list archives

From Mike Malone <m...@simplegeo.com>
Subject Re: Is SuperColumn necessary?
Date Tue, 11 May 2010 00:02:37 GMT
On Mon, May 10, 2010 at 4:31 PM, AJ Chen <ajchen@web2express.org> wrote:

> supercolumn is good for modeling profile type of data. simple example is
> blog:
> blog { blog {author,  title, ...}
>          comments   {time: commenter}  //sort by TimeUUID
> }
> when retrieving a blog, you get all the comments sorted by time already.
> without supercolumn, you would need to concatenate multiple comment times
> together as you suggested.
>
> Requiring the user to concatenate data fields together is not only an extra
> burden on the user but also a less clean design.  There will be cases where the
> list property of a profile is a long list (say a million items). In
> such cases, the user wants to be able to directly insert/delete an item in that
> list because it's more efficient.  Retrieving the whole list, updating it,
> concatenating again, and then putting it back to the datastore is awkward and
> less efficient.
>

There's nothing you said here that can't be implemented efficiently using
columns. You can slice rows and get a subset of Columns. In fact, this
example is particularly easy to implement. If you have a Blog with Entries
and Comments you'd do:

  <ColumnFamily Name="Blog" CompareWith="UTF8Type" />

  Insert blog post:
    batch_mutate(key=<blog post id>, [{name="~post:author", value=<author>},
                                      {name="~post:title", value=<title>}, ...])
  Insert comment:
    batch_mutate(key=<blog post id>, [{name=<TimeUUID> + ":author", ... }])

Then you can get the Post only (slice for ["~", ""]), the comments only
(slice for ["", "~"]), or the post _and_ comments (slice for ["", ""]).
Inserting a comment does _not_ require a get/concatenate/insert.
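
The naming scheme above can be sketched in a few lines of Python. This is only
an in-memory simulation, not the Thrift API: slice_columns is a hypothetical
helper, the row is a plain dict, and the comment ids are made-up sortable
strings standing in for TimeUUIDs. It assumes column names compare as plain
strings, as with UTF8Type, so "~" sorts after every comment id.

```python
import bisect

def slice_columns(row, start="", finish=""):
    """Return (name, value) pairs with start <= name <= finish.

    An empty string means "unbounded" on that side, mirroring the
    ["", ""] slice notation used in the message.
    """
    names = sorted(row)  # stand-in for the comparator's on-disk ordering
    lo = bisect.bisect_left(names, start) if start else 0
    hi = bisect.bisect_right(names, finish) if finish else len(names)
    return [(name, row[name]) for name in names[lo:hi]]

# One row per blog post; column names carry the structure.
row = {}
row["~post:author"] = "mike"
row["~post:title"] = "Is SuperColumn necessary?"

# Adding a comment is a plain insert: no read/concatenate/write cycle.
# The ids below are invented placeholders for TimeUUIDs.
for comment_id, author, body in [
    ("20100510T163100", "aj", "supercolumns are handy"),
    ("20100510T171500", "jonathan", "agreed"),
]:
    row[comment_id + ":author"] = author
    row[comment_id + ":body"] = body

post_only = slice_columns(row, "~", "")      # slice for ["~", ""]
comments_only = slice_columns(row, "", "~")  # slice for ["", "~"]
everything = slice_columns(row)              # slice for ["", ""]
```

Here post_only holds the two "~post:" columns, comments_only holds the four
comment columns in id order, and everything holds all six.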

Yes, concatenating the names on the client side is hacky, clunky, and
inconvenient. That's why we _should_ build an interface that doesn't require
the client to concatenate names. But SuperColumns aren't the right way to do
it. They add no value. They could be implemented in client libraries, for
example, and nobody would know the difference.
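
As a rough illustration of what "implemented in client libraries" could mean,
here is a minimal sketch of a wrapper that offers a SuperColumn-like
(group, name) interface while storing only flat column names. NestedColumns,
its methods, and the ":" separator are all hypothetical, and the row is again
an in-memory dict rather than a real Cassandra connection.

```python
class NestedColumns:
    """Hypothetical client-side helper: callers see (group, name) pairs,
    storage sees flat column names joined with a separator."""

    SEP = ":"  # assumed separator; group ids must not contain it

    def __init__(self):
        self.row = {}  # stand-in for one row: flat column name -> value

    def insert(self, group, name, value):
        # The concatenation happens here, not in application code.
        self.row[group + self.SEP + name] = value

    def get_group(self, group):
        # A real client would issue a range slice on the prefix;
        # a linear scan keeps the sketch short.
        prefix = group + self.SEP
        return {
            flat[len(prefix):]: value
            for flat, value in sorted(self.row.items())
            if flat.startswith(prefix)
        }

    def groups(self):
        # Distinct group ids in column-name order.
        seen = []
        for flat in sorted(self.row):
            group = flat.split(self.SEP, 1)[0]
            if group not in seen:
                seen.append(group)
        return seen


blog = NestedColumns()
blog.insert("~post", "author", "mike")
blog.insert("20100510T163100", "author", "aj")
blog.insert("20100510T163100", "body", "supercolumns are handy")

first_comment = blog.get_group("20100510T163100")
group_ids = blog.groups()
```

The application code never builds a composite name by hand, which is the
convenience SuperColumns provide today.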

To really understand the problem with SuperColumns, though, you need to look
at the Cassandra source. Removing SuperColumns would make the code-base much
cleaner and tighter, and would probably reduce SLOC by 20%. I think a
replacement that assumed nested Columns (or Entries, or Thingies) would be
much cleaner. That's what Stu is working on.

Mike

> On Mon, May 10, 2010 at 2:20 PM, Mike Malone <mike@simplegeo.com> wrote:
>
>> On Mon, May 10, 2010 at 1:38 PM, AJ Chen <ajchen@web2express.org> wrote:
>>
>>> Could someone confirm this discussion is not about abandoning supercolumn
>>> family? I have found modeling data with supercolumn family is actually an
>>> advantage of Cassandra compared to relational database. Hope you are not going to
>>> drop this important concept.  How it's implemented internally is a different
>>> matter.
>>>
>>
>> SuperColumns are useful as a convenience mechanism. That's pretty much it.
>> There's _nothing_ (as far as I can tell) that you can do with SuperColumns
>> that you can't do by manually concatenating key names with a separator on
>> the client side and implementing a custom comparator on the server (as ugly
>> as that is).
>>
>> This discussion is about getting rid of SuperColumns and adding a more
>> generic mechanism that will actually be useful and interesting and will
>> continue to be convenient for the types of use cases for which people use
>> SuperColumns.
>>
>> If there's a particular use case that you feel you can only implement with
>> SuperColumns, please share! I honestly can't think of any.
>>
>> Mike
>>
>>
>>> On Mon, May 10, 2010 at 10:08 AM, Jonathan Shook <jshook@gmail.com> wrote:
>>>
>>>> Agreed
>>>>
>>>> On Mon, May 10, 2010 at 12:01 PM, Mike Malone <mike@simplegeo.com>
>>>> wrote:
>>>> > On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook <jshook@gmail.com> wrote:
>>>> >>
>>>> >> I have to disagree about the naming of things. The name of something
>>>> >> isn't just a literal identifier. It affects the way people think about
>>>> >> it. For new users, the whole naming thing has been a persistent
>>>> >> barrier.
>>>> >
>>>> > I'm saying we shouldn't be worried too much about coming up with names
>>>> > and analogies until we've decided what it is we're naming.
>>>> >
>>>> >>
>>>> >> As for your suggestions, I'm all for simplifying or generalizing the
>>>> >> "how it works" part down to a more generalized set of operations. I'm
>>>> >> not sure it's a good idea to require users to think in terms of building
>>>> >> up a fluffy query structure just to thread it through a needle of an
>>>> >> API, even for the simplest of queries. At some point, the level of
>>>> >> generic boilerplate takes away from the semantic hand rails that
>>>> >> developers like. So I guess I'm suggesting that "how it works" and
>>>> >> "how we use it" are not always exactly the same. At least they should
>>>> >> both hinge on a common conceptual model, which is where the naming
>>>> >> becomes an important anchoring point.
>>>> >
>>>> > If things are done properly, client libraries could expose simplified
>>>> > query interfaces without much effort. Most ORMs these days work by
>>>> > building a propositional directed acyclic graph that's serialized to SQL.
>>>> > This would work the same way, but it wouldn't be converted into a 4GL.
>>>> > Mike
>>>> >
>>>> >>
>>>> >> Jonathan
>>>> >>
>>>> >> On Mon, May 10, 2010 at 11:37 AM, Mike Malone <mike@simplegeo.com> wrote:
>>>> >> > Maybe... but honestly, it doesn't affect the architecture or interface
>>>> >> > at all. I'm more interested in thinking about how the system should
>>>> >> > work than what things are called. Naming things is important, but that
>>>> >> > can happen later.
>>>> >> > Does anyone have any thoughts or comments on the architecture I
>>>> >> > suggested earlier?
>>>> >> >
>>>> >> > Mike
>>>> >> >
>>>> >> > On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <zsongbo@gmail.com>
>>>> >> > wrote:
>>>> >> >>
>>>> >> >> Yes, the "column" here is not appropriate.
>>>> >> >> Maybe we need not create new terms; in Google's Bigtable, the term
>>>> >> >> "qualifier" is a good one.
>>>> >> >>
>>>> >> >> On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <david@lookin2.com>
>>>> >> >> wrote:
>>>> >> >>>
>>>> >> >>> That would be a good time to get rid of the confusing "column" term,
>>>> >> >>> which incorrectly suggests a two-dimensional tabular structure.
>>>> >> >>>
>>>> >> >>> Suggestions:
>>>> >> >>>
>>>> >> >>> 1. A hypercube (or hypocube, if only two dimensions): replace "key"
>>>> >> >>> and "column" with "1st dimension", "2nd dimension", etc.
>>>> >> >>>
>>>> >> >>> 2. A file system: replace "key" and "column" with "directory" and
>>>> >> >>> "subdirectory"
>>>> >> >>>
>>>> >> >>> 3. A tuple tree: "Column family" replaced by top-level tuple, whose
>>>> >> >>> value is the set of keys, whose value is the set of supercolumns of
>>>> >> >>> the key, whose value is the set of columns for the supercolumn, etc.
>>>> >> >>>
>>>> >> >>> 4. Etc.
>>>> >> >>>
>>>> >> >>> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mike@simplegeo.com>
>>>> >> >>> wrote:
>>>> >> >>>>
>>>> >> >>>> Nice, Ed, we're doing something very similar but less generic.
>>>> >> >>>> Now replace all of the various methods for querying with a simple
>>>> >> >>>> query interface that takes a Predicate, allow the user to specify (in
>>>> >> >>>> storage-conf) which levels of the nested Columns should be indexed,
>>>> >> >>>> and completely remove Comparators and have people subclass Column /
>>>> >> >>>> implement IColumn and we'd really be on to something ;).
>>>> >> >>>> Mock storage-conf.xml:
>>>> >> >>>>   <Column Name="ThingThatsNowKey" Indexed="True"
>>>> >> >>>>           ClusterPartitioned="True" Type="UTF8">
>>>> >> >>>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
>>>> >> >>>>             Type="UTF8">
>>>> >> >>>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>>>> >> >>>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
>>>> >> >>>>                 Type="ASCII">
>>>> >> >>>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>>>> >> >>>>         </Column>
>>>> >> >>>>       </Column>
>>>> >> >>>>     </Column>
>>>> >> >>>>   </Column>
>>>> >> >>>> Thrift:
>>>> >> >>>>   struct NamePredicate {
>>>> >> >>>>     1: required list<binary> column_names,
>>>> >> >>>>   }
>>>> >> >>>>   struct SlicePredicate {
>>>> >> >>>>     1: required binary start,
>>>> >> >>>>     2: required binary end,
>>>> >> >>>>   }
>>>> >> >>>>   struct CountPredicate {
>>>> >> >>>>     1: required struct predicate,
>>>> >> >>>>     2: required i32 count=100,
>>>> >> >>>>   }
>>>> >> >>>>   struct AndPredicate {
>>>> >> >>>>     1: required Predicate left,
>>>> >> >>>>     2: required Predicate right,
>>>> >> >>>>   }
>>>> >> >>>>   struct SubColumnsPredicate {
>>>> >> >>>>     1: required Predicate columns,
>>>> >> >>>>     2: required Predicate subcolumns,
>>>> >> >>>>   }
>>>> >> >>>>   ... OrPredicate, OtherUsefulPredicates ...
>>>> >> >>>>   query(predicate, count, consistency_level) # Count here would be
>>>> >> >>>> total count of leaf values returned, whereas CountPredicate specifies
>>>> >> >>>> a column count for a particular sub-slice.
>>>> >> >>>> Not fully baked... but I think this could really simplify stuff and
>>>> >> >>>> make it more flexible. Downside is it may give people enough rope to
>>>> >> >>>> hang themselves, but at least the predicate stuff is easily
>>>> >> >>>> distributable.
>>>> >> >>>> I'm thinking I'll play around with implementing some of this stuff
>>>> >> >>>> myself if I have any free time in the near future.
>>>> >> >>>> Mike
>>>> >> >>>>
>>>> >> >>>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <jbellis@gmail.com>
>>>> >> >>>> wrote:
>>>> >> >>>>>
>>>> >> >>>>> Very interesting, thanks!
>>>> >> >>>>>
>>>> >> >>>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed@anuff.com> wrote:
>>>> >> >>>>> > Follow-up from last week's discussion, I've been playing around
>>>> >> >>>>> > with a simple column comparator for composite column names that I
>>>> >> >>>>> > put up on github.  I'd be interested to hear what people think of
>>>> >> >>>>> > this approach.
>>>> >> >>>>> >
>>>> >> >>>>> > http://github.com/edanuff/CassandraCompositeType
>>>> >> >>>>> >
>>>> >> >>>>> > Ed
>>>> >> >>>>> >
>>>> >> >>>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed@anuff.com> wrote:
>>>> >> >>>>> >>
>>>> >> >>>>> >> It might make sense to create a CompositeType subclass of
>>>> >> >>>>> >> AbstractType for the purpose of constructing and comparing these
>>>> >> >>>>> >> types of "composite" column names so that you could more easily
>>>> >> >>>>> >> do that sort of thing rather than having to concatenate into one
>>>> >> >>>>> >> big string.
>>>> >> >>>>> >>
>>>> >> >>>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone
>>>> >> >>>>> >> <mike@simplegeo.com> wrote:
>>>> >> >>>>> >>>
>>>> >> >>>>> >>> The only thing SuperColumns appear to buy you (as someone pointed
>>>> >> >>>>> >>> out to me at the Cassandra meetup - I think it was Eric
>>>> >> >>>>> >>> Florenzano) is that you can use different comparator types for
>>>> >> >>>>> >>> the Super/SubColumns, I guess..? But you should be able to do the
>>>> >> >>>>> >>> same thing by creating your own Column comparator.
>>>> >> >>>>> >>> I guess my point is that SuperColumns are mostly a convenience
>>>> >> >>>>> >>> mechanism, as far as I can tell.
>>>> >> >>>>> >>> Mike
>>>> >> >>>>> >
>>>> >> >>>>> >
>>>> >> >>>>>
>>>> >> >>>>>
>>>> >> >>>>>
>>>> >> >>>>> --
>>>> >> >>>>> Jonathan Ellis
>>>> >> >>>>> Project Chair, Apache Cassandra
>>>> >> >>>>> co-founder of Riptano, the source for professional Cassandra support
>>>> >> >>>>> http://riptano.com
>>>> >> >>>>
>>>> >> >>>
>>>> >> >>
>>>> >> >
>>>> >> >
>>>> >
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> AJ Chen, PhD
>>> Chair, Semantic Web SIG, sdforum.org
>>> http://web2express.org
>>> twitter @web2express
>>> Palo Alto, CA, USA
>>>
>>
>>
>
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
>
