cassandra-user mailing list archives

From AJ Chen <ajc...@web2express.org>
Subject Re: Is SuperColumn necessary?
Date Tue, 11 May 2010 01:11:04 GMT
In your implementation, are the comments still sorted by time? Will UTF8Type
sort <TimeUUID>:author by time?
Thanks,
-aj
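
A minimal sketch of the concern behind this question, assuming UTF8Type orders column names as plain strings and that the TimeUUID is written in its canonical hex form. The helper `make_time_uuid` and the fixed node value are hypothetical and exist only for illustration. Because the canonical form starts with the low 32 bits of the timestamp, string order and time order can disagree:

```python
import uuid

def make_time_uuid(timestamp: int) -> uuid.UUID:
    """Build a version 1 UUID from a 60-bit timestamp (fixed clock seq and node)."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # version 1
    # clock_seq_hi_variant=0x80 marks the RFC 4122 variant; the node is arbitrary.
    return uuid.UUID(
        fields=(time_low, time_mid, time_hi_version, 0x80, 0x00, 0x123456789ABC)
    )

# Two timestamps one tick apart, chosen so the low 32 bits wrap around.
earlier = make_time_uuid(0x1FFFFFFFF)
later = make_time_uuid(0x200000000)

names = [f"{earlier}:author", f"{later}:author"]  # listed in time order

# Order produced by a plain string comparison of the column names.
by_string = sorted(names)
# Order produced by comparing the embedded timestamps instead.
by_time = [f"{u}:author" for u in sorted([later, earlier], key=lambda u: u.time)]

print(by_string)
print(by_time)
```

In this sketch the two orders differ, which is why a time-aware comparison of the UUID component would be needed to keep comments in chronological order.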

On Mon, May 10, 2010 at 5:02 PM, Mike Malone <mike@simplegeo.com> wrote:

> On Mon, May 10, 2010 at 4:31 PM, AJ Chen <ajchen@web2express.org> wrote:
>
>> SuperColumns are good for modeling profile-type data. A simple example is a
>> blog:
>> blog { blog {author,  title, ...}
>>          comments   {time: commenter}  //sort by TimeUUID
>> }
>> When retrieving a blog, you get all the comments already sorted by time.
>> Without SuperColumns, you would need to concatenate multiple comment times
>> together, as you suggested.
>>
>> Requiring users to concatenate data fields is not only an extra burden on
>> them but also a less clean design. There will be cases where a list
>> property of the profile data is long (say, a million items). In such cases,
>> the user wants to insert or delete an item in that list directly because
>> it's more efficient. Retrieving the whole list, updating it, concatenating
>> it again, and then putting it back into the datastore is awkward and less
>> efficient.
>>
>
> There's nothing you said here that can't be implemented efficiently using
> columns. You can slice rows and get a subset of Columns. In fact, this
> example is particularly easy to implement. If you have a Blog with Entries
> and Comments you'd do:
>
>   <ColumnFamily Name="Blog" CompareWith="UTF8Type" />
>
>   Insert blog post:
>     batch_mutate(key=<blog post id>, [{name="~post:author", value=<author>},
>                  {name="~post:title", value=<title>}, ...])
>   Insert comment:
>     batch_mutate(key=<blog post id>, [{name=<TimeUUID> + ":author", ... }])
>
> Then you can get the Post only (slice for ["~", ""]), the comments only
> (slice for ["", "~"]), or the post _and_ comments (slice for ["", ""]).
> Inserting a comment does _not_ require a get/concatenate/insert.
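
The three slices described above can be sketched with an in-memory row whose column names are kept in string order. This is only an illustration under stated assumptions: `slice_columns` is a hypothetical stand-in for a server-side slice, the short ids `0001` and `0002` stand in for time-ordered comment identifiers, and an empty bound means "unbounded", as in the message.

```python
from bisect import bisect_left, bisect_right

def slice_columns(row: dict, start: str, end: str) -> list:
    """Return (name, value) pairs with start <= name <= end in string order.

    An empty string for either bound means that side is unbounded.
    """
    names = sorted(row)
    lo = bisect_left(names, start) if start else 0
    hi = bisect_right(names, end) if end else len(names)
    return [(name, row[name]) for name in names[lo:hi]]

# One row per blog post: post fields are prefixed with "~", which sorts
# after ASCII digits, letters, and hyphens, so they land at the end of the row.
row = {
    "~post:author": "mike",
    "~post:title": "Hello",
    "0001:author": "aj",   # hypothetical comment ids, not real TimeUUIDs
    "0002:author": "ed",
}

post_only = slice_columns(row, "~", "")
comments_only = slice_columns(row, "", "~")
everything = slice_columns(row, "", "")

print(post_only)
print(comments_only)
print(everything)
```

Adding a comment in this layout is a single new key in the row, with no read or concatenation of the existing comments.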
>
> Yes, concatenating the names on the client side is hacky, clunky, and
> inconvenient. That's why we _should_ build an interface that doesn't require
> the client to concatenate names. But SuperColumns aren't the right way to do
> it. They add no value. They could be implemented in client libraries, for
> example, and nobody would know the difference.
>
> To really understand the problem with SuperColumns, though, you need to
> look at the Cassandra source. Removing SuperColumns would make the code-base
> much cleaner and tighter, and would probably reduce SLOC by 20%. I think a
> replacement that assumed nested Columns (or Entries, or Thingies) would be
> much cleaner. That's what Stu is working on.
>
> Mike
>
> On Mon, May 10, 2010 at 2:20 PM, Mike Malone <mike@simplegeo.com> wrote:
>>
>>> On Mon, May 10, 2010 at 1:38 PM, AJ Chen <ajchen@web2express.org> wrote:
>>>
>>>> Could someone confirm this discussion is not about abandoning the
>>>> SuperColumn family? I have found that modeling data with SuperColumn
>>>> families is actually an advantage of Cassandra compared to relational
>>>> databases. I hope you are not going to drop this important concept. How
>>>> it's implemented internally is a different matter.
>>>>
>>>
>>> SuperColumns are useful as a convenience mechanism. That's pretty much
>>> it. There's _nothing_ (as far as I can tell) that you can do with
>>> SuperColumns that you can't do by manually concatenating key names with a
>>> separator on the client side and implementing a custom comparator on the
>>> server (as ugly as that is).
>>>
>>> This discussion is about getting rid of SuperColumns and adding a more
>>> generic mechanism that will actually be useful and interesting and will
>>> continue to be convenient for the types of use cases for which people use
>>> SuperColumns.
>>>
>>> If there's a particular use case that you feel you can only implement
>>> with SuperColumns, please share! I honestly can't think of any.
>>>
>>> Mike
>>>
>>>
>>>> On Mon, May 10, 2010 at 10:08 AM, Jonathan Shook <jshook@gmail.com> wrote:
>>>>
>>>>> Agreed
>>>>>
>>>>> On Mon, May 10, 2010 at 12:01 PM, Mike Malone <mike@simplegeo.com> wrote:
>>>>> > On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook <jshook@gmail.com> wrote:
>>>>> >>
>>>>> >> I have to disagree about the naming of things. The name of something
>>>>> >> isn't just a literal identifier. It affects the way people think about
>>>>> >> it. For new users, the whole naming thing has been a persistent
>>>>> >> barrier.
>>>>> >
>>>>> > I'm saying we shouldn't be worried too much about coming up with names
>>>>> > and analogies until we've decided what it is we're naming.
>>>>> >
>>>>> >>
>>>>> >> As for your suggestions, I'm all for simplifying or generalizing the
>>>>> >> "how it works" part down to a more generalized set of operations. I'm
>>>>> >> not sure it's a good idea to require users to think in terms of building
>>>>> >> up a fluffy query structure just to thread it through a needle of an
>>>>> >> API, even for the simplest of queries. At some point, the level of
>>>>> >> generic boilerplate takes away from the semantic hand rails that
>>>>> >> developers like. So I guess I'm suggesting that "how it works" and
>>>>> >> "how we use it" are not always exactly the same. At least they should
>>>>> >> both hinge on a common conceptual model, which is where the naming
>>>>> >> becomes an important anchoring point.
>>>>> >
>>>>> > If things are done properly, client libraries could expose simplified
>>>>> > query interfaces without much effort. Most ORMs these days work by
>>>>> > building a propositional directed acyclic graph that's serialized to
>>>>> > SQL. This would work the same way, but it wouldn't be converted into a
>>>>> > 4GL.
>>>>> > Mike
>>>>> >
>>>>> >>
>>>>> >> Jonathan
>>>>> >>
>>>>> >> On Mon, May 10, 2010 at 11:37 AM, Mike Malone <mike@simplegeo.com> wrote:
>>>>> >> > Maybe... but honestly, it doesn't affect the architecture or
>>>>> >> > interface at all. I'm more interested in thinking about how the
>>>>> >> > system should work than what things are called. Naming things is
>>>>> >> > important, but that can happen later.
>>>>> >> > Does anyone have any thoughts or comments on the architecture I
>>>>> >> > suggested earlier?
>>>>> >> >
>>>>> >> > Mike
>>>>> >> >
>>>>> >> > On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang <zsongbo@gmail.com> wrote:
>>>>> >> >>
>>>>> >> >> Yes, the term "column" is not appropriate here.
>>>>> >> >> Maybe we need not create new terms; in Google's Bigtable, the term
>>>>> >> >> "qualifier" is a good one.
>>>>> >> >>
>>>>> >> >> On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn <david@lookin2.com> wrote:
>>>>> >> >>>
>>>>> >> >>> That would be a good time to get rid of the confusing "column" term,
>>>>> >> >>> which incorrectly suggests a two-dimensional tabular structure.
>>>>> >> >>>
>>>>> >> >>> Suggestions:
>>>>> >> >>>
>>>>> >> >>> 1. A hypercube (or hypocube, if only two dimensions): replace "key"
>>>>> >> >>> and "column" with "1st dimension", "2nd dimension", etc.
>>>>> >> >>>
>>>>> >> >>> 2. A file system: replace "key" and "column" with "directory" and
>>>>> >> >>> "subdirectory"
>>>>> >> >>>
>>>>> >> >>> 3. A tuple tree: "Column family" replaced by top-level tuple, whose
>>>>> >> >>> value is the set of keys, whose value is the set of supercolumns of
>>>>> >> >>> the key, whose value is the set of columns for the supercolumn, etc.
>>>>> >> >>>
>>>>> >> >>> 4. Etc.
>>>>> >> >>>
>>>>> >> >>> On Thu, May 6, 2010 at 2:28 AM, Mike Malone <mike@simplegeo.com> wrote:
>>>>> >> >>>>
>>>>> >> >>>> Nice, Ed, we're doing something very similar but less generic.
>>>>> >> >>>> Now replace all of the various methods for querying with a simple
>>>>> >> >>>> query interface that takes a Predicate, allow the user to specify
>>>>> >> >>>> (in storage-conf) which levels of the nested Columns should be
>>>>> >> >>>> indexed, and completely remove Comparators and have people subclass
>>>>> >> >>>> Column / implement IColumn and we'd really be on to something ;).
>>>>> >> >>>> Mock storage-conf.xml:
>>>>> >> >>>>   <Column Name="ThingThatsNowKey" Indexed="True"
>>>>> >> >>>>           ClusterPartitioned="True" Type="UTF8">
>>>>> >> >>>>     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True"
>>>>> >> >>>>             Type="UTF8">
>>>>> >> >>>>       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
>>>>> >> >>>>         <Column Name="ThingThatsNowColumnName" Indexed="True"
>>>>> >> >>>>                 Type="ASCII">
>>>>> >> >>>>           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
>>>>> >> >>>>         </Column>
>>>>> >> >>>>       </Column>
>>>>> >> >>>>     </Column>
>>>>> >> >>>>   </Column>
>>>>> >> >>>> Thrift:
>>>>> >> >>>>   struct NamePredicate {
>>>>> >> >>>>     1: required list<binary> column_names,
>>>>> >> >>>>   }
>>>>> >> >>>>   struct SlicePredicate {
>>>>> >> >>>>     1: required binary start,
>>>>> >> >>>>     2: required binary end,
>>>>> >> >>>>   }
>>>>> >> >>>>   struct CountPredicate {
>>>>> >> >>>>     1: required Predicate predicate,
>>>>> >> >>>>     2: required i32 count=100,
>>>>> >> >>>>   }
>>>>> >> >>>>   struct AndPredicate {
>>>>> >> >>>>     1: required Predicate left,
>>>>> >> >>>>     2: required Predicate right,
>>>>> >> >>>>   }
>>>>> >> >>>>   struct SubColumnsPredicate {
>>>>> >> >>>>     1: required Predicate columns,
>>>>> >> >>>>     2: required Predicate subcolumns,
>>>>> >> >>>>   }
>>>>> >> >>>>   ... OrPredicate, OtherUsefulPredicates ...
>>>>> >> >>>>   query(predicate, count, consistency_level)
>>>>> >> >>>>   # Count here would be the total count of leaf values returned,
>>>>> >> >>>>   # whereas CountPredicate specifies a column count for a
>>>>> >> >>>>   # particular sub-slice.
>>>>> >> >>>> Not fully baked... but I think this could really simplify stuff and
>>>>> >> >>>> make it more flexible. Downside is it may give people enough rope to
>>>>> >> >>>> hang themselves, but at least the predicate stuff is easily
>>>>> >> >>>> distributable.
>>>>> >> >>>> I'm thinking I'll play around with implementing some of this stuff
>>>>> >> >>>> myself if I have any free time in the near future.
>>>>> >> >>>> Mike
>>>>> >> >>>>
>>>>> >> >>>> On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
>>>>> >> >>>>>
>>>>> >> >>>>> Very interesting, thanks!
>>>>> >> >>>>>
>>>>> >> >>>>> On Wed, May 5, 2010 at 1:31 PM, Ed Anuff <ed@anuff.com> wrote:
>>>>> >> >>>>> > Follow-up from last week's discussion: I've been playing around
>>>>> >> >>>>> > with a simple column comparator for composite column names that
>>>>> >> >>>>> > I put up on github. I'd be interested to hear what people think
>>>>> >> >>>>> > of this approach.
>>>>> >> >>>>> >
>>>>> >> >>>>> > http://github.com/edanuff/CassandraCompositeType
>>>>> >> >>>>> >
>>>>> >> >>>>> > Ed
>>>>> >> >>>>> >
>>>>> >> >>>>> > On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff <ed@anuff.com> wrote:
>>>>> >> >>>>> >>
>>>>> >> >>>>> >> It might make sense to create a CompositeType subclass of
>>>>> >> >>>>> >> AbstractType for the purpose of constructing and comparing these
>>>>> >> >>>>> >> types of "composite" column names, so that you could more easily
>>>>> >> >>>>> >> do that sort of thing rather than having to concatenate into one
>>>>> >> >>>>> >> big string.
>>>>> >> >>>>> >>
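
The idea of comparing composite names component by component, rather than as one concatenated string, can be sketched as follows. This is not the code from the CassandraCompositeType repository; `compare_composite` and the `(timestamp, field)` layout are assumptions made only for illustration.

```python
from functools import cmp_to_key

def compare_composite(a: tuple, b: tuple) -> int:
    """Compare two composite names one component at a time.

    Each component is compared with its own natural ordering (ints as
    numbers, strings as strings); a shorter name sorts before a longer
    name that shares its prefix.
    """
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    return (len(a) > len(b)) - (len(a) < len(b))

# Hypothetical column names of the form (timestamp, field).
columns = [(20, "author"), (3, "body"), (3, "author")]

# Component-wise ordering keeps the numeric timestamps in numeric order.
composite_order = sorted(columns, key=cmp_to_key(compare_composite))

# Concatenating into one string compares "20" and "3" character by character.
string_order = sorted(f"{t}:{field}" for t, field in columns)

print(composite_order)
print(string_order)
```

The concatenated form puts timestamp 20 ahead of timestamp 3, while the component-wise comparison keeps them in numeric order without any padding tricks.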
>>>>> >> >>>>> >> On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone <mike@simplegeo.com> wrote:
>>>>> >> >>>>> >>>
>>>>> >> >>>>> >>> The only thing SuperColumns appear to buy you (as someone pointed
>>>>> >> >>>>> >>> out to me at the Cassandra meetup - I think it was Eric
>>>>> >> >>>>> >>> Florenzano) is that you can use different comparator types for the
>>>>> >> >>>>> >>> Super/SubColumns, I guess..? But you should be able to do the same
>>>>> >> >>>>> >>> thing by creating your own Column comparator.
>>>>> >> >>>>> >>> I guess my point is that SuperColumns are mostly a convenience
>>>>> >> >>>>> >>> mechanism, as far as I can tell.
>>>>> >> >>>>> >>> Mike
>>>>> >> >>>>> >
>>>>> >> >>>>> >
>>>>> >> >>>>>
>>>>> >> >>>>>
>>>>> >> >>>>>
>>>>> >> >>>>> --
>>>>> >> >>>>> Jonathan Ellis
>>>>> >> >>>>> Project Chair, Apache Cassandra
>>>>> >> >>>>> co-founder of Riptano, the source for professional Cassandra support
>>>>> >> >>>>> http://riptano.com
>>>>> >> >>>>
>>>>> >> >>>
>>>>> >> >>
>>>>> >> >
>>>>> >> >
>>>>> >
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> AJ Chen, PhD
>>>> Chair, Semantic Web SIG, sdforum.org
>>>> http://web2express.org
>>>> twitter @web2express
>>>> Palo Alto, CA, USA
>>>>
>>>
>>>
>>
>>
>> --
>> AJ Chen, PhD
>> Chair, Semantic Web SIG, sdforum.org
>> http://web2express.org
>> twitter @web2express
>> Palo Alto, CA, USA
>>
>
>


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
