cassandra-user mailing list archives

From Eric Stevens <migh...@gmail.com>
Subject Re: Re: Dynamic Columns
Date Mon, 26 Jan 2015 14:37:52 GMT
> are you really recommending I throw 4 years of work out and completely
> rewrite code that works and has been tested?

Our codebase was about 3 years old, and we finished migrating it to CQL not
that long ago.  It can definitely be frustrating to have to touch stable
code to modernize it.  Our design allowed us to focus on one or two tables
at a time.  We were able to do drop-in replacements for each DAO that
presented the same outward interface (though the DAOs each had to be
rewritten wholesale).

Critically, our business logic did not need to change at all to support the
new paradigm, so we could have great confidence that the change was
minimally disruptive.  Even our DAO unit tests mostly only needed updating
to preload test data in a different format.

Something I didn't anticipate was that our changelog for the project
included twice as many deletes as adds - our overall code complexity went
down, both in number of lines of code, and also as measured in terms of
cyclomatic complexity.  Mean time to feature completion has been reduced as
well (which we can measure quite directly since for a little while we're
also maintaining parallel development of a legacy version that uses Thrift
exclusively).

> Could I fit a square peg into a round hole? Yes, but does that make any
> sense?

I get your point, though I've always struggled with this particular idiom.
It was actually a design philosophy in fine woodworking in early America,
used as a way to make an especially strong joint without fasteners.  It
was popular with the Shakers, who had a reputation for producing items of
the highest quality.

I'd suggest fasteners come in many shapes and sizes and techniques.
Sometimes it's a peg, or a screw, or a rivet, sometimes it's dovetails, and
sometimes it's fingers.  You're definitely right, Thrift and CQL are
dramatically different shapes.  I'm *certain* there are situations where
one or the other makes it easier to reason about or solve a particular
problem.  A different interface is not necessarily better or worse.

There are some pretty compelling ways CQL beats Thrift.  Reduced
application complexity (as I've observed in our case) is a pretty
compelling one.  It's also pretty awesome that new features don't
necessarily require updates to existing client libraries.  You also get a
much more consistent application layer interaction across languages and
drivers, so you can more easily engage in multilanguage projects with less
of the context switching that comes from remembering the nuances of *this*
driver over *that* driver, and so on.  The native protocol backing CQL
also offers significant parallelism and performance gains over Thrift's
interface.
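
To make the clustering-column layout discussed in the quoted thread below a
bit more concrete, here is a minimal sketch of what such a table might look
like.  It is only an illustration under assumptions, not Peter's actual
schema or ours: the table name, the column names, the sample values, and the
choice of `version` as a clustering column are all hypothetical, and static
columns need a Cassandra release that supports them (2.0.6 or later).

```cql
-- Hypothetical layout: one partition per temporal entity.
CREATE TABLE temporal_entity (
    entity_key    blob,              -- partition key: the entity's single unique key
    created       timestamp static,  -- static: stored once per partition
    version_count int static,        -- static: stored once per partition
    version       int,               -- clustering column: one group of rows per version
    field_name    text,              -- clustering column: plays the "dynamic column name" role
    field_value   blob,              -- serialized value, opaque to Cassandra
    PRIMARY KEY (entity_key, version, field_name)
);

-- A partition may hold only static metadata and no dynamic fields at all.
INSERT INTO temporal_entity (entity_key, version_count)
VALUES (0x0001, 12);

-- Add one projected field for version 10 of the entity.
INSERT INTO temporal_entity (entity_key, version, field_name, field_value)
VALUES (0x0001, 10, 'customer.address.city', 0xcafe);

-- Read the whole entity, all versions, by partition key alone.
SELECT * FROM temporal_entity WHERE entity_key = 0x0001;

-- Read a single version, e.g. as one side of a diff against another version.
SELECT field_name, field_value FROM temporal_entity
WHERE entity_key = 0x0001 AND version = 10;
```

Because every row shares the same partition key, all versions of an entity
live in one partition on the same replicas, and other tables can still
reference the entity by `entity_key` alone.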


On Thu, Jan 22, 2015 at 6:36 AM, Peter Lin <woolfel@gmail.com> wrote:

> @jack thanks for taking time to respond. I agree I could totally redesign
> and rewrite it to fit in the newer CQL3 model, but are you really
> recommending I throw 4 years of work out and completely rewrite code that
> works and has been tested?
>
> Ignoring the practical aspects for now and exploring the topic a bit
> further. Since not everyone has spent 5+ years designing and building
> temporal databases, it's probably good to go over some fundamental theory
> at the risk of boring the hell out of everyone.
>
> 1. a temporal record ideally should have 1 unique key for the entire life.
> I've done other approaches in the past with composite keys in RDBMS and it
> sucks. Could it work? Yes, but I already know from first hand experience
> how much pain that causes when temporal records need to be moved around, or
> when you need to audit the data for lawsuits.
> 2. the life of a temporal record may have n versions and no two versions
> are guaranteed to be identical in structure and definitely not in content
> 3. the temporal metadata about each entity like version, previous version,
> branch, previous branch, create date, last transaction and version counter
> are required for each record. Those are the only required static columns
> and they are managed by the framework. Users aren't supposed to manually
> screw with those metadata columns, but obviously they could by going into
> cqlsh.
> 4. at any time, import and export of a single temporal record with all
> versions and metadata could occur, so optimal storage and design is
> critical. for example, if someone wants to copy or move 100,000 records
> from one cluster to another cluster and retain the history.
> 5. the user can query for all or any number of versions of a temporal
> record by version number or branch. For example, it's common to get version
> 10 and do a comparison against version 12. It's just like doing a diff in
> SVN, git, cvs, but for business data. Unlike text files, a diff on business
> records is a bit more complex, since it's an object graph.
> 6. versioning and branching of temporal records relies on business logic,
> which can be the stock algorithm or defined by the user
> 7. Saving and retrieving data has to be predictable and quick. This means
> ideally all versions are in the same row and on the same node. Pre CQL and
> composite keys, storing data in different rows meant it could be on
> different nodes. Thankfully with composite keys, Cassandra will use the
> first column as the partition key.
>
> In terms of adding dynamic_column_name as part of the composite key, that
> isn't ideal in my use case for several reasons.
>
> 1. a record might not have any dynamic columns at all. The user decides
> this. The only thing the framework requires is a unique key that doesn't
> collide. If the user chooses their own key instead of a UUID, the system
> checks for collision before saving a new record.
>
> 2. we use dynamic columns to provide projections, aka views of a temporal
> entity. This means we can extract fields nested deep in the graph and store
> it as a dynamic column to avoid reading the entire object. Unlike other
> kinds of use cases of dynamic column, the column name and value will vary.
> I know it's popular to use dynamic columns to store time series data like
> user click stream, but that has the same type.
>
> 3. we allow the user to index secondary columns, but on read we always use
> the value in the object. We also integrated solr to give us more advanced
> indexing features.
>
> 4. we provide an object API to make temporal queries easy. It's
> modeled/inspired by JPA/Hibernate. We could have invented another text
> query language or tried to use tsql2, but an object API feels more
> intuitive to me.
>
> Could I fit a square peg into a round hole? Yes, but does that make any
> sense? If I was building a whole new temporal database from scratch, I
> might do things differently. CQL3 wasn't available back in 2008/2009, so I
> couldn't have used it. Aside from all of that, an object API is more
> natural for temporal databases. The model really is an object graph and not
> separate database tables stitched together. Any change to any part of the
> record requires versioning it and handling it correctly. Having built
> temporal databases on RDBMS, using SQL meant building a general purpose
> object API to make things easier. This is due to the need to be database
> agnostic, so we couldn't use the object API that is available in some
> databases. Hopefully that helps provide context and details. I don't expect
> people to have a deep understanding of temporal database from my ramblings,
> given it took me over 8 years to learn all of this stuff.
>
>
> On Thu, Jan 22, 2015 at 12:51 AM, Jack Krupansky <jack.krupansky@gmail.com>
> wrote:
>
>> Peter,
>>
>> At least from your description, the proposed use of the clustering column
>> name seems at first blush to fully fit the bill. The point is not that the
>> resulting clustered primary key is used to reference an object, but that a
>> SELECT on the partition key references the entire object, which will be a
>> sequence of CQL3 rows in a partition, and then the clustering column key is
>> added when you wish to access that specific aspect of the object. What's
>> missing? Again, just store the partition key to reference the full object -
>> no pollution required!
>>
>> And please note that any number of clustering columns can be specified,
>> so more structured "dynamic columns" can be supported. For example, you
>> could have a timestamp as a separate clustering column to maintain temporal
>> state of the database. The partition key can also be structured from
>> multiple columns as a composite partition key as well.
>>
>> As far as all these static columns, consider them optional and merely an
>> optimization. If you wish to have a 100% opaque object model, you wouldn't
>> have any static columns and the only non-primary key column would be the
>> blob value field. Every object attribute would be specified using another
>> clustering column name and blob value. Presto, everything you need for a
>> pure, opaque, fully-generalized object management system - all with just
>> CQL3. Maybe we should include such an example in the doc and with the
>> project to more strongly emphasize this capability to fully model
>> arbitrarily complex object structures - including temporal structures.
>>
>> Anything else missing?
>>
>> As a general proposition, you can use the term "clustering column" in
>> CQL3 wherever you might have used "dynamic column" in Thrift. The point in
>> CQL3 is not to eliminate a useful feature, dynamic column, but to repackage
>> the feature to make a lot more sense for the vast majority of use cases.
>> Maybe there are some cases that don't fit exactly as well as desired, but
>> feel free to specifically identify such cases so that we can elaborate how
>> we think they are covered or at least covered well enough for most users.
>>
>>
>> -- Jack Krupansky
>>
>> On Wed, Jan 21, 2015 at 12:19 PM, Peter Lin <woolfel@gmail.com> wrote:
>>
>>>
>>> the example you provided does not work for my use case.
>>>
>>>   CREATE TABLE t (
>>>     key blob,
>>>     my_static_column_1 int static,
>>>     my_static_column_2 float static,
>>>     my_static_column_3 blob static,
>>>     ....,
>>>     dynamic_column_name blob,
>>>     dynamic_column_value blob,
>>>     PRIMARY KEY (key, dynamic_column_name)
>>>   );
>>>
>>> the dynamic column can't be part of the primary key. The temporal entity
>>> key can be the default UUID or the user can choose the field in their
>>> object. Within our framework, we have the concept of temporal links between
>>> one or more temporal entities. Polluting the primary key with the dynamic
>>> column wouldn't work.
>>>
>>> Please excuse the confusing RDB comparison. My point is that Cassandra's
>>> dynamic column feature is the "unique" feature that makes it better than
>>> traditional RDB or newSql like VoltDB for building temporal databases. With
>>> databases that require static schema + alter table for managing schema
>>> evolution, it makes it harder and results in down time.
>>>
>>> One of the challenges of data management over time is evolving the data
>>> model and making queries simple. If the record is 5 years old, it probably
>>> has a different schema than a record inserted this week. With temporal
>>> databases, every update is an insert, so it's a little bit more complex
>>> than just "use a blob". There's a whole level of complication with temporal
>>> data, and CQL3 custom types aren't clear to me. I've read the CQL3
>>> documentation on the custom types several times and it is rather poor. It
>>> gives me the impression there's still work needed to get custom types in
>>> good shape.
>>>
>>> With regard to examples others have told me, your advice is fair. A few
>>> minutes with google and some blogs should pop up. The reason I bring these
>>> things up isn't to put down CQL. It's because I care and want to help
>>> improve Cassandra by sharing my experience. I consistently recommend new
>>> users learn and understand both Thrift and CQL.
>>>
>>>
>>>
>>> On Wed, Jan 21, 2015 at 11:45 AM, Sylvain Lebresne <sylvain@datastax.com>
>>> wrote:
>>>
>>>> On Wed, Jan 21, 2015 at 4:44 PM, Peter Lin <woolfel@gmail.com> wrote:
>>>>
>>>>> I don't remember other people's examples in detail due to my shitty
>>>>> memory, so I'd rather not misquote.
>>>>>
>>>>
>>>> Fair enough, but maybe you shouldn't use "people's examples you don't
>>>> remember" as an argument then. Those examples might be wrong or outdated and
>>>> that kind of stuff creates confusion for everyone.
>>>>
>>>>
>>>>>
>>>>> In my case, I mix static and dynamic columns in a single column family
>>>>> with primitives and objects. The objects are temporal object graphs with a
>>>>> known type. Doing this type of stuff is basically transparent for me, since
>>>>> I'm using thrift and our data modeler generates helper classes. Our tooling
>>>>> seamlessly converts the bytes back to the target object. We have a few
>>>>> standard static columns related to temporal metadata. At any time, dynamic
>>>>> columns can be added and they can be primitives or objects.
>>>>>
>>>>
>>>> I don't see anything in that that cannot be done with CQL. You can mix
>>>> static and dynamic columns in CQL thanks to static columns. More precisely,
>>>> you can do what you're describing with a table looking a bit like this:
>>>>   CREATE TABLE t (
>>>>     key blob,
>>>>     my_static_column_1 int static,
>>>>     my_static_column_2 float static,
>>>>     my_static_column_3 blob static,
>>>>     ....,
>>>>     dynamic_column_name blob,
>>>>     dynamic_column_value blob,
>>>>     PRIMARY KEY (key, dynamic_column_name)
>>>>   );
>>>>
>>>> And your helper classes will serialize your objects as they probably do
>>>> today (if you use a custom comparator, you can do that too). And let it be
>>>> clear that I'm not pretending that doing it this way is tremendously
>>>> simpler than thrift. But I'm saying that 1) it's possible and 2) while it's
>>>> not meaningfully simpler than thrift, it's not really harder either (and
>>>> in fact, it's actually less verbose with CQL than with raw thrift).
>>>>
>>>>
>>>>>
>>>>> For the record, doing this kind of stuff in a relational database
>>>>> sucks horribly.
>>>>>
>>>>
>>>> I don't know what that has to do with CQL to be honest. If you're doing
>>>> relational with CQL you're doing it wrong. And please note that I'm not
>>>> saying CQL is the perfect API for modeling temporal data. But I don't get
>>>> how thrift, which is a very crude API, is a much better API at that than CQL
>>>> (or, again, how it allows you to do things you can't with CQL).
>>>>
>>>> --
>>>> Sylvain
>>>>
>>>
>>>
>>
>
