cassandra-user mailing list archives

From Sylvain Lebresne <sylv...@datastax.com>
Subject Re: Wide rows in CQL 3
Date Wed, 09 Jan 2013 22:14:25 GMT
> There is no "upgrade path".

I don't think that's true. The goal of the blog post you've linked is to
discuss that upgrade path (and in particular show that for the most part,
you can access your thrift data from CQL3 without any modification
whatsoever).

> You adopt CQL3's sparse tables as soon as you start creating column
> families from CQL.

That's not true: you can create non-sparse tables from CQL3 (using COMPACT
STORAGE), so you can work with both CQL3 and thrift side by side for as
long as it takes you to upgrade from thrift to CQL3. Then, for things that
you know you will only access through CQL3 (i.e. when the "upgrade is
complete"), you can start using non-compact tables and enjoy their
convenience (collections, for instance).

> There is not much backwards compatibility. CQL3 can query compact tables,
> but you may have to remove the metadata from them so they can be
> transposed.

I think "not much backwards compatibility" is a tad unfair. The only case
where you "may have to remove the metadata" is if you are using a CF in
both a static and a dynamic way. Now I can't pretend to know what every
user is doing, but from my experience and what I've seen, this is not such
a common thing: CFs are either static or dynamic in nature, not both.

I do think that for most users, upgrading from thrift to CQL3 won't require
any data migration or messing with metadata. But more importantly, things
are not completely closed. If you have *concrete* difficulties moving from
thrift to CQL3, please do share them on this mailing list and we'll try to
help you out.

> Thrift can not write into CQL tables easily, because of how the primary
> keys and column names are encoded into the key column and compact
> metadata is not equal to cql3's metadata.

To be clear, CQL3 is meant as an upgrade from thrift. Not a mandatory one:
you can stick to thrift if you don't think CQL3 is better. But if you do
decide to upgrade, you should see CQL3 non-compact tables as the new stuff,
the thing that you use post-upgrade. While you upgrade, stick to compact
tables. Once you've upgraded, you can start using the new stuff, and
whether the new stuff can be accessed the old way no longer matters.

> My biggest beefs are:
> 1) column names are UTF8 (seems wasteful in most cases)

That's largely not true, the "wasteful in most cases" part at least. A
column name in CQL3 does not always translate to an internal column name.
You can still do your time series where the internal column name is an int
and you don't waste space.

As for the static cases, yes, CQL3 forces UTF8, but I'm pretty certain
that people overwhelmingly use UTF8 or ascii in those cases. And because
CQL3 forces you to declare your column names in those static cases, we may
actually be able to optimize the size used internally for those in the
future, which is harder with thrift. So I think we actually have the
potential to make it less wasteful in most cases.

> 2) sparse empty row to ghost (seems like tiny rows with one column have
> much overhead now)

It is true that for non-compact CQL3 tables we've focused on flexibility
and on making the behavior predictable, which does add some slight space
overhead. However:
- that's why compact storage is here. There is zero overhead over thrift if
  you use compact storage. That's even why we named it like that: it's
  compact.
- we know that most of the overhead of non-compact tables can be won back
  by optimization of the storage engine. That's an advantage of having an
  API that is not too tied to the underlying storage: it gives room for
  optimizations.
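
The trade-off in the two points above can be pictured with a simplified
model of the internal cells behind one CQL3 row. This is a sketch under
assumptions, not the storage engine: it assumes a table with one
clustering column, a compact layout that stores a single cell named by the
clustering value, and a non-compact layout that stores an empty row marker
cell plus one cell per column under a (clustering, column name) composite.
The function names are made up for illustration.

```python
def compact_cells(clustering, value):
    """Assumed compact layout: one cell, named by the clustering value."""
    return [(clustering, value)]


def sparse_cells(clustering, columns):
    """Assumed non-compact layout: row marker plus one cell per column."""
    # The row marker has an empty column-name component and an empty value.
    cells = [((clustering, ""), b"")]
    for name in sorted(columns):
        cells.append(((clustering, name), columns[name]))
    return cells


compact = compact_cells(42, b"x")       # 1 internal cell
sparse = sparse_cells(42, {"v": b"x"})  # 2 internal cells: marker + "v"
```

In this model a one-column row costs one extra cell in the non-compact
layout, which is the kind of overhead a storage engine optimization could
later remove without changing the CQL3 API.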

> 3) using composites (with (compound primary keys) in some table designs)
> is wasteful. Composite adds two unsigned bytes for size and one unsigned
> byte as 0 per part.

See above.
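
For readers who want to put a number on the composite overhead described
in point 3, here is a small Python sketch. It assumes exactly the layout
stated in the quoted text (a two-byte unsigned length, the component
bytes, then a single zero byte, repeated per part) and uses hypothetical
helper names; it is not taken from the Cassandra code base.

```python
import struct


def encode_composite(parts):
    """Encode byte-string parts as: 2-byte length, bytes, 0 byte, per part."""
    out = bytearray()
    for part in parts:
        if len(part) > 0xFFFF:
            raise ValueError("component does not fit a two-byte length")
        out += struct.pack(">H", len(part))  # unsigned 16-bit, big-endian
        out += part
        out += b"\x00"                       # trailing byte set to 0
    return bytes(out)


def composite_overhead(parts):
    """Bytes added by the encoding on top of the raw component bytes."""
    return len(encode_composite(parts)) - sum(len(p) for p in parts)


encoded = encode_composite([b"ab", b"c"])
overhead = composite_overhead([b"ab", b"c"])  # 3 bytes per part, so 6
```

Under this layout the overhead is a fixed 3 bytes per component,
independent of the component size.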

> 4) many lines of code between user/request and actual disk. (tracing a CQL
> select VS a slice, young gen, etc)

If you are saying the implementation of CQL3 is more lines of code than the
thrift part, then you're probably right, but given how much more convenient
CQL3 is compared to thrift, I happily take that criticism.

But in terms of overhead, provided you use prepared statements (which you
should if you care about performance), it remains to be proven that CQL3
has more overhead than thrift. In particular in terms of garbage (since
you're citing young gen), while I haven't tested it, I'd be *really*
surprised if thrift is generating less garbage than CQL3. And in terms of
query tracing there is almost no difference whatsoever between the two.

> 5) not sure if "collections" can be used in REALLY wide row scenarios. aka
> 1,000,000 entry set?

Lists have their downsides (listed in the documentation), but sets and maps
have, in theory, no more limitations than wide rows have. They do have the
limitation that the API currently doesn't allow fetching parts of a
collection. But that will change.

That being said, and possibly more importantly, collections are *not* meant
to be very wide. They are *not* meant for wide row scenarios. CQL3 has wide
row support (in the sense of thrift) *without* collections, and for true
wide row scenarios you want to dedicate a CF to it, because that is the
right thing to do.

--
Sylvain
