cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sylvain Lebresne (JIRA)" <>
Subject [jira] [Updated] (CASSANDRA-5664) Improve serialization in the native protocol
Date Wed, 21 Aug 2013 09:36:56 GMT


Sylvain Lebresne updated CASSANDRA-5664:

    Attachment: 0002-Avoid-copy-when-compressing-native-protocol-frames.txt

Thanks for the remarks. And while I don't disagree with those, it seems to me that they all
require a bit more investigation work to see whether they are worth the trouble, and while
we should certainly do this additional work at some point, I think the patches here do improve
the cleanness of the current serialization code enough that I'd rather leave those to a follow
up and commit this now (provided we agree that the patches are already an improvement on the
status quo). So attaching rebased patches but so far they are just rebased, nothing more.

But to address the observations more precisely:

bq. There's quite a bit of string encoding and object serialization going on in some of the
encodedSize() methods. This means that strings/objects will be encoded/serialized twice.

True. For strings, while I'm far from being an expert in unicode encodings and charsets, I'm
not sure how one could compute the size of the result of UTF8 encoding without doing the actual
encoding. Of course, the current CBUtil.sizeOfString does an array allocation which in theory
could be avoided, but I'm not aware of a method that return the size of an encoded string
without encoding it (and I'm certainly not writing one manually). So I'm open to suggestions,
but I doubt this is much of a bottleneck in practice (see below for more on that) so... As
for "object serialization" (that are not strings), I'm not totally sure which ones you are
referring to exactly but I don't see anything that looks like a great waste of effort.

bq. byte[] allocation and copying in encode() should be possible to avoid when serializing
strings by careful use of ChannelBuffer.toByteBuffer(), CharBuffer.wrap() and CharsetEncoder.encode().

You're probably right, but I believe that means assuming (correct me if I'm wrong) that the
result of ChannelBuffer.toByteBuffer() *will* share content with the ChannelBuffer it is called
on, which is not guaranteed by that API. And while it is true that as of this patch we only
call CBUtil.writeString() on ChannelBuffer for which this will be true, it bothers me to rely
on it as this could easily break in a very hard to notice way in the future. Furthermore,
I'm reasonably confident that this doesn't really matter in practice (performance wise) since
if you use prepared statement (which you should if you care about performance) and with the
optimization of the v2 protocol of not returning metadata in the result set for said prepared
statement, I don't think there's any string serialization going on in the hot path.

bq. It might be worth investigating if the code duplication in encode() and encodedSize()
can be eliminated

It does is worth the investigation. But we already do similar code duplication for other serialization
code (namely IVersionedSerializer stuffs).  So given the patch is already written and since,
as said above, I'm pretty convinced this does already is better than the status quo, I'd rather
left such refactoring to a latter ticket that could consider cleaning up all existing serialization
code (though on a personal level, that particular code duplication from IVersionedSerializer
has never bothered me much in practice so I don't see fixing it a priority).

> Improve serialization in the native protocol
> --------------------------------------------
>                 Key: CASSANDRA-5664
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>            Priority: Minor
>             Fix For: 2.0
>         Attachments: 0001-Rewrite-encoding-methods.txt, 0002-Avoid-copy-when-compressing-native-protocol-frames.txt
> Message serialization in the native protocol currently make a Netty's ChannelBuffers.wrappedBuffer().
The rational was to avoid copying of the values bytes when such value are biggish. This has
a cost however, especially with lots of small values, and as suggested in CASSANDRA-5422,
this might well be a more common scenario for Cassandra, so let's consider directly serializing
in a newly allocated buffer.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:

View raw message