cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-9708) Serialize ClusteringPrefixes in batches
Date Mon, 13 Jul 2015 09:50:06 GMT


Sylvain Lebresne commented on CASSANDRA-9708:

bq. if not slightly leaning in favour of retaining the lack of limit

Let's retain it here then, I have similar leanings. If we care about protecting users from doing
the wrong thing, it's easy enough to add a warning at table creation time. And if it's a warning,
we can set its threshold much lower than 32.

bq.  from a testing POV, we can test serialization in isolation with 33+, but kind of difficult
to do full extensive testing with that

I'd be totally fine with simple unit tests for this. We can do more extensive testing the
day we have nothing more useful to test, but something tells me that day won't come very soon.

> Serialize ClusteringPrefixes in batches
> ---------------------------------------
>                 Key: CASSANDRA-9708
>                 URL:
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>             Fix For: 3.0.0 rc1
> Typically we will have very few clustering prefixes to serialize, however in theory they
> are not constrained (or are they, just to a very large number?). Currently we encode a fat
> header for all values up front (two bits per value), however those bits will typically be
> zero, and typically we will have only a handful (perhaps 1 or 2) of values.
> This patch modifies the encoding to batch the prefixes in groups of up to 32, along with
> a header that is vint encoded. Typically this will result in a single byte per batch, but
> will consume up to 9 bytes if some of the values have their flags set. If we have more than
> 32 columns, we just read another header. This means we incur no garbage, and compress the
> data on disk in many cases where we have more than 4 clustering components.
> I do wonder if we shouldn't impose a limit on clustering columns, though: If you have
> more than a handful merge performance is going to disintegrate. 32 is probably well in excess
> of what we should be seeing in the wild anyway.
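
For illustration only, the batching described above could be sketched roughly as follows. This is
not the code from the patch: the class and method names are hypothetical, the meaning of the two
flag bits (01 = empty, 10 = null) is an assumption, and the size formula assumes a vint whose first
byte carries a length prefix, so that 7 payload bits fit in one byte and a full 64-bit header needs 9.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and flag layout are assumptions, not the patch's API.
final class ClusteringBatchHeaderSketch {
    // One header covers at most 32 values, at two bits per value (64 bits total).
    static final int BATCH_SIZE = 32;

    // Builds the header for values[offset, limit).
    // Assumed flags: 00 = regular value, 01 = empty value, 10 = null.
    static long makeHeader(byte[][] values, int offset, int limit) {
        long header = 0L;
        for (int i = offset; i < limit; i++) {
            byte[] v = values[i];
            long flags = (v == null) ? 2L : (v.length == 0 ? 1L : 0L);
            header |= flags << ((i - offset) * 2);
        }
        return header;
    }

    // Bytes needed by the assumed length-prefixed vint: 1 byte for up to
    // 7 significant bits, growing to 9 bytes for a full 64-bit value.
    static int unsignedVIntSize(long value) {
        int leadingZeros = Long.numberOfLeadingZeros(value | 1);
        return (639 - leadingZeros * 9) >> 6;
    }

    // One header per batch; a 33rd value simply starts a second header.
    static List<Long> headers(byte[][] values) {
        List<Long> out = new ArrayList<>();
        for (int offset = 0; offset < values.length; offset += BATCH_SIZE) {
            int limit = Math.min(values.length, offset + BATCH_SIZE);
            out.add(makeHeader(values, offset, limit));
        }
        return out;
    }
}
```

Under these assumptions, a prefix with a couple of ordinary values yields a zero header that fits in
one byte, while a flag set on the 32nd value of a batch pushes that header to the full 9 bytes.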

This message was sent by Atlassian JIRA
