From "Terje Marthinussen (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map
Date Wed, 03 Jul 2013 07:08:22 GMT


Terje Marthinussen commented on CASSANDRA-4175:


Sorry for the late update.

Yes, we have a cluster with some 20-30 billion columns (maybe even closer to 40 billion by
now) which implements a column name map and has been in production for about 2 years.

I was actually looking at committing this 2 years ago together with a fairly large number of
other changes which were implemented in the column/supercolumn serializer code, but I never
got around to implementing a good way to push the sstable version numbers into the serializer
to make things backwards compatible before focus moved resources elsewhere.

As mentioned above by others, while not benchmarked and proven, I had a very good feeling
the total change helped quite a bit on GC issues, memtables and a bit on performance in general,
but in terms of disk space, the benefit was somewhat limited after sstable compression was
implemented as the repeating column names are compressed pretty well.

This is already 2 years ago (the cluster still runs by the way), but if memory serves me right:
30-40% reduction in disk space without compression
10% reduction on top of compression (I did a test after it was implemented).

In my case, the implementation is actually hardcoded due to time constraints: a static map
which is global for the entire cassandra installation.

If committing this into cassandra, I believe my plan was split in 3,
possibly as 3 different implementation stages:

1. A simple config option (as a config file or as a columnfamily) where users themselves can
assign repeating column names. Sure, it is not as fancy as many other options, but maybe we
could open up to cover some strange corner case usages here with things like substrings as

Think options to cover complex versions of patterns like date/times such as 20130701202020
where a large chunk of the column name repeats, but not all of it.
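To illustrate the partial-match idea, here is a minimal sketch in Python. It assumes a hypothetical prefix table (`PREFIX_TO_ID`) and stores the non-repeating remainder raw. None of these names come from the actual patch, and the existing implementation only maps whole column names.

```python
# Hypothetical prefix table: the repeating leading chunk of a column name -> small id.
PREFIX_TO_ID = {b"201307": 0, b"201308": 1}
ID_TO_PREFIX = {pid: prefix for prefix, pid in PREFIX_TO_ID.items()}


def split_prefix(name: bytes):
    """Return (prefix_id, remainder); prefix_id is None when no prefix matches."""
    for prefix, pid in PREFIX_TO_ID.items():
        if name.startswith(prefix):
            return pid, name[len(prefix):]
    return None, name


def join_prefix(pid, remainder: bytes) -> bytes:
    """Rebuild the original column name from the id and the raw remainder."""
    if pid is None:
        return remainder
    return ID_TO_PREFIX[pid] + remainder
```

With this table, `20130701202020` is stored as id 0 plus the 8 remaining bytes, while a name with no known prefix is stored unchanged.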

In the current implementation, if there is a mapping entry, it converts the string to a variable
length integer which becomes the new column name. If there is no mapping entry, it stores
the raw data.
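The lookup described above can be sketched roughly as follows. This is an illustration only: the map contents, the function names and the exact varint encoding (unsigned, 7 bits per byte) are assumptions, not the code running in the cluster.

```python
# Hypothetical static map; the real entries are not shown in this thread.
NAME_TO_ID = {b"user_id": 0, b"created_at": 1, b"status": 2}
ID_TO_NAME = {col_id: name for name, col_id in NAME_TO_ID.items()}


def encode_varint(n: int) -> bytes:
    """Unsigned varint: 7 payload bits per byte, high bit set when more bytes follow."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0):
    """Return (value, position after the varint)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_name(name: bytes):
    """Return (is_mapped, payload): a varint id if mapped, else the raw name."""
    col_id = NAME_TO_ID.get(name)
    if col_id is None:
        return False, name
    return True, encode_varint(col_id)


def decode_name(is_mapped: bool, payload: bytes) -> bytes:
    """Inverse of encode_name; raw payloads pass through untouched."""
    if not is_mapped:
        return payload
    col_id, _ = decode_varint(payload)
    return ID_TO_NAME[col_id]
```

Any id below 128 fits in a single byte with this encoding, which matches the case of fewer than 40 mapped names.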

In our case, we have <40 repeating column names so I never need more than 1 byte, but the
implementation would handle more if I had more.

I modified the sstable to add a bitmap at the start of each column to be able to turn on/off
mapping entries, unused timestamps, TTLs and other things. There is a bunch of 64 bit numbers
in the column format which only hold their default value in 99.999% of all cases, and very often
your column value is just an 8 byte int, a boolean or a short text entry.

I think in 99% of the columns in this cassandra store, the column timestamp takes up more
space than the column value.
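A rough sketch of such a per-column feature bitmap is below. The flag values, field widths and field order are assumptions made for the example and do not reflect the real sstable layout or the modified one.

```python
import struct

# Hypothetical flag bits for the per-column feature bitmap.
FLAG_NAME_MAPPED = 0x01    # name payload is a map id rather than the raw name
FLAG_HAS_TIMESTAMP = 0x02  # an 8 byte timestamp follows the name
FLAG_HAS_TTL = 0x04        # a 4 byte TTL follows the timestamp


def serialize_column(name_payload, value, name_mapped=False, timestamp=None, ttl=None):
    """Write one column, emitting optional fields only when their flag is set."""
    flags = 0
    if name_mapped:
        flags |= FLAG_NAME_MAPPED
    if timestamp is not None:
        flags |= FLAG_HAS_TIMESTAMP
    if ttl is not None:
        flags |= FLAG_HAS_TTL
    out = bytearray([flags])
    out += struct.pack(">H", len(name_payload)) + name_payload
    if timestamp is not None:
        out += struct.pack(">q", timestamp)
    if ttl is not None:
        out += struct.pack(">i", ttl)
    out += struct.pack(">I", len(value)) + value
    return bytes(out)


def deserialize_column(data):
    """Read one column back; absent fields come back as None."""
    flags = data[0]
    pos = 1
    (name_len,) = struct.unpack_from(">H", data, pos)
    pos += 2
    name_payload = data[pos:pos + name_len]
    pos += name_len
    timestamp = ttl = None
    if flags & FLAG_HAS_TIMESTAMP:
        (timestamp,) = struct.unpack_from(">q", data, pos)
        pos += 8
    if flags & FLAG_HAS_TTL:
        (ttl,) = struct.unpack_from(">i", data, pos)
        pos += 4
    (value_len,) = struct.unpack_from(">I", data, pos)
    pos += 4
    value = data[pos:pos + value_len]
    return {
        "name_mapped": bool(flags & FLAG_NAME_MAPPED),
        "name_payload": name_payload,
        "timestamp": timestamp,
        "ttl": ttl,
        "value": value,
    }
```

In this toy layout a 1 byte boolean value with a 1 byte mapped name takes 9 bytes without a timestamp and 17 bytes with one, so the timestamp alone outweighs the value.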

This would have been my first implementation. Mostly because I have a working implementation
of it already and the mapping table would be very easy to move to a config file read at start
of a column family, similar to what we have for CF config. But also here, it is a bit of work to
push such config data down to the serializer as the code was organized 2 years ago.

Notice again, you do not need atomic handling of the updates to the map in any way in this
implementation. You can add map entries at any time. The result after deserializing is always
the same as column names can have a mix of raw and map id values thanks to the "column feature
bitmap" that was introduced.

2. Auto learning feature with mapping table per sstable. 
This would be stage 2 of the implementation.

When starting to create a new SSTable, build a sampling of the most frequently occurring column
names and gradually start mapping them to IDs.

Add the mapping table to the end of the SSTable or in a separate .map file (similar to index
files) at the completion of sstable generation.
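A simplified sketch of building such a per-sstable map is below. It counts names in one batch instead of learning gradually while the sstable is written, and the function name and the `max_entries` / `min_count` parameters are hypothetical.

```python
from collections import Counter


def build_sstable_map(column_names, max_entries=1000, min_count=2):
    """Assign ids 0..N-1 to the most frequent column names, most frequent first.

    Names seen fewer than min_count times get no id and would be stored raw.
    """
    counts = Counter(column_names)
    frequent = [
        name
        for name, count in counts.most_common(max_entries)
        if count >= min_count
    ]
    return {name: col_id for col_id, name in enumerate(frequent)}
```

Giving the lowest ids to the most frequent names keeps the common case within a 1 byte id if the ids are written as variable length integers.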

The initial id mapping could be further improved by maintaining a global map of column names.
This "global map" would not be used for serialization/deserialization. It would be used to
pre-populate the value for a sstable and would only be statistics to optimize things further
by reducing the number of mapping variances between sstables and reducing the number of raw
values getting stored a bit more.

The id map would still be local to each sstable in terms of storage, but having such statistics
would allow you to dramatically reduce the size of a potentially shared id cache across sstables
where a lot of mapping entries would be identical.

Some may feel that we would run out of memory quickly or use a lot of extra disk with maps
per sstable, but I guess that we only really need to deal with the top few thousand entries
in each sstable and this would not be a problem to keep in an idmap cache in terms of size.

This is really just the top X recurring column names or column name sub-patterns.

If you have more unique column entries than this in a sstable, this will probably not be the
feature that will save the day anyway as the benefit per column entry will be quite small
vs. the overhead and the entire feature should potentially disable itself automagically if
there are no frequently repeating patterns.
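One possible way to make that self-disabling decision is to compare the estimated bytes saved against the cost of storing the map. The sketch below is purely hypothetical: the cost model, the fixed `id_size` and the `min_saving_ratio` threshold are assumptions, not something taken from an existing implementation.

```python
from collections import Counter


def estimated_savings(column_names, name_to_id, id_size=1):
    """Bytes saved by replacing mapped names with ids, minus the cost of the map itself."""
    counts = Counter(column_names)
    saved = sum((len(name) - id_size) * counts[name] for name in name_to_id)
    map_overhead = sum(len(name) + id_size for name in name_to_id)
    return saved - map_overhead


def should_enable_mapping(column_names, name_to_id, min_saving_ratio=0.05):
    """Enable mapping only if it saves a meaningful share of the total name bytes."""
    total_name_bytes = sum(len(name) for name in column_names)
    if total_name_bytes == 0:
        return False
    ratio = estimated_savings(column_names, name_to_id) / total_name_bytes
    return ratio >= min_saving_ratio
```

For an sstable where one name repeats 100 times this comes out strongly in favour of mapping, while for 100 unique names the map costs more than it saves and the feature stays off.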

3. I had some ideas for moving the mapping up from the serializer to allow things like streaming
entries including id maps between nodes, but things do indeed quickly get ugly and I do not
remember clearly how I had planned to do this.

The reason I isolated the mapping function to the serializer is that it looked incredibly
messy to move this further "up" in the stack. Column sorts, range scans, lookups...

Not fun at all and if the memtable is serialized anyway the memory consumption there and in
disk cache is dramatically reduced.

Also... with a global static map here at startup time, I actually share the mapped strings
across most columns in memory anyway as I believe they all become pointers to my static
compiled-in map (again, this gets a lot more trivial to make work very well if this is a startup
config, but yes, a bit less user friendly).

I haven't looked at the cassandra code for way too long now.

Has it become easier to get to know sstable version numbers in the serializer class now?

I could maybe check if someone in the team here would like to take a stab at moving this to
latest cassandra and commit it if the above implementation seems interesting. 

Part of it should be really easy to port as long as we can get a bit more info into the serializer/deserializer.

> Reduce memory, disk space, and cpu usage with a column name/id map
> ------------------------------------------------------------------
>                 Key: CASSANDRA-4175
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>             Fix For: 2.1
> We spend a lot of memory on column names, both transiently (during reads) and more permanently
> (in the row cache).  Compression mitigates this on disk but not on the heap.
> The overhead is significant for typical small column values, e.g., ints.
> Even though we intern once we get to the memtable, this affects writes too via very high
> allocation rates in the young generation, hence more GC activity.
> Now that CQL3 provides us some guarantees that column names must be defined before they
> are inserted, we could create a map of (say) 32-bit int column id, to names, and use that
> internally right up until we return a resultset to the client.


