cassandra-commits mailing list archives

From "Robert Stupp (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map
Date Fri, 18 Jul 2014 16:42:07 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066498#comment-14066498 ]

Robert Stupp commented on CASSANDRA-4175:
-----------------------------------------

My five cents ;) Sorry if I repeat some things, I didn't read everything...

Such an enum/map of _column-id_ to _column-name_ should also include UDT field names.

The id generator for the _column-id_ could be per-keyspace (maybe something like a _next-column-id_
field per keyspace).
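
To make the idea concrete, here is a minimal sketch of a per-keyspace id generator with a two-way name/id map. All names in it ({{KeyspaceColumnIds}}, {{idFor}}, {{nameFor}}) are hypothetical and not existing Cassandra classes. It only covers a single process; in a real cluster the counter and the mapping would presumably have to be propagated with the schema. UDT field names could be registered through the same call.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: one instance per keyspace, not an existing Cassandra class.
final class KeyspaceColumnIds {
    // Plays the role of the "next-column-id" field of the keyspace.
    private final AtomicInteger nextColumnId = new AtomicInteger(0);
    private final Map<String, Integer> idByName = new ConcurrentHashMap<>();
    private final Map<Integer, String> nameById = new ConcurrentHashMap<>();

    // Returns the id already assigned to the name, or assigns the next free id.
    int idFor(String name) {
        return idByName.computeIfAbsent(name, n -> {
            int id = nextColumnId.getAndIncrement();
            nameById.put(id, n); // keep the reverse mapping in step
            return id;
        });
    }

    // Returns the name registered for the id, or null if the id is unknown.
    String nameFor(int id) {
        return nameById.get(id);
    }
}
```

Ids are handed out in registration order, and asking for the same name twice returns the same id.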

I guess a typical column name is 10-15 chars long.
So the savings on heap and off-heap are worth implementing that enum/map: a typical
column name {{String}} of that length occupies about 60 bytes on heap, an {{int}} just 4.
It also takes pressure off the GC.

Savings could also occur on the wire (between nodes), in the commit log and in data files.
If the _column-id_ is globally unique per keyspace, sstable files remain portable between
nodes (are they portable today?).
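
As a rough illustration of the serialized-size difference, the sketch below compares a name written as a length-prefixed UTF-8 string with a fixed 4-byte id. The 2-byte length prefix is an assumption made for the example, not a statement about Cassandra's actual wire or sstable format, and {{NameVsIdSize}} is a hypothetical helper.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for a back-of-the-envelope size comparison.
final class NameVsIdSize {
    // Assumed encoding: 2-byte length prefix followed by the UTF-8 bytes of the name.
    static int encodedNameSize(String name) {
        return 2 + name.getBytes(StandardCharsets.UTF_8).length;
    }

    // A column id stored as a plain 32-bit int.
    static int encodedIdSize() {
        return Integer.BYTES;
    }
}
```

Under that assumption a 12-character ASCII name such as {{created_date}} takes 14 bytes per occurrence, against 4 bytes for the id. Compression narrows the gap on disk.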

It might also save bandwidth when serializing result sets back to the client (provided all
clients know about that id-name mapping).

> Reduce memory, disk space, and cpu usage with a column name/id map
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-4175
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Jason Brown
>              Labels: performance
>             Fix For: 3.0
>
>
> We spend a lot of memory on column names, both transiently (during reads) and more permanently
> (in the row cache).  Compression mitigates this on disk but not on the heap.
> The overhead is significant for typical small column values, e.g., ints.
> Even though we intern once we get to the memtable, this affects writes too via very high
> allocation rates in the young generation, hence more GC activity.
> Now that CQL3 provides us some guarantees that column names must be defined before they
> are inserted, we could create a map of (say) 32-bit int column id, to names, and use that
> internally right up until we return a resultset to the client.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
