cassandra-commits mailing list archives

From "Terje Marthinussen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map
Date Wed, 03 Jul 2013 07:34:21 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698710#comment-13698710 ]

Terje Marthinussen commented on CASSANDRA-4175:
-----------------------------------------------

I should maybe add that 1 and 2 above do not exclude each other but rather complement each other.

#1 is a manual map and could allow things like a prefix map, such as '$201212', which would map
all names with that prefix to an id.
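
To make the manual prefix map in #1 more concrete, here is a minimal Python sketch. It is not from the ticket or from Cassandra's code: the class name ManualColumnMap, its methods, and the reading of a leading '$' as "this entry is a prefix" are all assumptions made for illustration.

```python
class ManualColumnMap:
    """Hypothetical manual map from column names (or name prefixes) to int ids."""

    def __init__(self):
        self._exact = {}     # full column name -> id
        self._prefixes = {}  # name prefix -> id
        self._by_id = {}     # id -> mapped text (full name or prefix)
        self._next_id = 0

    def add(self, entry):
        # Assumption: a leading '$' marks a prefix entry, e.g. '$201212'.
        is_prefix = entry.startswith("$")
        text = entry[1:] if is_prefix else entry
        table = self._prefixes if is_prefix else self._exact
        if text in table:
            return table[text]
        col_id = self._next_id
        self._next_id += 1
        table[text] = col_id
        self._by_id[col_id] = text
        return col_id

    def encode(self, name):
        """Return (id, remainder), or (None, name) if nothing matches."""
        if name in self._exact:
            return self._exact[name], ""
        # Prefer the longest matching prefix.
        best = None
        for prefix in self._prefixes:
            if name.startswith(prefix) and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is None:
            return None, name
        return self._prefixes[best], name[len(best):]

    def decode(self, col_id, remainder):
        """Rebuild the original column name from (id, remainder)."""
        if col_id is None:
            return remainder
        return self._by_id[col_id] + remainder
```

With a '$201212' entry, a name such as '20121231_clicks' would be stored as the id plus the remainder '31_clicks', and decode() restores the full name before it is returned to the client.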

#2 is an auto map. It may require #1 if we want to let users give "hints" for
substring maps, such as '$(201\d\d\d)', to map all year+month-like strings starting with 201 to
a mapping entry. This would only be a hint: sampling of the number of entries should decide
what actually gets mapped, to avoid running out of memory.
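
A rough Python sketch of how the auto map in #2 could combine hints with sampling. Everything here is hypothetical: the class AutoColumnMap, the max_entries and min_count parameters, and the choice to pass the hint as a plain regular expression without the leading '$' are assumptions, not something the ticket specifies.

```python
import re
from collections import Counter


class AutoColumnMap:
    """Hypothetical auto map: samples names, then maps only the frequent ones."""

    def __init__(self, hints=(), max_entries=1000, min_count=2):
        # Each hint is a regex with one capture group, e.g. r"(201\d\d\d)".
        self._hints = [re.compile(h) for h in hints]
        self._counts = Counter()
        self._max_entries = max_entries  # cap on map size, to bound memory
        self._min_count = min_count      # ignore names seen too rarely

    def _candidate(self, name):
        # A hint only proposes a candidate key; it does not force a mapping.
        for hint in self._hints:
            match = hint.match(name)
            if match:
                return match.group(1)
        return name

    def sample(self, name):
        self._counts[self._candidate(name)] += 1

    def build(self):
        """Assign ids to the most frequently sampled candidates only."""
        frequent = [
            key
            for key, count in self._counts.most_common(self._max_entries)
            if count >= self._min_count
        ]
        return {key: col_id for col_id, key in enumerate(frequent)}
```

The hint decides which substring is counted, while the sampled counts and the size cap decide what ends up in the map.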

I am a bit unsure whether advanced features like substring maps would ever be used; maybe they
should only be implemented separately, as some sort of substring detection.

As this can be a bit processing-intensive, substring statistics (top substrings) could be
detected and maintained node-wide during compaction and given as hints to the serializer later.
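
As a sketch of what such substring statistics might look like, the following Python function counts shared prefixes over a batch of column names and ranks them by an estimate of characters saved. Restricting the search to prefixes, the function name top_prefixes, its parameters, and the scoring formula are all simplifying assumptions made here; real detection during compaction could look quite different.

```python
from collections import Counter


def top_prefixes(names, min_len=4, max_len=8, top_k=3):
    """Return up to top_k shared prefixes, best estimated savings first."""
    counts = Counter()
    for name in names:
        # Count every prefix of the name within the allowed length range.
        for length in range(min_len, min(max_len, len(name)) + 1):
            counts[name[:length]] += 1
    # Rough savings estimate: each occurrence after the first saves len(prefix).
    scored = {p: (n - 1) * len(p) for p, n in counts.items() if n > 1}
    # Sort by savings descending, then alphabetically for a stable order.
    return sorted(scored, key=lambda p: (-scored[p], p))[:top_k]
```

The returned prefixes could then be offered to the serializer as hints, in the same way as user-supplied ones.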

                
> Reduce memory, disk space, and cpu usage with a column name/id map
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-4175
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>             Fix For: 2.1
>
>
> We spend a lot of memory on column names, both transiently (during reads) and more permanently
> (in the row cache).  Compression mitigates this on disk but not on the heap.
> The overhead is significant for typical small column values, e.g., ints.
> Even though we intern once we get to the memtable, this affects writes too via very high
> allocation rates in the young generation, hence more GC activity.
> Now that CQL3 provides us some guarantees that column names must be defined before they
> are inserted, we could create a map of (say) 32-bit int column id, to names, and use that
> internally right up until we return a resultset to the client.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
