cassandra-commits mailing list archives

From "Jay Patel (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-7882) Memtable slab allocation should scale logarithmically to improve occupancy rate
Date Wed, 24 Sep 2014 10:10:34 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146132#comment-14146132 ]

Jay Patel edited comment on CASSANDRA-7882 at 9/24/14 10:09 AM:
----------------------------------------------------------------

Hey Benedict, 
I've attached the first cut. Please review.

Below are some code changes and design choices/trade-offs. 

* Wait-free region scaling and allocations:

** Instead of one global queue of 1 MB race-allocated regions, there is now a set of global
queues, one for each region size (8K, 16K, ..., 1 MB). All queues are global (not per memtable)
so memtables across all tables can reuse the race-allocated regions. Race-allocated regions
are never cleaned during memtable flushes.

** The thread that wins the race to set the new region as the current region also scales the
region size (if it's not already at the max). This avoids the need for extra synchronization
to scale the region size atomically. (A rough sketch of the data layout follows the next bullet.)

* Region size per memtable:
Moved the region size to be per memtable instead of global. From what I understand of the code,
each memtable creates its own NativeAllocator object, so keeping the region size as a member
variable of NativeAllocator makes it per memtable. Please let me know if that is not the case
and I'll fix it accordingly.
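
To make the layout concrete, here's a rough sketch of the data structures (names and shapes are illustrative only, simplified from what's in the attached patch):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

// Sketch only -- simplified, illustrative names; not the code in the attached patch.
public class RegionScalingSketch
{
    static final int MIN_REGION_SIZE = 8 * 1024;
    static final int MAX_REGION_SIZE = 1024 * 1024;
    static final int SCALE_FACTOR = 2;

    // Minimal stand-in for the off-heap region; the real one wraps native memory.
    static final class Region
    {
        final int capacity;
        Region(int capacity) { this.capacity = capacity; }
    }

    // Global (static): one queue of race-allocated regions per size (8K, 16K, ..., 1MB),
    // shared by every memtable so losing regions get reused instead of freed, and they
    // are never cleaned during memtable flushes.
    static final Map<Integer, ConcurrentLinkedQueue<Region>> RACE_ALLOCATED = new ConcurrentHashMap<>();
    static
    {
        for (int size = MIN_REGION_SIZE; size <= MAX_REGION_SIZE; size *= SCALE_FACTOR)
            RACE_ALLOCATED.put(size, new ConcurrentLinkedQueue<Region>());
    }

    // Per memtable: each memtable has its own NativeAllocator, so plain member variables
    // give us a region size (and a current region) per memtable.
    volatile int regionSize = MIN_REGION_SIZE;
    final AtomicReference<Region> currentRegion = new AtomicReference<>();
}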

I don't think the points below are real issues, but I want to share them in case you see any problems:

* In the race to allocate and set the current region, there is, in the extreme case, a slight
chance of allocating the next region with the same size (instead of 2x). Consider the
following:
Thread 1: allocates a 16K region but has not yet reached the CAS for the current region.
Thread 2: allocates 16K and does the CAS for the current region. The current region gets filled up and
set back to null by the allocate() method.
Thread 1: reaches the CAS. This now sets the current region to the 16K region instead of 32K.
This is a corner case, and even if it happens there is no harm: the next allocation
will be 64K directly, to catch up, since we never miss scaling the region size.
To further guard against this, you'll see the following check in the code just before the CAS:
if (region.capacity == regionSize * SCALE_FACTOR || region.capacity == MAX_REGION_SIZE)
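
For context, here's roughly how I picture that guard sitting relative to the CAS, written as a method that would slot into the sketch class above (again illustrative; the actual patch may order things differently):

// Sketch only: try to install newRegion as the current region; returns true if we won.
boolean trySwapCurrentRegion(Region newRegion)
{
    // Guard just before the CAS: only install a region of the next scaled-up size
    // (or the max). This closes most of the 16K-instead-of-32K window described above.
    if (newRegion.capacity == regionSize * SCALE_FACTOR || newRegion.capacity == MAX_REGION_SIZE)
    {
        if (currentRegion.compareAndSet(null, newRegion))
        {
            // The CAS winner also scales the per-memtable region size, so scaling is
            // never missed and any stale install is caught up by the next allocation.
            regionSize = Math.min(regionSize * SCALE_FACTOR, MAX_REGION_SIZE);
            return true;
        }
    }
    // Lost the race (or the size is stale): park the region in the global queue for its
    // capacity so any memtable can reuse it later.
    RACE_ALLOCATED.get(newRegion.capacity).add(newRegion);
    return false;
}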

* Unslabbed allocation happens only after we hit the 1 MB region size. Until then, a huge payload
quickly grows the region size and allocates a new region. I think this behavior is good, but a
slight side effect is that we may end up with a few partially filled (or unfilled?) regions
before we scale up to the right region size for the given payload. One option I thought of
is to keep an unslabbed-allocation threshold and count for each region size, in addition to the
1 MB region size. For instance, with an 8K region, anything beyond 4K would be unslabbed, and if
we see 1000 (the threshold) unslabbed allocations, we increase the region size. I'm not too
excited about this, though, since a flush may happen before that and reset the region size
anyway. I don't see much issue with leaving it as is for now, but let me know if you can think of
a better way to address this; a rough sketch of the idea follows.
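
If we did want to try it, the per-size threshold idea could look something like this (purely hypothetical, NOT in the attached patch; it would also slot into the sketch class above, plus an AtomicInteger import):

// Hypothetical: count unslabbed allocations at the current region size and scale up
// once the count passes a threshold, instead of waiting for the 1 MB limit.
static final int UNSLAB_THRESHOLD = 1000;
final AtomicInteger unslabbedAtCurrentSize = new AtomicInteger();

void onUnslabbedAllocation(int payloadSize)
{
    // e.g. with an 8K region, anything beyond 4K (half the region) goes unslabbed
    if (payloadSize > regionSize / 2
        && unslabbedAtCurrentSize.incrementAndGet() >= UNSLAB_THRESHOLD
        && regionSize < MAX_REGION_SIZE)
    {
        regionSize = Math.min(regionSize * SCALE_FACTOR, MAX_REGION_SIZE);
        unslabbedAtCurrentSize.set(0);
    }
}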

FYI, the line below will print how region allocation works if you want to test. I did a quick
test with 1 to 100 static tables and payload sizes from 100 bytes to 2 KB. In a week or two, I'm
planning to try it out with tens of thousands of tables, including longevity tests.

logger.info("{} size region allocated in {}", regionSize, this);

This change takes care of off-heap objects only. For the other slab allocator (on-heap?), I'm
not sure whether region scaling makes sense.

TODO: Convert the multiplication to bit shifting. Change logger.info to logger.trace. Plus any
refactoring or other suggestions you have.
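
For the shift conversion, it's just the mechanical rewrite for a power-of-two SCALE_FACTOR, e.g.:

// Equivalent when SCALE_FACTOR == 2: multiply by two becomes a left shift by one.
regionSize = Math.min(regionSize * SCALE_FACTOR, MAX_REGION_SIZE);
regionSize = Math.min(regionSize << 1, MAX_REGION_SIZE);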



> Memtable slab allocation should scale logarithmically to improve occupancy rate
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7882
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7882
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jay Patel
>            Assignee: Jay Patel
>              Labels: performance
>             Fix For: 2.1.1
>
>         Attachments: trunk-7882.txt
>
>
> CASSANDRA-5935 allows an option to disable region-based allocation for on-heap memtables,
> but there is no option to disable it for off-heap memtables (memtable_allocation_type: offheap_objects).

> Disabling region-based allocation will allow us to pack more tables into the schema, since a
> minimum 1 MB region won't be allocated per table. The downside can be more fragmentation,
> which should be controllable by using a better allocator like JEMalloc.
> How about the below option in the yaml?
> memtable_allocation_type: unslabbed_offheap_objects
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
