cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benedict (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-8897) Remove FileCacheService, instead pooling the buffers
Date Tue, 26 May 2015 22:27:18 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559999#comment-14559999
] 

Benedict commented on CASSANDRA-8897:
-------------------------------------

bq. The cas on the buffer attachment is to protect against two threads simultaneously freeing
the same buffer. Without it we have two unit tests that trigger the assertion in free(): testMultipleThreadsReleaseDifferentBuffer()
and testMultipleThreadsReleaseSameBuffer(). Shall I remove the unit tests and accept we have
a race or restore the cas?

If we double-free, it's a bug, and should be an assertion failure, not a silent success. If
we want to guard against this, the owner should have an idempotent behaviour on its close
method, IMO.

bq. Also the chunks are not recycled right now because the owner is not null but I can add
this.

My expectation was that you would rewrite this whole method, with the tighter bound on time
spent executing. I must admit at the time (I believe I left a comment as such) I thought they
were already not being recycled in your implementation, but it looks like this was due to
another change I made to your code, so I apologise for that confusion. However the recycling
you were doing was racy, which leads onto...

bq. Your latest suggestions have reverted this:

Are you referring to the strategy of leaving the chunk in the queue and allocating from it
if we have no room up front, but there are segments left? There were a few problems with this
approach:

# The order of visitation on allocate meant we could leave a lot of memory unused
# The cost of removal was linear, and relatively expensive being a CLQ
# The number of elements we could end up with on the queue was unbounded, worsening this
# There are a number of possible race conditions that could have lead to a corrupted pool
state, which is I believe the main reason I removed it (and then forgot :)). For instance,
a thread could be executing maybeRecycle, have performed the check to confirm it should recycle,
and then suspend. The allocating thread could then alliocate a buffer, immediately recycle
it (all before the other thread wakes up), and then also execute maybeRecycle; both would
then successfully add it back to the global pool, and then two different threads can have
it simultaneously as their main allocation block. I haven't thought about if this would lead
to any really problematic scenarios, but it is certainly undesirable.

If we have a bounded number of chunks we're allocating from, we will still be able to exploit
this if any of these chunks have their buffers freed while we're still using them. We can
also in a follow up ticket introduce a special pool of buffers we are not actively allocating
from, but which we can do if we run out of room in the global pool and our usual avenue, but
I think we've probably done enough for this ticket now. 

bq. The burn test reports this sort of errors with if (index == 64) in get():

I'll have a closer look at this shortly. The fact that it's occurring in the burn test suggests
it may not be the problem, but that it may be making another problem visible

> Remove FileCacheService, instead pooling the buffers
> ----------------------------------------------------
>
>                 Key: CASSANDRA-8897
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8897
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Stefania
>             Fix For: 3.x
>
>         Attachments: 9240_test_results.txt, snapshot-1431582436640-cpu-backtraces.png,
snapshot-1431582436640-cpu-calltree-compression-8897.nps, snapshot-1431582436640-cpu-calltree-compression-trunk.nps
>
>
> After CASSANDRA-8893, a RAR will be a very lightweight object and will not need caching,
so we can eliminate this cache entirely. Instead we should have a pool of buffers that are
page-aligned.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message