cassandra-commits mailing list archives

From "Benedict (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage
Date Wed, 01 Apr 2015 13:46:54 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390561#comment-14390561 ]

Benedict edited comment on CASSANDRA-8670 at 4/1/15 1:46 PM:
-------------------------------------------------------------

bq. In most cases we opt to use the channel when available. ... Where we could really do better
is when using compression

You're right. I thought we had more paths left over, but really it's just compression. That's
a much cleaner state of affairs!

It looks like compression will typically stay under the stack buffer size, so we don't have
such a problem there, but it would still be nice to move to a DirectByteBuffer compressed output
stream. It should be quite viable since there are now compression methods that work over these
buffers directly, but that should probably be a follow-up ticket. Do you want to file, or shall
I?
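
Something like this, say (a sketch only, not a patch: the class is invented, and it assumes the
ByteBuffer overloads lz4-java has provided since 1.3, which let us compress between direct
buffers without any heap byte[]):

{code:java}
import java.nio.ByteBuffer;

import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;

// Sketch: compress directly between DirectByteBuffers, so no large heap
// byte[] is ever handed to NIO's thread-local direct buffer pool.
public final class DirectCompressSketch
{
    private static final LZ4Compressor COMPRESSOR = LZ4Factory.fastestInstance().fastCompressor();

    // src is a flipped DirectByteBuffer (position..limit is the payload)
    public static ByteBuffer compress(ByteBuffer src)
    {
        ByteBuffer dest = ByteBuffer.allocateDirect(COMPRESSOR.maxCompressedLength(src.remaining()));
        COMPRESSOR.compress(src, dest); // the ByteBuffer overload accepts direct buffers
        dest.flip();
        return dest;
    }
}
{code}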

bq. BufferedDataOutputStreamPlus.close() will clean the buffer even if it might not own the
buffer such as when it is provided to the constructor. It also doesn't check if the channel
is null.

We don't call close() in situations where this is a problem.

bq. When used from the commit log without a channel it will throw an NPE if it exceeds the
capacity of the buffer when it goes to flush

We presize the buffer correctly.

Still, on both these counts we could offer better safety if we wanted, without much cost.
Using a DataOutputBuffer would solve both problems, and we could assert that the
final buffer is == the provided buffer...
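
If we did want that safety, the guards themselves are cheap. Roughly (illustrative names, not
the actual class; it assumes FileUtils.clean for freeing direct buffers, which we already have):

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

import org.apache.cassandra.io.util.FileUtils;

// Sketch of the two guards: track buffer ownership so close() only cleans
// buffers we allocated, and fail with a clear error instead of an NPE when
// a flush is attempted with no channel.
public class SafeBufferedOutputSketch
{
    private final WritableByteChannel channel; // may be null when wrapping a fixed buffer
    private final ByteBuffer buffer;
    private final boolean ownsBuffer;          // false when the buffer was provided to us

    SafeBufferedOutputSketch(WritableByteChannel channel, ByteBuffer buffer, boolean ownsBuffer)
    {
        this.channel = channel;
        this.buffer = buffer;
        this.ownsBuffer = ownsBuffer;
    }

    protected void doFlush() throws IOException
    {
        if (channel == null)
            throw new IOException("buffer overflowed with no channel to flush to");
        buffer.flip();
        while (buffer.hasRemaining())
            channel.write(buffer);
        buffer.clear();
    }

    public void close() throws IOException
    {
        if (channel != null)
        {
            doFlush();
            channel.close();
        }
        if (ownsBuffer)
            FileUtils.clean(buffer); // only free direct memory we allocated ourselves
    }
}
{code}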

bq. It almost seems like there is a case for a version just for wrapping fixed size bytebuffers.

Possibly. I'm on the fence, since it's extra code-cache and class-hierarchy pollution, with
limited positive impact. We would need to duplicate the whole of BufferedDataOutputStreamPlus.
Could you benchmark to see whether there's an appreciable difference? If the difference is small
(which I expect it is, given branch prediction), I would prefer to avoid the pollution.
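
The attached OutputStreamBench.java is presumably the place for that; the shape I'd expect is
something like the rough JMH skeleton below (not the attachment itself, just the idea):

{code:java}
import java.nio.ByteBuffer;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// Rough JMH skeleton: the baseline measures raw ByteBuffer puts, and a
// second @Benchmark would drive identical writes through
// BufferedDataOutputStreamPlus wrapping the same buffer, so the delta
// isolates the shared flush/bounds-check overhead.
@State(Scope.Thread)
public class FixedBufferBench
{
    private ByteBuffer buffer;

    @Setup(Level.Iteration)
    public void setup()
    {
        buffer = ByteBuffer.allocate(1 << 20); // 1 MB, always presized large enough
    }

    @Benchmark
    public ByteBuffer rawPuts()
    {
        buffer.clear();
        for (int i = 0; i < (1 << 18); i++) // exactly fills the 1 MB buffer
            buffer.putInt(i);
        return buffer;
    }
}
{code}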


> Large columns + NIO memory pooling causes excessive direct memory usage
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-8670
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8670
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 3.0
>
>         Attachments: OutputStreamBench.java, largecolumn_test.py
>
>
> If you provide a large byte array to NIO and ask it to populate the byte array from a
socket, it will allocate a thread-local direct byte buffer the size of the requested read,
no matter how large it is. Old IO wraps new IO for sockets (but not files), so old IO is affected
as well.
> Even if you are using Buffered{Input | Output}Stream you can end up passing a large byte
array to NIO: the byte-array read method will pass the array to NIO directly if it is larger
than the internal buffer.
> Passing large cells between nodes as part of intra-cluster messaging can cause the NIO
pooled buffers to quickly reach a high watermark and stay there. This ends up costing 2x the
largest cell size, because input and output run on different threads, each with its own buffer.
This is further multiplied by the number of nodes in the cluster - 1, since each peer has
a dedicated thread pair with separate thread locals.
> Anecdotally it appears that the cost is doubled beyond that, although it isn't clear why.
Possibly the control connections, or possibly there is some way in which multiple ...
> Need a workload in CI that tests the advertised limits of cells on a cluster. It would
be reasonable to ratchet down the max direct memory for the test to trigger failures if a
memory pooling issue is introduced. I don't think we need to test concurrently pulling in
a lot of them, but it should at least work serially.
> The obvious fix for this issue would be to read in smaller chunks when dealing with large
values. I think "small" should still be relatively large (4 megabytes) so that code reading
from disk can amortize the cost of a seek. In some of the contexts where we might switch to
reading in chunks, it can be hard to tell what the underlying source being read from will be.
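
The chunked-read fix the description proposes might look roughly like this (a sketch under
assumptions: the class, helper, and CHUNK_SIZE here are invented for illustration, not the
committed change):

{code:java}
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// Sketch of the proposed fix: cap each channel read at 4 MB so the
// temporary direct buffer NIO allocates for a heap destination (and
// caches per thread) never grows to the full cell size.
public final class ChunkedReads
{
    private static final int CHUNK_SIZE = 4 << 20; // 4 MB: still large enough to amortize a seek

    public static void readFully(ReadableByteChannel channel, ByteBuffer dst) throws IOException
    {
        while (dst.hasRemaining())
        {
            int limit = dst.limit();
            // Shrink the visible window so NIO's temporary buffer is at most CHUNK_SIZE
            dst.limit(Math.min(limit, dst.position() + CHUNK_SIZE));
            int read = channel.read(dst);
            dst.limit(limit); // restore the real limit before the next iteration
            if (read < 0)
                throw new EOFException();
        }
    }
}
{code}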



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
