Delivered-To: mailing list commits@cassandra.apache.org
Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Reply-To: dev@cassandra.apache.org
Date: Wed, 1 Apr 2015 14:11:53 +0000 (UTC)
From: "Ariel Weisberg (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-8670) Large columns + NIO memory pooling causes excessive direct memory usage
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/CASSANDRA-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390595#comment-14390595 ]

Ariel Weisberg commented on CASSANDRA-8670:
-------------------------------------------

On further thought, I don't think having close clean the buffer is a great idea if we allow people to supply one: a mistake could turn into a double free or a use-after-free. I think we should at least have FileUtil.clean null out the pointer to the buffer before/after cleaning (whichever is possible) so we get as immediate a failure as possible. Duplicates or slices of the buffer won't pick up that the pointer was nulled, but it's better than nothing. We should also assert that the buffer wasn't provided in the constructor, and throw an exception if someone did that and then called close.
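A minimal sketch of that ownership rule, assuming a hypothetical ManagedBuffer holder (the class name, both constructors, and the no-op clean() stand-in are illustrative only; in Cassandra the real cleaning call would be FileUtils.clean):

{code:java}
import java.nio.ByteBuffer;

// Sketch: clean only buffers this holder allocated itself, and drop the
// reference around cleaning so any later use fails as fast as possible.
final class ManagedBuffer implements AutoCloseable
{
    private ByteBuffer buffer;            // nulled on close for fail-fast behavior
    private final boolean callerSupplied;

    ManagedBuffer(ByteBuffer callerBuffer)   // caller keeps ownership and must clean it
    {
        this.buffer = callerBuffer;
        this.callerSupplied = true;
    }

    ManagedBuffer(int capacity)              // this holder owns the buffer
    {
        this.buffer = ByteBuffer.allocateDirect(capacity);
        this.callerSupplied = false;
    }

    @Override
    public void close()
    {
        if (callerSupplied)
            throw new IllegalStateException("close() must not clean a buffer supplied through the constructor");

        ByteBuffer toClean = buffer;
        buffer = null;                   // drop our pointer first so further use through this holder fails immediately
        clean(toClean);
    }

    // No-op stand-in so the sketch compiles on its own; Cassandra would call
    // org.apache.cassandra.io.util.FileUtils.clean(buffer) here to free the native memory.
    private static void clean(ByteBuffer buffer)
    {
    }
}
{code}

Duplicates or slices handed out earlier still hold their own pointers, so nulling the field only catches misuse through this holder, which is the "better than nothing" caveat above.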
> Large columns + NIO memory pooling causes excessive direct memory usage
> ------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8670
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8670
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 3.0
>
>         Attachments: OutputStreamBench.java, largecolumn_test.py
>
>
> If you provide a large byte array to NIO and ask it to populate that array from a socket, it will allocate a thread-local byte buffer the size of the requested read, no matter how large the read is. Old IO wraps new IO for sockets (but not files), so old IO is affected as well.
> Even if you are using Buffered{Input | Output}Stream you can end up passing a large byte array to NIO: the byte array read method passes the array to NIO directly if it is larger than the internal buffer.
> Passing large cells between nodes as part of intra-cluster messaging can cause the NIO pooled buffers to quickly reach a high watermark and stay there. This ends up costing 2x the largest cell size, because input and output are handled by different threads and each keeps its own buffer. This is further multiplied by the number of nodes in the cluster minus one, since each peer has a dedicated thread pair with separate thread locals.
> Anecdotally it appears that the cost is doubled beyond that, although it isn't clear why. Possibly the control connections, or possibly there is some way in which multiple
> We need a workload in CI that tests the advertised limits of cells on a cluster. It would be reasonable to ratchet down the max direct memory for the test to trigger failures if a memory pooling issue is introduced. I don't think we need to test concurrently pulling in a lot of them, but it should at least work serially.
> The obvious fix is to read in smaller chunks when dealing with large values. I think "small" should still be relatively large (4 megabytes) so that code reading from a disk can amortize the cost of a seek. It can be hard to tell what the underlying source will be in some of the contexts where we might choose to switch to reading in chunks.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
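As a rough illustration of the chunked-read fix proposed in the ticket description, here is a minimal sketch assuming a hypothetical ChunkedReads.readFully helper (the class, method name, and exact 4 MB constant are illustrative, not the committed change):

{code:java}
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Sketch: never hand the whole destination array to the underlying stream at
// once, so NIO's per-thread pooled direct buffer stays bounded at CHUNK_SIZE.
final class ChunkedReads
{
    // Large enough to amortize a disk seek, small enough to bound pooled buffers.
    private static final int CHUNK_SIZE = 4 * 1024 * 1024;

    static void readFully(InputStream in, byte[] dest) throws IOException
    {
        int offset = 0;
        while (offset < dest.length)
        {
            int wanted = Math.min(CHUNK_SIZE, dest.length - offset);
            int read = in.read(dest, offset, wanted);
            if (read < 0)
                throw new EOFException("stream ended after " + offset + " of " + dest.length + " bytes");
            offset += read;
        }
    }
}
{code}

With this shape, even a read into a very large cell's array only ever presents CHUNK_SIZE bytes to the socket or file at a time, so the thread-local buffer that NIO pools tops out at 4 MB per thread instead of growing to the largest cell size.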