Date: Mon, 24 Aug 2015 22:40:47 +0000 (UTC)
From: "Ariel Weisberg (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Comment Edited] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)

    [ https://issues.apache.org/jira/browse/CASSANDRA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710183#comment-14710183 ]

Ariel Weisberg edited comment on CASSANDRA-8630 at 8/24/15 10:40 PM:
---------------------------------------------------------------------

Good stuff, this is all I have at this point. I looked at coverage and things looked good. I think it is close.

DataInputBuffer line 25, NIODataInputStream no longer has the bytes shuffling behavior so that comment should go away.

RebufferingInputStream copy constructor appears unused (or Eclipse is lying).
It also looks suspicious since it doesn't inherit the rebuffering behavior of whatever it is copying?

Does ChecksummedDataInput handle files larger than 2 gigabytes? Seems like we could end up with large hint files? The way the file based hints loop is written it seems like it could do it. Possibly unintentionally.

The CoW idiom used for MmappedRegions seems a little off. It's making a copy on read, so every SSTableReader (they aren't shared globally, I believe) will have a separate deep copy of the entire MmappedRegions. I know this is tricky and you probably get it better than I do, but can you get it so that the same array is shared? Ideally both the arrays and the State object will be shared.

Looking at how the refcounting is supposed to work, the fact that MmappedRegions and its owning MmappedSegmentedFile are both SharedCloseables seems odd to me. Seems like only one of them needs to determine the lifetime of the whole shebang.

For rate limiting, it seems like we acquire one buffer's worth of bytes from the rate limiter at a time. What is the potential distribution of buffer sizes and how reasonable are they? It seems like they can vary with the statistics of a file. Since we got into trouble with rate limiting once, I just want to be sure there isn't a corner case where it can be a problem again.

The ChecksummedDataInput test doesn't check for failing checksums. resetCrc() and readBytes() are also not tested.

BufferedRandomAccessFileTest.testAssertionErrorWhenBytesPastMarkIsNegative failed for me.

CompressedRandomAccessReader.reBufferMmap() doesn't appear to be tested.


was (Author: aweisberg):
DataInputBuffer line 25, NIODataInputStream no longer has the bytes shuffling behavior so that comment should go away.

RebufferingInputStream copy constructor appears unused (or Eclipse is lying). It also looks suspicious since it doesn't inherit the rebuffering behavior of whatever it is copying?

Does ChecksummedDataInput handle files larger than 2 gigabytes?
Seems like we could end up with large hint files? The way the file based hints loop is written it seems like it could do it. Possibly unintentionally.

The CoW idiom used for MmappedRegions seems a little off. It's making a copy on read, so every SSTableReader (they aren't shared globally, I believe) will have a separate deep copy of the entire MmappedRegions. I know this is tricky and you probably get it better than I do, but can you get it so that the same array is shared? Ideally both the arrays and the State object will be shared.

Looking at how the refcounting is supposed to work, the fact that MmappedRegions and its owning MmappedSegmentedFile are both SharedCloseables seems odd to me. Seems like only one of them needs to determine the lifetime of the whole shebang.

For rate limiting, it seems like we acquire one buffer's worth of bytes from the rate limiter at a time. What is the potential distribution of buffer sizes and how reasonable are they? It seems like they can vary with the statistics of a file. Since we got into trouble with rate limiting once, I just want to be sure there isn't a corner case where it can be a problem again.

The ChecksummedDataInput test doesn't check for failing checksums. resetCrc() and readBytes() are also not tested.

BufferedRandomAccessFileTest.testAssertionErrorWhenBytesPastMarkIsNegative failed for me.

CompressedRandomAccessReader.reBufferMmap() doesn't appear to be tested.

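To make the "failing checksums" test gap concrete, here is a minimal sketch of what such a check could look like. It does not use ChecksummedDataInput or any Cassandra API: the class ChecksumFailureSketch, its frame() and verify() helpers, and the [payload][4-byte CRC32] layout are all hypothetical, chosen only for illustration with java.util.zip.CRC32.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical illustration only; not Cassandra's ChecksummedDataInput.
public class ChecksumFailureSketch {
    /** Frames a payload as [payload bytes][4-byte CRC32 of the payload]. */
    public static byte[] frame(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + 4)
                         .put(payload)
                         .putInt((int) crc.getValue())
                         .array();
    }

    /** Recomputes the CRC32 over the payload and compares it with the stored trailer. */
    public static boolean verify(byte[] framed) {
        int payloadLength = framed.length - 4;
        CRC32 crc = new CRC32();
        crc.update(framed, 0, payloadLength);
        int stored = ByteBuffer.wrap(framed).getInt(payloadLength);
        return stored == (int) crc.getValue();
    }

    public static void main(String[] args) {
        byte[] framed = frame("hint payload".getBytes(StandardCharsets.UTF_8));
        System.out.println(verify(framed)); // intact frame: prints true

        framed[3] ^= 0x01;                  // flip one bit in the payload
        System.out.println(verify(framed)); // corrupted frame: prints false
    }
}
```

A real test would presumably corrupt a byte in a file read through ChecksummedDataInput and assert that the read fails; the sketch only shows the positive and negative cases side by side.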
> Faster sequential IO (on compaction, streaming, etc)
> ----------------------------------------------------
>
> Key: CASSANDRA-8630
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8630
> Project: Cassandra
> Issue Type: Improvement
> Components: Core, Tools
> Reporter: Oleg Anastasyev
> Assignee: Stefania
> Labels: compaction, performance
> Fix For: 3.x
>
> Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, flight_recorder_001_files.tar.gz, flight_recorder_002_files.tar.gz, mmaped_uncomp_hotspot.png
>
>
> When a node is doing a lot of sequential IO (streaming, compacting, etc.), a lot of CPU is lost in calls to RAF's int read() and DataOutputStream's write(int).
> This is because the default implementations of readShort, readLong, etc., as well as their matching write* methods, are implemented with numerous byte-by-byte read and write calls.
> This makes a lot of syscalls as well.
> A quick microbenchmark shows that just reimplementing these methods either way gives an 8x speed increase.
> The attached patch implements the RandomAccessReader.read and SequentialWriter.write methods in a more efficient way.
> I also eliminated some extra byte copies in CompositeType.split and ColumnNameHelper.maxComponents, which were on my profiler's hotspot method list during tests.
> A stress test on my laptop shows that this patch makes compaction 25-30% faster on uncompressed sstables and 15% faster on compressed ones.
> A deployment to production shows much less CPU load for compaction.
> (I attached a CPU load graph from one of our production deployments; orange is niced CPU load, i.e. compaction; yellow is user, i.e. tasks not related to compaction.)
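
The byte-by-byte cost described in the issue can be illustrated with a small sketch. This is not the attached patch: BulkReadSketch, CountingStream, and both readLong helpers are hypothetical names, and an in-memory stream stands in for a file so that the number of read calls can be counted instead of measuring syscalls.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Hypothetical illustration only; not the code from the attached patch.
public class BulkReadSketch {
    /** In-memory stream that counts read invocations, standing in for per-call overhead. */
    public static final class CountingStream extends ByteArrayInputStream {
        public int calls = 0;

        public CountingStream(byte[] data) { super(data); }

        @Override
        public synchronized int read() { calls++; return super.read(); }

        @Override
        public synchronized int read(byte[] b, int off, int len) { calls++; return super.read(b, off, len); }
    }

    /** Byte-by-byte decoding: eight read() calls per long. */
    public static long readLongByteByByte(InputStream in) throws IOException {
        long value = 0;
        for (int i = 0; i < 8; i++) {
            int b = in.read();
            if (b < 0) throw new EOFException();
            value = (value << 8) | (b & 0xFF);
        }
        return value;
    }

    /**
     * Bulk decoding: one read into a scratch buffer, then decode from memory.
     * Assumes the stream returns all 8 bytes in one call, which holds for the
     * in-memory stream here; a general stream would need a read-fully loop.
     */
    public static long readLongBulk(InputStream in, ByteBuffer scratch) throws IOException {
        int n = in.read(scratch.array(), 0, 8);
        if (n < 8) throw new EOFException();
        return scratch.getLong(0); // big-endian by default, matching the loop above
    }

    public static void main(String[] args) throws IOException {
        byte[] data = ByteBuffer.allocate(8).putLong(0x0102030405060708L).array();

        CountingStream slow = new CountingStream(data);
        long a = readLongByteByByte(slow);

        CountingStream fast = new CountingStream(data);
        long b = readLongBulk(fast, ByteBuffer.allocate(8));

        System.out.println((a == b) + " " + slow.calls + " " + fast.calls); // prints "true 8 1"
    }
}
```

Against an unbuffered file, each of those eight calls would be a separate trip into the OS, which is the overhead the issue attributes to the default readLong-style implementations.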