cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stefania (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)
Date Fri, 21 Aug 2015 09:59:47 GMT


Stefania commented on CASSANDRA-8630:

I completed the move of segments to the builder, the bulk of the implementation is {{MmappedRegions}},
which is a {{SharedCloseableImpl}} owned by the builder. Each time a {{SegmentedFile}} is
called we copy a snapshot and increment the reference count. Shall we rename {{SegmentedFile}}
and derived classes to somethign else? The compressed segmented file still owns the mmapped
regions as it creates the metadata each time and I wasn't sure how to handle this.

I also fixed a channel ownership problem in the builder, eventually we should really consider
passing the file path to the builder constructor.

The integration with 6230 was a bit harder than I anticipated and some hinting utests were
failing on Jenkins. I had to introduce a flag to force the slow path in {{RebufferingInputStream}}
or else {{ChecksummedDataInput}} would not work. My idea of updating the CRC during rebuffering
is no good because we must actually rely only on what is read by this class, not the underlying
reader, and we must extract a crc only on what's read, not necessarily buffered.

Please note the new CI links (due to renaming of the branch to [8630-3.0|]:

cstar is currently having [issues|],
as soon as it is available I will launch a comparison.

Here are the remaining CR comments:

bq. MemoryInputStream.available() can wrap the addition between buffer.remaining() + Ints.saturatedCast(memRemaining()).
Do the addition and then the saturating cast.

Fixed, thanks.

bq. Why does RandomAccessReader accept a builder and a parameter for initializing the buffer?
Seems like we lose the bonus of a builder a builder allowing a constant signature.

Good point, fixed.

bq. A nit in initializeBuffer, it does firstSegment.value().slice() which implies you want
a subset of the buffer? duplicate() makes it obvious there is no such concern.


bq. I think there is a place for unit tests stressing the 2 gigabyte boundaries. That means
testing available()/length()/remaining() style methods as well as being able to read and seek
with instances of these things that are larger than 2g. Doing it with the actual file based
ones seems bad, but maybe you could intercept those to work with memory so they run fast or
ingloriously mock their inputs.

I've added {{RandomAccessReaderTest.testVeryLarge}}, it uses a fake file channel that just
increments the position and the buffers without reading anything.

bq. For rate limiting is your current solution to consume buffer size bytes from the limiter
at a time for both mmap reads and standard? And you accomplish this by slicing the buffer
then updating the position? I don't see you setting the limit before slicing?

I've added the call to {{limit()}} in {{rebufferMmap()}}. I missed that initially. The downside
is that we are rebuffering more often than needed for mmap disk access, however we have more
accurate throttling. I've run again the same micro-benchmark and noticed no difference in

bq. I thought NIODataInputStream had methods for reading into ByteBuffers, but I was wrong.
It's kind of thorny to add one to RebufferingInputStream so I think you did the right thing
putting it in FileSegmentedInputStream even though it's an odd concern to have in that class.
Unless you have a better idea.

Shall we move {{readBytes}} from {{FileDataInput}} to {{DataInputPlus}}, at which point it
can be implemented only by {{RebufferingInputStream}}?

bq. In SSTableReader you are adding and removing fields from files. What are the cross version
compatibility issues with that?

It should be fixed now, see {{Version.hasBoundaries()}}. It's currently set to ignore boundaries
for >= 3.0.

> Faster sequential IO (on compaction, streaming, etc)
> ----------------------------------------------------
>                 Key: CASSANDRA-8630
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core, Tools
>            Reporter: Oleg Anastasyev
>            Assignee: Stefania
>              Labels: compaction, performance
>             Fix For: 3.x
>         Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, flight_recorder_001_files.tar.gz,
flight_recorder_002_files.tar.gz, mmaped_uncomp_hotspot.png
> When node is doing a lot of sequencial IO (streaming, compacting, etc) a lot of CPU is
lost in calls to RAF's int read() and DataOutputStream's write(int).
> This is because default implementations of readShort,readLong, etc as well as their matching
write* are implemented with numerous calls of byte by byte read and write. 
> This makes a lot of syscalls as well.
> A quick microbench shows than just reimplementation of these methods in either way gives
8x speed increase.
> A patch attached implements<Type> and SequencialWriter.write<Type>
methods in more efficient way.
> I also eliminated some extra byte copies in CompositeType.split and ColumnNameHelper.maxComponents,
which were on my profiler's hotspot method list during tests.
> A stress tests on my laptop show that this patch makes compaction 25-30% faster  on uncompressed
sstables and 15% faster for compressed ones.
> A deployment to production shows much less CPU load for compaction. 
> (I attached a cpu load graph from one of our production, orange is niced CPU load - i.e.
compaction; yellow is user - i.e. not compaction related tasks)

This message was sent by Atlassian JIRA

View raw message