cassandra-commits mailing list archives

From "Benedict (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)
Date Thu, 07 May 2015 10:02:02 GMT


Benedict commented on CASSANDRA-8630:

I'm in favour of simplifying this. Focusing on a small number of well-designed, optimised
paths for reads is the best route. I think we should also merge this functionality with "ByteBufferDataInput"
- if you look at it, you'll see that for mmapped files we're already incurring all of the CPU
overhead of constructing the int/long values. If we can tolerate that, we can equally
tolerate a check before each read for whether we need to move the buffer (so the two can share the same
implementation). In fact, this would at the same time let us eliminate the weirdness
of multiple file "segments", by having the mmap reader encapsulate that information and
keep it from leaking into the rest of the codebase. If we can merge all of our readers into approximately
one functional implementation of NIO reading, we're in a _much_ better position than we were.
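A minimal sketch of that shared-implementation idea (hypothetical names, not the actual Cassandra classes): every multi-byte read first checks whether the current buffer holds enough bytes, and only the slow-path rebuffer step differs between the buffered and mmapped backings.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: one reader shared by buffered and mmapped paths.
// Only reBuffer() differs per backing store; the hot path is one branch
// plus a bulk ByteBuffer read, with no per-byte calls.
abstract class AbstractReader {
    protected ByteBuffer buffer;

    // Refill (buffered path) or remap (mmap path) so that at least
    // 'size' bytes are available; called only on the slow path.
    protected abstract void reBuffer(int size);

    private void maybeReBuffer(int size) {
        if (buffer.remaining() < size)
            reBuffer(size);
    }

    public long readLong() {
        maybeReBuffer(Long.BYTES);   // cheap check in the common case
        return buffer.getLong();     // bulk 8-byte read
    }

    public int readInt() {
        maybeReBuffer(Integer.BYTES);
        return buffer.getInt();
    }
}

// A trivial single-segment backing, just to make the sketch runnable.
class MemoryReader extends AbstractReader {
    MemoryReader(byte[] data) { buffer = ByteBuffer.wrap(data); }
    @Override protected void reBuffer(int size) {
        throw new UnsupportedOperationException("single segment only");
    }
}
```

The mmap variant would implement reBuffer by swapping in the next mapped segment, which is what lets the segment bookkeeping stay inside the reader.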

Obviously the main complexity arises when a read spans two buffer offsets. The question then becomes
what to do: ideally we want to read from the underlying file at page boundaries (although
right now this is impossible in the common case of compression, so perhaps we shouldn't worry
too much until CASSANDRA-8896 is delivered), but we also want to allocate page-aligned buffers
(and CASSANDRA-8897 currently won't easily offer "just slightly larger than page-aligned"
buffers). So: do we have a slow path for crossing these boundaries? I don't like that
either, as it would likely slow down the common case as well.

I think the best option is to have a buffer of size min(chunk-size + one page, 2 * chunk-size).
This really requires CASSANDRA-8894, and even then probably requires an increase in the size
of our buffer pool chunks in CASSANDRA-8897, which is quite achievable but may result in a
higher watermark of memory use. We could make the default chunk size 256K (currently it is
64K), which would make it allocate _only_ page-aligned units, which would also simplify some
of its logic but require that we complicate other bits, so that we don't discard 64K because
we need a 68K allocation (i.e. we would need a queue of chunks we're currently able to allocate
from). [~stef1927]: thoughts?
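For concreteness, the sizing rule above works out as follows (a sketch; the 4K page size is an assumption, and the chunk sizes mirror the 64K/256K figures discussed):

```java
// Sketch of the proposed sizing rule: a read buffer large enough to hold
// one chunk plus the spill-over into the next page, capped at two chunks.
// PAGE_SIZE = 4K is an assumption about the platform page size.
class BufferSizing {
    static final int PAGE_SIZE = 4096;

    static int bufferSize(int chunkSize) {
        return Math.min(chunkSize + PAGE_SIZE, 2 * chunkSize);
    }
}
```

With a 64K chunk this yields the 68K (64K + 4K) allocation mentioned above; only for chunks at or below the page size does the 2x cap bind.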

> Faster sequential IO (on compaction, streaming, etc)
> ----------------------------------------------------
>                 Key: CASSANDRA-8630
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core, Tools
>            Reporter: Oleg Anastasyev
>            Assignee: Oleg Anastasyev
>              Labels: performance
>             Fix For: 3.x
>         Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png
> When a node is doing a lot of sequential IO (streaming, compacting, etc.), a lot of CPU is
> lost in calls to RAF's int read() and DataOutputStream's write(int).
> This is because the default implementations of readShort, readLong, etc., as well as their
> matching write* methods, are implemented as numerous byte-by-byte reads and writes.
> This also makes a lot of syscalls.
> A quick microbenchmark shows that just reimplementing these methods gives an 8x speed
> increase.
> The attached patch implements the read<Type> and SequentialWriter.write<Type>
> methods in a more efficient way.
> I also eliminated some extra byte copies in CompositeType.split and ColumnNameHelper.maxComponents,
> which were on my profiler's hotspot method list during tests.
> Stress tests on my laptop show that this patch makes compaction 25-30% faster on uncompressed
> sstables and 15% faster on compressed ones.
> A deployment to production shows much lower CPU load for compaction.
> (I attached a CPU load graph from one of our production nodes; orange is niced CPU load, i.e.
> compaction; yellow is user, i.e. tasks not related to compaction.)
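As a rough illustration of the overhead the ticket describes (this is not the attached patch): the generic DataInput path assembles a long from eight single-byte reads, each potentially a separate syscall on an unbuffered stream, whereas a ByteBuffer fetches it in one bulk step.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Illustration only: both methods produce the same big-endian long, but the
// first issues eight read() calls while the second is a single bulk fetch.
class ReadLongDemo {
    // What the default readLong effectively does under the hood:
    // eight byte-at-a-time reads assembled with shifts.
    static long byteByByte(InputStream in) {
        try {
            long v = 0;
            for (int i = 0; i < 8; i++)
                v = (v << 8) | (in.read() & 0xFF);
            return v;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The bulk alternative: one 8-byte fetch from the buffer.
    static long bulk(ByteBuffer buf) {
        return buf.getLong();
    }
}
```

The two produce identical results; the difference is purely in the number of calls (and potential syscalls) per value read, which is where the reported CPU savings come from.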

This message was sent by Atlassian JIRA
