cassandra-commits mailing list archives

From "Benedict (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-7029) Investigate alternative transport protocols for both client and inter-server communications
Date Tue, 28 Apr 2015 14:54:08 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517133#comment-14517133 ]

Benedict commented on CASSANDRA-7029:
-------------------------------------

Something we should certainly investigate is [Intel's DPDK|https://networkbuilders.intel.com/docs/network_builders_RA_packet_processing.pdf].
If we were to implement a UDP-based protocol we could fairly easily construct the datagrams
ourselves, and only cross the JNI barrier once we have a number of UDP messages to send (or
some time interval has elapsed). Intel reckons they can push 2M messages per second (in both
directions) with packet sizes < 512 bytes, saturating the CPU, or saturating a NIC with
larger packets. So we could easily drive an entire node's communications on a single core,
with spare capacity left over on even the most performant hosts. If UDP protocols can reach
higher throughput without DPDK (plausible, given Aeron's performance as an example, and the
other papers I linked earlier), this would simply mean supporting multiple UDP transport
layers, so that users with DPDK support could see even higher throughput still.
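
To make the batching idea concrete, here is a minimal sketch of the accumulation logic: queue small
messages and only cross the native boundary once a count or age threshold is reached. It isn't tied
to DPDK at all; nativeSendBatch, MAX_BATCH and MAX_DELAY_NANOS are hypothetical stand-ins for whatever
JNI entry point and tuning knobs we would actually expose.

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;

    final class DatagramBatcher
    {
        private static final int MAX_BATCH = 64;             // flush once this many datagrams are queued
        private static final long MAX_DELAY_NANOS = 100_000;  // ...or once the oldest datagram is ~100 micros old

        private final List<ByteBuffer> pending = new ArrayList<>(MAX_BATCH);
        private long oldestEnqueuedAt = -1;

        // hypothetical JNI entry point: one crossing sends the whole batch
        private static native void nativeSendBatch(ByteBuffer[] datagrams);

        void enqueue(ByteBuffer datagram)
        {
            if (pending.isEmpty())
                oldestEnqueuedAt = System.nanoTime();
            pending.add(datagram);
            maybeFlush();
        }

        void maybeFlush()
        {
            if (pending.isEmpty())
                return;
            boolean full = pending.size() >= MAX_BATCH;
            boolean stale = System.nanoTime() - oldestEnqueuedAt >= MAX_DELAY_NANOS;
            if (full || stale)
            {
                nativeSendBatch(pending.toArray(new ByteBuffer[0]));
                pending.clear();
            }
        }
    }

A real version would also need maybeFlush() driven from the event loop or a timer, so a lone message
doesn't sit in the batch until the next enqueue.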

If we could get a single isolated, high-performance, low-overhead network transport, we
might well open up avenues for a more constrained "thread per core" model as well.
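
On the receive side, here is a rough sketch of what a single-threaded, busy-spinning transport loop
might look like, using plain NIO DatagramChannel purely as a stand-in for whatever transport we end
up with (the SpinningReceiveLoop / process() names are illustrative, and the ~100 micro spin window
echoes point 4 of the description below):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.SocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.DatagramChannel;
    import java.util.concurrent.locks.LockSupport;

    final class SpinningReceiveLoop implements Runnable
    {
        private static final long SPIN_NANOS = 100_000; // keep spinning ~100 micros after the last packet

        private final DatagramChannel channel;
        private final ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);

        SpinningReceiveLoop(int port) throws IOException
        {
            channel = DatagramChannel.open();
            channel.configureBlocking(false);
            channel.bind(new InetSocketAddress(port));
        }

        public void run()
        {
            long lastReceipt = System.nanoTime();
            while (!Thread.currentThread().isInterrupted())
            {
                buffer.clear();
                SocketAddress from;
                try
                {
                    from = channel.receive(buffer); // non-blocking: returns null if nothing is ready
                }
                catch (IOException e)
                {
                    break;
                }

                if (from != null)
                {
                    buffer.flip();
                    process(buffer);                // handle the message on this same thread
                    lastReceipt = System.nanoTime();
                }
                else if (System.nanoTime() - lastReceipt > SPIN_NANOS)
                {
                    // quiet for ~100 micros: back off briefly instead of burning the core
                    LockSupport.parkNanos(1_000);
                }
                // otherwise: keep busy-spinning, expecting another packet imminently
            }
        }

        private void process(ByteBuffer message)
        {
            // application logic; in C* this is where inter-server/client messages would be dispatched
        }
    }

The same loop could own the send side too, which is what makes the "one core for all of a node's
networking" idea plausible.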

> Investigate alternative transport protocols for both client and inter-server communications
> -------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7029
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7029
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>              Labels: performance
>             Fix For: 3.0
>
>
> There are a number of reasons to think we can do better than TCP for our communications:
> 1) We can actually tolerate sporadic small message losses, so guaranteed delivery isn't
> essential (although for larger messages it probably is)
> 2) As shown in \[1\] and \[2\], Linux can behave quite suboptimally with regard to TCP
> message delivery when the system is under load. Judging from the theoretical description,
> this is likely to apply even when the system load is not high but the number of processes
> to schedule is high. Cassandra generally has a lot of threads to schedule, so this is quite
> pertinent for us. UDP performs substantially better here.
> 3) Even when the system is not under load, UDP has a lower CPU burden, and that burden
> is constant regardless of the number of connections it processes.
> 4) In a simple benchmark on my local PC, using non-blocking IO for UDP and busy-spinning
> on IO, I can actually push 20-40% more throughput through loopback (where TCP should be
> optimal, as there is no latency), even for very small messages. Since we can see networking
> taking multiple CPUs' worth of time during a stress test, busy-spinning for ~100 micros after
> the last message receipt is almost certainly acceptable, especially as we can (ultimately)
> process inter-server and client communications on the same thread/socket in this model.
> 5) We can optimise the threading model heavily: since we generally process very small
> messages (200 bytes is not at all implausible), the thread signalling costs on the processing
> thread can actually dramatically impede throughput. In general it costs ~10 micros to signal
> (and passing the message to another thread for processing in the current model requires
> signalling). For 200-byte messages this caps our per-thread throughput at around 20MB/s
> (200 bytes every 10 micros).
> I propose to knock up a highly naive UDP-based connection protocol with super-trivial
> congestion control over the course of a few days, with the only initial goal being maximum
> possible performance (not fairness, reliability, or anything else), and to trial it in Netty
> (possibly making some changes to Netty to mitigate thread signalling costs). The reason for
> knocking up our own here is to establish a ceiling on the potential of this approach. Assuming
> this pans out with performance gains in C* proper, we would then look at contributing to, or
> forking, the udt-java project and see how easy it is to bring its performance in line with
> what we get from our naive approach (I don't suggest starting there, as the project is using
> blocking old IO, modifying it with latency in mind may be challenging, and we won't know
> for sure what the best-case scenario is).
> \[1\] http://test-docdb.fnal.gov/0016/001648/002/Potential%20Performance%20Bottleneck%20in%20Linux%20TCP.PDF
> \[2\] http://cd-docdb.fnal.gov/cgi-bin/RetrieveFile?docid=1968;filename=Performance%20Analysis%20of%20Linux%20Networking%20-%20Packet%20Receiving%20(Official).pdf;version=2
> Further related reading:
> http://public.dhe.ibm.com/software/commerce/doc/mft/cdunix/41/UDTWhitepaper.pdf
> https://mospace.umsystem.edu/xmlui/bitstream/handle/10355/14482/ChoiUndPerTcp.pdf?sequence=1
> https://access.redhat.com/site/documentation/en-US/JBoss_Enterprise_Web_Platform/5/html/Administration_And_Configuration_Guide/jgroups-perf-udpbuffer.html
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.153.3762&rep=rep1&type=pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
