cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Brown (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-1632) Thread workflow and cpu affinity
Date Wed, 20 Nov 2013 23:37:36 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828301#comment-13828301
] 

Jason Brown commented on CASSANDRA-1632:
----------------------------------------

OK, so after two months of trying to get thread affinity to rock the world, I have to admit
that I can't get better performance than what we currently get with kernel scheduler. Details
of code/testing follow, but my hunch as to why I didn't see a boost, and in most cases, saw
degradation, lie in with:

- we use many, many threads (over 1000 on some of Netflix's production servers, currently
running c* 1.1), more than than the count of processors. Most studies/literature I found about
thread affinity used a thread count less than or equal to the cpu count, so not much help
there.
- a given request with be passed across several threads/pools, and in order to get the best
cpu cache coherency for that data (which, I believe, would be the best goal for this work),
it's path through the code base should be on the same processor (for L1/L2 cache hits) or
on the same core (hopefully hit L3 cache). I briefly thought about working this in, but the
heavy use of statics and singletons make this a tricky and daunting task as I would need to
either 'kill the damn singletons' (thx, @driftx :)) or carry around some complicated thread
state in thread local-like structures.

First off, using Peter Lawrey's thread affinity (https://github.com/peter-lawrey/Java-Thread-Affinity)
was reasonably easy to use, and was able to easily pin threads to processors. I then created
several cpu affinity strategies:

- EqualSpreadCpuAffinityStrategy -  Intended to be used with an AbstractExecutorService (ThreadPoolExecutor
or ForkJoinPool), this implementation assigns each successively created thread within a pool
to the next sequential CPU, thus attempting to spread the threads around equally amongst the
CPUs. As we have several AbstractExecutorService instances inside of cassandra, this implementation
picks a random CPU to start with as this will avoid overloading the lower numbered CPUs.
- FirstAssignmentCpuAffinityStrategy - pins a thread to the first cpu the kernel assinged
it to.
- NoCpuAffinityStrategy - NOP, used for comparison vs. others, but mainly used for comparing
vs. trunk.

I applied these strategies (one at a time) to the various ThreadPoolExecutors we have (via
overloading the NamedThreadFactory). After many variations, I ended up just applying the thread
affinity to several key places, including OTC, ITC, READ & MUTATE stages, as well as the
native protocol's RequestThreadPoolExecutor and TDisruptorServer (you can see my hacked up
version of the disruptor-thrift lib at https://github.com/jasobrown/disruptor_thrift_server/tree/thread_affinity).

One thing I discovered was that by isolating the CPUs that handle IRQ (exp. disk and network
IO) from cassandra, I did get a modest bump in throughput (~5%). As it depends on the kernel
and the OS's configuration as to whether the disk/network IRQ is pinned to one (or more) specific
CPU, this is a little difficult to abstract out. Anecdotally, the ec2 instaces that i used
for testing always assigned cpu0 for blkio (disk) and cpu1 for network (eth0). Spot checking
other Netflix instances of different instance type and even different kernel version, showed
that the IRQ distribution was not consistent across our nodes (very similar, but not the same).
Thus, we could create a spin-off ticket to have cassandra isolate itself from those cpus,
I think that work should be explored outside this ticket.

The remants of all my coding on thread affinity can be found here https://github.com/jasobrown/cassandra/tree/1632_threadAffinity

Testing:

env: ec2, three nodes in us-west-1
m2.4xlarge, 8 processors, 68G RAM
Linux 3.2.0.52 (Ubunut 12.10 LTS)
cassandra - 8Gb head, 800 new gen
traffic-generating application: @belliottsmith's improved cassandra-stress (https://github.com/belliottsmith/cassandra/tree/iss-6199-stress,
47a96c1f5557f)


Compared with trunk (during early-mid Nov 2013), I found the performance of EqualSpreadCpuAffinityStrategy
to be about 10-20% worse (throughtput and latency). The FirstAssignmentCpuAffinityStrategy
was more or less on par with trunk. 

I found that while thread affinity reduced the CPU migrations of threads (as measured by 'perf
stat -p $pid') by an order of magnitude, there was no appreciable effeciency gain to cassandra
as a whole. As I don't have access to the PMU on ec2 instances (for obtaining metrics like
L1/2/3 cache hit ratio from perf), I could not measure if the thread affinity code actually
made the entire process more efficient. Either way, the latency and throughput performance
metrics obtained at the client (cassandra stress) did not bear out an overall improvement
in cassandra.

If we all we can do at best is match the kernel's scheduler, I feel confortable with putting
the thread affinity for now, unless anybody finds something I've missed or misunderstood.


> Thread workflow and cpu affinity
> --------------------------------
>
>                 Key: CASSANDRA-1632
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1632
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Chris Goffinet
>            Assignee: Jason Brown
>              Labels: performance
>         Attachments: threadAff_reads.txt, threadAff_writes.txt
>
>
> Here are some thoughts I wanted to write down, we need to run some serious benchmarks
to see the benefits:
> 1) All thread pools for our stages use a shared queue per stage. For some stages we could
move to a model where each thread has its own queue. This would reduce lock contention on
the shared queue. This workload only suits the stages that have no variance, else you run
into thread starvation. Some stages that this might work: ROW-MUTATION.
> 2) Set cpu affinity for each thread in each stage. If we can pin threads to specific
cores, and control the workflow of a message from Thrift down to each stage, we should see
improvements on reducing L1 cache misses. We would need to build a JNI extension (to set cpu
affinity), as I could not find anywhere in JDK where it was exposed. 
> 3) Batching the delivery of requests across stage boundaries. Peter Schuller hasn't looked
deep enough yet into the JDK, but he thinks there may be significant improvements to be had
there. Especially in high-throughput situations. If on each consumption you were to consume
everything in the queue, rather than implying a synchronization point in between each request.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Mime
View raw message