cassandra-user mailing list archives

From Eric Pederson
Subject Re: Bottleneck for small inserts?
Date Thu, 25 May 2017 22:44:04 GMT
Totally understood :)

I forgot to mention - I set the /proc/irq/*/smp_affinity mask to include
all of the CPUs.  Actually most of them were set that way already (for
example, 0000ffff,ffffffff) - it might be because irqbalance is running.
But for some reason the interrupts are all being handled on CPU 0 anyway.
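
For reference, a mask like the one above can be decoded into a list of CPU
numbers with a few lines of Python.  This is only a sketch: the helper name
cpus_in_mask is made up for illustration, and it assumes the usual layout of
comma-separated 32-bit hex groups with the most significant group first.

```python
def cpus_in_mask(mask: str) -> list[int]:
    """Return the CPU numbers whose bits are set in an smp_affinity-style mask."""
    # Assumption: groups are 32-bit hex words, most significant first, so
    # dropping the commas leaves one large hexadecimal number.
    value = int(mask.strip().replace(",", ""), 16)
    cpus = []
    cpu = 0
    while value:
        if value & 1:  # lowest bit set means this CPU is allowed
            cpus.append(cpu)
        value >>= 1
        cpu += 1
    return cpus


if __name__ == "__main__":
    cpus = cpus_in_mask("0000ffff,ffffffff")
    print(len(cpus), cpus[0], cpus[-1])  # prints: 48 0 47
```

So 0000ffff,ffffffff does cover CPUs 0 through 47, i.e. all 48 cores.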

I see this in /var/log/dmesg on the machines:

> Your BIOS has requested that x2apic be disabled.
> This will leave your machine vulnerable to irq-injection attacks.
> Use 'intremap=no_x2apic_optout' to override BIOS request.
> Enabled IRQ remapping in xapic mode
> x2apic not enabled, IRQ remapping is in xapic mode

In a reply to one of the comments on the article you linked, the author says:

> When IO-APIC configured to spread interrupts among all cores, it can handle
> up to eight cores. If you have more than eight cores, kernel will not
> configure IO-APIC to spread interrupts. Thus the trick I described in the
> article will not work.
> Otherwise it may be caused by buggy BIOS or even buggy hardware.

I'm not sure if either of them is relevant to my situation.
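
One way to check how the interrupts are actually distributed is to sum the
per-CPU columns of /proc/interrupts.  A rough sketch follows; the function
name is hypothetical, the parsing assumes the common layout (a header row of
CPU names, then one row per IRQ with one counter per CPU), and the sample
text in the usage part is invented, not taken from these machines.

```python
def per_cpu_interrupt_totals(text: str) -> dict[str, int]:
    """Sum interrupt counts per CPU from /proc/interrupts-style text."""
    lines = text.strip("\n").splitlines()
    if not lines:
        return {}
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    totals = {cpu: 0 for cpu in cpus}
    for line in lines[1:]:
        fields = line.split()
        counts = fields[1 : 1 + len(cpus)]
        # Rows with fewer counters than CPUs (for example ERR and MIS)
        # are skipped, as are rows whose counters are not plain numbers.
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue
        for cpu, count in zip(cpus, counts):
            totals[cpu] += int(count)
    return totals


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample = (
        "           CPU0       CPU1\n"
        " 24:    1000000          0   PCI-MSI-edge      eth0\n"
        " 25:     500000         12   PCI-MSI-edge      eth1\n"
        "ERR:          0\n"
    )
    print(per_cpu_interrupt_totals(sample))
```

On a real machine the text would come from reading /proc/interrupts; a total
that is large for CPU0 and near zero elsewhere would match what is described
above.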


-- Eric

On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad wrote:

> You shouldn't need a kernel recompile.  Check out the section "Simple
> solution for the problem" in
> smp-affinity-and-proper-interrupt-handling-in-linux.  You can balance
> your requests across up to 8 CPUs.
> I'll check out the flame graphs in a little bit - in the middle of
> something and my brain doesn't multitask well :)
> On Thu, May 25, 2017 at 1:06 PM Eric Pederson wrote:
>> Hi Jonathan -
>> It looks like these machines are configured to use CPU 0 for all I/O
>> interrupts.  I don't think I'm going to get the OK to compile a new kernel
>> for them to balance the interrupts across CPUs, but to mitigate the problem
>> I taskset the Cassandra process to run on all CPUs except 0.  It didn't
>> change the performance though.  Let me know if you think it's crucial that
>> we balance the interrupts across CPUs and I can try to lobby for a new
>> kernel.
>> Here are flamegraphs from each node from a cassandra-stress ingest into
>> a table representative of what we are going to be using.   This table
>> is also roughly 200 bytes, with 64 columns and the primary key (date,
>> sequence_number).  Cassandra-stress was run on 3 separate client
>> machines.  Using cassandra-stress to write to this table I see the same
>> thing: neither disk, CPU, nor network is fully utilized.
>>    - 2017/05/flamegraph_ultva01_sars.svg
>>    - 2017/05/flamegraph_ultva02_sars.svg
>>    - 2017/05/flamegraph_ultva03_sars.svg
>> Re: GC: In the stress run with the parameters above, two of the three
>> nodes log zero or one GCInspectors.  On the other hand, the 3rd machine
>> logs a GCInspector every 5 seconds or so, 300-500ms each time.  I found
>> out that the 3rd machine actually has different specs from the other two.
>> It's an older box with the same RAM but fewer CPUs (32 instead of 48), a
>> slower SSD and slower memory.   The Cassandra configuration is exactly the
>> same.   I tried running Cassandra with only 32 CPUs on the newer boxes to
>> see if that would cause them to GC pause more, but it didn't.
>> On a separate topic - for this cassandra-stress run I reduced the batch
>> size to 2 in order to keep the logs clean.  That also reduced the
>> throughput from around 100k rows/second to 32k rows/sec.  I've been doing
>> ingestion tests using cassandra-stress, cqlsh COPY FROM and a custom C++
>> application.  In most of the tests that I've been doing I've been using a
>> batch size of around 20 (unlogged, all batch rows have the same partition
>> key).  However, it fills the logs with batch size warnings.  I was going to
>> raise the batch warning size but the docs scared me away from doing that.
>> Given that we're using unlogged/same partition batches is it safe to raise
>> the batch size warning limit?   Actually cqlsh COPY FROM has very good
>> throughput using a small batch size, but I can't get that same throughput
>> in cassandra-stress or my C++ app with a batch size of 2.
>> Thanks!
>> -- Eric
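
As an illustration of what "unlogged, all batch rows have the same partition
key" means on the client side, the grouping can be sketched as below.  The
function name and the row shape (a tuple whose first element is the partition
key, here the date) are assumptions for the example; this is not the code of
cassandra-stress, cqlsh COPY FROM, or the C++ application.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def same_partition_batches(
    rows: Iterable[tuple], batch_size: int
) -> Iterator[list[tuple]]:
    """Yield lists of at most batch_size rows that share one partition key."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Assumption: row[0] is the partition key.
    by_partition = defaultdict(list)
    for row in rows:
        by_partition[row[0]].append(row)
    for partition_rows in by_partition.values():
        for start in range(0, len(partition_rows), batch_size):
            yield partition_rows[start : start + batch_size]


if __name__ == "__main__":
    rows = [("2017-05-25", seq) for seq in range(5)] + [("2017-05-26", 0)]
    for batch in same_partition_batches(rows, 2):
        print(batch)
```

Each yielded list would then be sent as one unlogged batch, so no batch ever
spans more than one partition.
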
>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad wrote:
>>> How many CPUs are you using for interrupts?
>>> smp-affinity-and-proper-interrupt-handling-in-linux
>>> Have you tried making a flame graph to see where Cassandra is spending
>>> its time?
>>> flame-graphs.html
>>> Are you tracking GC pauses?
>>> Jon
>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson wrote:
>>>> Hi all:
>>>> I'm new to Cassandra and I'm doing some performance testing.  One of the
>>>> things that I'm testing is ingestion throughput.   My server setup is:
>>>>    - 3 node cluster
>>>>    - SSD data (both commit log and sstables are on the same disk)
>>>>    - 64 GB RAM per server
>>>>    - 48 cores per server
>>>>    - Cassandra 3.0.11
>>>>    - 48 GB heap using G1GC
>>>>    - 1 Gbps NICs
>>>> Since I'm using SSD I've tried tuning the following (one at a time) but
>>>> none seemed to make a lot of difference:
>>>>    - concurrent_writes=384
>>>>    - memtable_flush_writers=8
>>>>    - concurrent_compactors=8
>>>> I am currently doing ingestion tests sending data from 3 clients on the
>>>> same subnet.  I am using cassandra-stress to do some ingestion testing.
>>>> The tests are using CL=ONE and RF=2.
>>>> Using cassandra-stress (3.10) I am able to saturate the disk using a
>>>> large enough column size and the standard five column cassandra-stress
>>>> schema.  For example, -col size=fixed(400) will saturate the disk and
>>>> compactions will start falling behind.
>>>> One of our main tables has a row size of approximately 200 bytes,
>>>> across 64 columns.  When ingesting this table I don't see any resource
>>>> saturation.  Disk utilization is around 10-15% per iostat.  Incoming
>>>> network traffic on the servers is around 100-300 Mbps.  CPU utilization is
>>>> around 20-70%.  nodetool tpstats shows mostly zeros with occasional
>>>> spikes around 500 in MutationStage.
>>>> The stress run does 10,000,000 inserts per client, each with a separate
>>>> range of partition IDs.  The run with 200 byte rows takes about 4 minutes,
>>>> with mean Latency 4.5ms, Total GC time of 21 secs, Avg GC time 173 ms.
>>>> The overall performance is good - around 120k rows/sec ingested.  But
>>>> I'm curious to know where the bottleneck is.  There's no resource
>>>> saturation and nodetool tpstats shows only occasional brief queueing.
>>>> Is the rest just expected latency inside of Cassandra?
>>>> Thanks,
>>>> -- Eric
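
The figures in the original message can be cross-checked with some
back-of-the-envelope arithmetic.  The sketch below takes "about 4 minutes" as
exactly 240 seconds and counts only the 200 bytes of row payload; protocol
overhead, commit log writes and compaction are ignored, and the function name
is made up for the example.

```python
def ingest_estimate(
    clients: int,
    rows_per_client: int,
    seconds: float,
    row_bytes: int,
    replication_factor: int,
    nodes: int,
) -> tuple[float, float, float]:
    """Return (rows/sec, payload Mbit/sec, replica writes/sec per node)."""
    rows_per_sec = clients * rows_per_client / seconds
    # Raw row payload only; real traffic is higher because of protocol overhead.
    payload_mbit_per_sec = rows_per_sec * row_bytes * 8 / 1e6
    # Each row is written to replication_factor replicas, spread over the nodes.
    replica_writes_per_node = rows_per_sec * replication_factor / nodes
    return rows_per_sec, payload_mbit_per_sec, replica_writes_per_node


if __name__ == "__main__":
    rows, mbit, per_node = ingest_estimate(3, 10_000_000, 240, 200, 2, 3)
    print(f"{rows:.0f} rows/s, {mbit:.0f} Mbit/s payload, "
          f"{per_node:.0f} replica writes/s per node")
```

Under those assumptions the run works out to 125,000 rows/sec, in line with
the "around 120k" reported, and about 200 Mbit/s of row payload from the
clients across the whole cluster, which is well below what three 1 Gbps NICs
can carry.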
