qpid-users mailing list archives

From Carl Trieloff <cctriel...@redhat.com>
Subject Re: Flow control behavior of fanout exchange
Date Thu, 05 Nov 2009 13:31:47 GMT


The issue is that on some machines with high core counts and multiple sockets, a few things can go wrong. It
starts with some of the value-add features the hardware implements using SMIs (System Management Interrupts).
These are hardware interrupts that stop the CPUs, then load some code into the CPU to do
management work: ECC checks, power management (green computing), and so on. The bad side is that they 'stop' all the CPUs
on the machine. We have measured SMIs of up to 1-2 ms on some machines. My employer has worked with quite a few
hardware suppliers to certify a set of machines (removing SMIs for realtime use). Note that in many cases the
SMIs don't impact applications; in Java, for example, the effects of the GC are larger.
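Stalls like these can be observed from user space with a tight polling loop, similar in spirit to latency-detection tools: on an otherwise idle machine, any large gap between consecutive clock reads hints at an SMI or scheduler preemption. A minimal sketch (the function name and duration are illustrative, not part of any Qpid tooling):

```python
import time

def max_gap_ns(duration_ns=50_000_000):
    """Spin for roughly duration_ns and return the largest gap (in ns)
    observed between consecutive clock reads.  On an idle, pinned core,
    millisecond-scale outliers suggest SMIs or preemption."""
    start = time.perf_counter_ns()
    prev = start
    worst = 0
    while True:
        now = time.perf_counter_ns()
        gap = now - prev
        if gap > worst:
            worst = gap
        prev = now
        if now - start >= duration_ns:
            return worst

if __name__ == "__main__":
    print(f"largest stall observed: {max_gap_ns()} ns")
```

On a busy or virtualized host this measures scheduling noise as much as firmware behavior, so treat the numbers as indicative only.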

A few other things go on as well. NUMA is on a per-socket basis: if you run
a multi-socket machine with a high core count
and the CPU load is not high enough for the scheduler to keep CPU
locality, then expensive
memory accesses and cache effects come into play, along with less
effective locking. If you are on RHEL 5.4 I can
provide some settings that give you NUMA-aware memory allocation,
which can increase throughput by up
to 75% and improve latency by about 25% on NUMA machines.
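One common way to get that locality without special settings is to pin the broker's CPUs and memory to a single NUMA node with numactl. A sketch (the node number and thread count below are illustrative; check your own topology first):

```shell
# Inspect the NUMA topology (node numbering varies by machine).
numactl --hardware

# Bind the broker's threads and memory allocations to one node so the
# threads stay close to the memory they touch.
numactl --cpunodebind=0 --membind=0 qpidd --worker-threads 4
```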

Thus the quick experiment of setting worker-threads equal to the number of cores
on one socket increases the CPU load a little for
those threads and lowers the probability of being scheduled off-core.
This then 'removes some of the hardware effects'.
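The cores-per-socket count can be derived from /proc/cpuinfo by counting distinct (physical id, core id) pairs, which also folds hyperthread siblings into one core. A sketch, with hypothetical helper names (not part of Qpid):

```python
def cores_per_socket(cpuinfo_text):
    """Map each physical id (socket) to its physical core count, parsed
    from /proc/cpuinfo-style text.  Hyperthread siblings share a
    (physical id, core id) pair and are counted once."""
    counts = {}
    seen = set()
    for stanza in cpuinfo_text.strip().split("\n\n"):
        fields = dict(
            (k.strip(), v.strip())
            for k, _, v in (line.partition(":") for line in stanza.splitlines())
            if k.strip()
        )
        key = (fields.get("physical id", "0"), fields.get("core id", "0"))
        if key not in seen:
            seen.add(key)
            counts[key[0]] = counts.get(key[0], 0) + 1
    return counts

def suggested_worker_threads(cpuinfo_text):
    """Worker-thread count for the one-socket experiment described above."""
    per_socket = cores_per_socket(cpuinfo_text)
    return min(per_socket.values()) if per_socket else 1
```

The resulting number would then be passed to the broker, e.g. qpidd --worker-threads 4 on a machine with four cores per socket.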

Obviously, if it is run on an SMI-free or SMI-reprofiled machine (the
fast ones you noted have few or no SMIs),
and numactl and things like cpuspeed are set, then the more powerful
machine will beat the slower one. But in this case the
faster machine is getting in its own way.


Mike D.. wrote:
> Hi,
> This "matched flow" behavior is quite interesting and luckily we have not
> experienced it when prototyping on our developer machines.
> Would you mind explain a bit Carl why this would happen and what's your
> suggestion to user of QPID? Soon we will test the proof of concept on our
> servers as well. How can we have QPID utilizing both CPUs (8 processors)?
> thanks,
> mike
> Carl Trieloff wrote:
>> I mailed you the deck directly.
>> Carl.
>> Andy Li wrote:
>>> Carl,
>>> Yes, reducing the number of worker threads from #cores + 1 to 4 did 
>>> switch the data center machines to behavior (1). Looks like you've 
>>> diagnosed the issue!
>>> Unfortunately, I couldn't find a copy of your talk at HPC anywhere on 
>>> the web. It's listed on their website, but no files posted.
>>> Thanks,
>>> Andy
>>>     ok, I think I know what might be going on, I believe it is
>>>     hardware related -- Take a look at the
>>>     presentation I did with Lee Fisher at HPC on Wall Street.
>>>     Anyway, try the following and let's see if we can alter the
>>>     behaviour. On the data centre machines
>>>     run qpidd with --worker-threads 4
>>>     If that alters the results I'll expand my theory and how to
>>>     resolve the hardware side.
>>>     Carl.
