qpid-users mailing list archives

From Carl Trieloff <cctriel...@redhat.com>
Subject Re: worker thread with qpidd
Date Wed, 03 Jun 2009 13:26:06 GMT


Generally, if you set the number of threads larger than the core count,
your performance will go down, as expected. However, the option is there
so that the thread count can be adjusted if you pin the process to fewer
cores than the machine has.

On a 2 core machine with client on the same machine, there are not that 
many options, as the client
and broker will contend for the resources on the machine.

My employer did a report with HP. It is too big to mail out to the
list, but here is some of the basic setup that was done for it.


regards
Carl.


      Throughput (Perftest)

For throughput, perftest is used to drive the broker for this benchmark. 
This harness is able to start up multiple producers and consumers in 
balanced (n:n) or unbalanced configurations (x:y).


What the test does:

    * creates a control queue
    * starts x:y producers and consumers
    * waits for all processes to signal that they are ready
    * controller records a timestamp
    * producers reliably enqueue messages onto the broker as fast as
      they can
    * consumers reliably dequeue messages from the broker as fast as
      they can
    * once the last message, which is marked, is received, the
      controller is signaled
    * controller waits for all completion signals, records a timestamp
      and calculates the rate

The throughput is then calculated as the total number of messages
reliably transferred divided by the time taken to transfer those messages.
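
As a rough illustration of that calculation (not code from perftest
itself), the rate can be derived from the two controller timestamps.
The function name and the sample figures are hypothetical.

```python
def throughput(total_messages, start_ts, end_ts):
    """Return messages per second between two timestamps given in seconds."""
    elapsed = end_ts - start_ts
    if elapsed <= 0:
        raise ValueError("end timestamp must be later than start timestamp")
    return total_messages / elapsed


# Hypothetical figures, not measurements from the report.
print(throughput(400000, 10.0, 12.0))  # 200000.0
```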


      Latency (Latencytest)

For latency, latencytest is used to drive the broker for this benchmark.
This harness is able to produce messages at a specified rate, or a
specified number of messages, that are timestamped, sent to the broker,
and looped back to the client node. The client reports the minimum,
maximum, and average time for each reporting interval when a rate is
used, or for all the messages sent when a count is used.
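
The summary the client reports can be sketched as below. This is only an
illustration of the min/max/average reduction, assuming the per-message
times have already been collected into a list; the function name and the
sample values are made up.

```python
def summarize(latencies_ms):
    """Return (minimum, maximum, average) for a non-empty list of times."""
    if not latencies_ms:
        raise ValueError("no samples in this interval")
    average = sum(latencies_ms) / len(latencies_ms)
    return min(latencies_ms), max(latencies_ms), average


# Hypothetical samples in milliseconds.
print(summarize([0.5, 1.5, 1.0]))  # (0.5, 1.5, 1.0)
```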


    Tuning & Parameter Settings

For the testing in this paper the systems were not used for any other
purposes. Therefore, the configuration and tuning that is detailed here
should be reviewed when other applications run alongside MRG Messaging.


      Processes

For the testing performed the following were disabled (unless specified 
otherwise):


SELinux

cpuspeed

irqbalance

haldaemon

yum-updatesd

smartd

setroubleshoot

sendmail

rpcgssd

rpcidmapd

rpcsvcgssd

rhnsd

pcscd

mdmonitor

mcstrans

kdump

isdn

iptables

ip6tables

hplip

hidd

gpm

cups

bluetooth

avahi-daemon

restorecond

auditd


      SysCtl

The following kernel parameters were added to /etc/sysctl.conf.

net.ipv4.conf.default.arp_filter,
net.ipv4.conf.all.arp_filter
      Value: 1
      Only respond to ARP requests on matching interface.

net.core.rmem_max,
net.core.wmem_max
      Value: 8388608
      Maximum receive/send socket buffer size in bytes.

net.core.rmem_default,
net.core.wmem_default
      Value: 262144
      Default setting of the socket receive/send buffer in bytes.

net.ipv4.tcp_rmem,
net.ipv4.tcp_wmem
      Value: 65536 4194304 8388608
      Vector of 3 integers: min, default, max
      min - minimal size of receive/send buffer used by TCP sockets
      default - default size of receive/send buffer used by TCP sockets
      max - maximal size of receive/send buffer allowed for
      automatically selected buffers for TCP sockets

net.core.netdev_max_backlog
      Value: 10000
      Maximum number of packets queued on the input side when the
      interface receives packets faster than the kernel can process
      them. Applies to non-NAPI devices only.

net.ipv4.tcp_window_scaling
      Value: 0
      Window scaling as defined in RFC1323; a value of 0 disables it.

net.ipv4.tcp_mem
      Value: 262144 4194304 8388608
      Vector of 3 integers: low, pressure, high
      low - below this number of pages TCP is not bothered about its
      memory appetite.
      pressure - when the amount of memory allocated by TCP exceeds
      this number of pages, TCP moderates its memory consumption and
      enters memory pressure mode, which is exited when memory
      consumption falls under "low".
      high - number of pages allowed for queueing by all TCP sockets.


/*Table 1*/
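
For convenience, the values in Table 1 can be turned into lines for
/etc/sysctl.conf. The snippet below is just a sketch that renders the
"key = value" form; the helper name is hypothetical, and the values are
the ones from the report, tuned for dedicated test machines.

```python
# Values copied from Table 1; review them before using on a shared system.
SETTINGS = {
    "net.ipv4.conf.default.arp_filter": "1",
    "net.ipv4.conf.all.arp_filter": "1",
    "net.core.rmem_max": "8388608",
    "net.core.wmem_max": "8388608",
    "net.core.rmem_default": "262144",
    "net.core.wmem_default": "262144",
    "net.ipv4.tcp_rmem": "65536 4194304 8388608",
    "net.ipv4.tcp_wmem": "65536 4194304 8388608",
    "net.core.netdev_max_backlog": "10000",
    "net.ipv4.tcp_window_scaling": "0",
    "net.ipv4.tcp_mem": "262144 4194304 8388608",
}


def render_sysctl(settings):
    """Render a mapping as sysctl.conf lines, one 'key = value' per line."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items())


print(render_sysctl(SETTINGS))
```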


      ethtool

Some of the options ethtool allows the operator to change relate to 
coalesce and offload settings. However, during experimentation only 
changing the ring settings had a noticeable effect for throughput testing.

# ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 256
RX Mini: 0
RX Jumbo: 0
TX: 256

# ethtool -G eth1 rx 2048 tx 2048
# ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 2048
RX Mini: 0
RX Jumbo: 0
TX: 2048

#


      ifconfig

ifconfig was used to increase the /maximum transfer unit/ (MTU) to 
support jumbo frames and to increase /txqueuelen/ for throughput testing 
when these changes had a noticeable effect.

# ifconfig eth1
eth1 Link encap:Ethernet HWaddr 00:18:71:EC:02:80
inet addr:192.168.15.96 Bcast:192.168.15.255 Mask:255.255.255.0
inet6 addr: fe80::218:71ff:feec:280/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:8 errors:0 dropped:0 overruns:0 frame:0
TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:480 (480.0 b) TX bytes:594 (594.0 b)
Memory:fdee0000-fdf00000

# ifconfig eth1 mtu 9000 txqueuelen 2000
# ifconfig eth1
eth1 Link encap:Ethernet HWaddr 00:18:71:EC:02:80
inet addr:192.168.15.96 Bcast:192.168.15.255 Mask:255.255.255.0
inet6 addr: fe80::218:71ff:feec:280/64 Scope:Link
UP BROADCAST MULTICAST MTU:9000 Metric:1
RX packets:8 errors:0 dropped:0 overruns:0 frame:0
TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:2000
RX bytes:480 (480.0 b) TX bytes:594 (594.0 b)
Memory:fdee0000-fdf00000

#


      CPU affinity

For latency testing, all interrupts from the cores of one CPU socket 
were reassigned to other cores. The interrupts for the interconnect 
under test were assigned to cores of this vacated socket. The processes 
related to the interconnect (e.g. ib_mad, ipoib) were then scheduled to 
run on the vacated cores. The Qpid daemon was also scheduled to run on 
these or a subset of the vacated cores. How latencytest was scheduled 
was determined by the results of experiments limiting or not limiting 
the latencytest test process to certain cores.


Experiments with perftest show that the best performance was usually
achieved with the affinity settings left as they were after a boot,
without any manipulation.


Interrupts can be directed to be handled by specific cores.
/proc/interrupts can be queried to identify the interrupts for devices
and the number of times each CPU/core has handled each interrupt. For
each interrupt, a file named /proc/irq/<IRQ #>/smp_affinity contains a
hexadecimal mask which controls which cores can respond to that
interrupt. The contents of these files can be queried or set.
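
To make the mask format concrete, here is a small sketch that builds the
hexadecimal value for a list of core numbers (bit N set means core N may
handle the interrupt). The helper name is hypothetical, and it only
covers the simple single-word case; it does not write to /proc.

```python
def smp_affinity_mask(cores):
    """Return the hex mask string with one bit set per listed core."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core numbers must be non-negative")
        mask |= 1 << core
    return format(mask, "x")


# Cores 4-7, e.g. the cores of a second quad-core socket (an assumption).
print(smp_affinity_mask([4, 5, 6, 7]))  # f0
```

The resulting string is what would be written, as root, to the
smp_affinity file of the interrupt in question.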


Processes can be restricted to run on a set of CPUs/cores. taskset can
be used to define the list of CPUs/cores that a process can be scheduled
to execute on.
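
The same kind of restriction can be applied from a program. The sketch
below uses Python's os.sched_setaffinity, which is available on Linux
only; it is an illustration of the idea and not how the report's tests
were set up. The helper name is hypothetical.

```python
import os


def pin_to_cores(cores, pid=0):
    """Restrict process `pid` (0 means the calling process) to `cores`.

    Returns the affinity set that is in effect afterwards.
    """
    os.sched_setaffinity(pid, set(cores))
    return os.sched_getaffinity(pid)


# Pin the current process to the first core it is already allowed to use.
first_core = sorted(os.sched_getaffinity(0))[0]
print(pin_to_cores([first_core]))
```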


The MRG Realtime product includes an application, tuna, that allows
for easy setting of the affinity of interrupts and processes, through a
GUI or the command line.


      AMQP parameters

Qpid parameters can be specified on the command line, through 
environment variables or through the Qpid configuration file.


The tests were run with the following qpidd options:

--auth no
      Turn off connection authentication; this makes setting up the
      test environment easier.

--mgmt-enable no
      Disable the collection of management data.

--tcp-nodelay
      Disable the batching of packets.

--worker-threads <#>
      Set the number of IO worker threads to <#>.
      This was only used for the latency tests, where the range used
      was between 1 and one more than the number of cores in a socket.
      The default, which was used for throughput, is one more than
      the total number of active cores.

/*Table 2*/
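
Put together, the broker command line for a run might be assembled as
below. This is only a sketch: the helper names are hypothetical, and the
flags are the ones listed in Table 2.

```python
def qpidd_args(worker_threads=None):
    """Build a qpidd argument list using the options from Table 2.

    Leaving worker_threads as None keeps the broker default, which was
    what the throughput tests used.
    """
    args = ["qpidd", "--auth", "no", "--mgmt-enable", "no", "--tcp-nodelay"]
    if worker_threads is not None:
        if worker_threads < 1:
            raise ValueError("worker_threads must be at least 1")
        args += ["--worker-threads", str(worker_threads)]
    return args


def latency_thread_range(cores_per_socket):
    """Thread counts tried for latency: 1 up to cores per socket plus one."""
    return range(1, cores_per_socket + 2)


# Hypothetical quad-core socket.
for threads in latency_thread_range(4):
    print(" ".join(qpidd_args(threads)))
```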


*Table 3* details the options which were specified for /perftest/.
For all testing in this paper a /count/ of 200000 was used.
Experimentation was used to determine whether setting /tcp-nodelay/ was
beneficial. For each /size/ reported, /npubs/ and /nsubs/ were set
equally from 1 to 8 by powers of 2, while /qt/ was set from 1 to 16,
also by powers of 2. The highest value for each /size/ is reported.

--nsubs <#>,
--npubs <#>
      Number of publishers/subscribers per client.

--count <#>
      Number of messages sent per pub per qt,
      so total messages = count * qt * (npub+nsub).

--qt <#>
      Number of queues being used.

--size <#>
      Message size.

--tcp-nodelay
      Disable the batching of packets.

--protocol <tcp|rdma>
      Used to specify RDMA; the default is TCP.

/*Table 3*/
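
The parameter sweep described for perftest can be enumerated as below,
using the total-messages formula from Table 3. This is a sketch of the
combinations only; it does not run perftest, and the function names are
hypothetical.

```python
from itertools import product

COUNT = 200000  # the /count/ used for all testing in the report


def powers_of_two(limit):
    """Yield 1, 2, 4, ... up to and including limit."""
    value = 1
    while value <= limit:
        yield value
        value *= 2


def total_messages(count, qt, npubs, nsubs):
    """Total messages as given in Table 3: count * qt * (npub + nsub)."""
    return count * qt * (npubs + nsubs)


def sweep():
    """Yield (npubs/nsubs, qt, total messages) for every combination."""
    for n, qt in product(powers_of_two(8), powers_of_two(16)):
        yield n, qt, total_messages(COUNT, qt, n, n)


for n, qt, total in sweep():
    print(f"--npubs {n} --nsubs {n} --qt {qt} --count {COUNT}  total={total}")
```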


The parameters that were used for /latencytest/ are listed in *Table 4*. 
A 10000 message /rate/ was chosen since all the test interconnects would 
be able to maintain this rate. When specified, the /max-frame-size/ was 
set to 120 more than the size. When a /max-frame-size/ was specified, 
/bounds-multiplier/ was set to 1.


--rate <#>
      Target message rate.

--size <#>
      Message size.

--max-frame-size <#>
      The maximum frame size to request.
      Only specified for ethernet interconnects.

--bounds-multiplier <#>
      Bound size of write queue (as a multiple of the max frame size).
      Only specified for ethernet interconnects.

--tcp-nodelay
      Disable the batching of packets.

--protocol <tcp|rdma>
      Used to specify RDMA; the default is TCP.

/*Table 4 */
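
As an illustration of how those rules combine, the latencytest arguments
for a given message size could be built like this. The helper is
hypothetical; the "size plus 120" frame size and the multiplier of 1
come from the text above, and both are added only for ethernet.

```python
def latencytest_args(size, rate=10000, ethernet=True, protocol="tcp"):
    """Build a latencytest argument list from the Table 4 options."""
    args = ["latencytest", "--rate", str(rate), "--size", str(size)]
    if ethernet:
        # max-frame-size was set to 120 more than the message size,
        # with bounds-multiplier set to 1 whenever it was specified.
        args += ["--max-frame-size", str(size + 120),
                 "--bounds-multiplier", "1"]
    if protocol != "tcp":
        args += ["--protocol", protocol]
    return args


print(" ".join(latencytest_args(1024)))
print(" ".join(latencytest_args(1024, ethernet=False, protocol="rdma")))
```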






ft420 wrote:
> exchange used: fanout 
> we are running broker on 2 core machine. fanout send client is also running
> on the same windows machine.
> there are 3 recv applications running on three separate machines.
>
> we were trying with-> --worker-thread 9 which gives poor performance
> compared to without --worker-threads option 
> now we have taken --worker-threads 2 as no of processors on the machine
> where broker is running is 2. in this case how many threads exactly has to
> be used to so as to improve performance
>
> Thanks
>
>
>
> Gordon Sim wrote:
>   
>> ft420 wrote:
>>     
>>> hi,
>>>
>>> without --worker-thread option pidstat command shows that there are by
>>> default 6 threads created 
>>> with --worker-thread 6 option pidstat command shows that there are 9 i.e.
>>> default 6 + 3 threads created.
>>>       
>> Fyi: the extra three threads are timer threads for various different
>> tasks.
>>
>>     
>>> As per documentation worker threads option is used to improve
>>> performance. 
>>> I checked with --worker-thread 10 and without --worker-thread.
>>> direct_producer sends 100000 messages put time increases with
>>> --worker-thread 10 as compared to --worler-thread option.
>>>       
>> Running more threads than there are processors will not improve any real 
>> parallelism. There is also no real value from using more threads than 
>> you have active connections (so in a test with just one producer and one 
>> consumer connection you won't see any benefit from having more than 2 
>> worker threads).
>>
>> ---------------------------------------------------------------------
>> Apache Qpid - AMQP Messaging Implementation
>> Project:      http://qpid.apache.org
>> Use/Interact: mailto:users-subscribe@qpid.apache.org
>>
>>
>>
>>     
>
>   

