From: Carl Trieloff
Reply-To: cctrieloff@redhat.com
To: users@qpid.apache.org
Date: Wed, 03 Jun 2009 09:26:06 -0400
Subject: Re: worker thread with qpidd

Generally, if you set the number of threads larger than the core count,
your performance will go down, as expected. The reason the option is
there, however, is that if you pin the process to fewer cores than the
machine has, the thread count can be adjusted to match.

On a 2 core machine with the client on the same machine there are not
that many options, as the client and broker will contend for the
resources of the machine.

My employer has done a report with HP. It is too big to mail out to the
list, but here is some of the basic setup that was done for it.

regards
Carl.
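As a concrete illustration of the pinning point above (not taken from
the report; the core numbers, thread count and client name are just
assumptions for a 2 core box), one option is to give the broker a
single core, size its IO thread pool for that core, and leave the other
core to the client:

# taskset -c 0 qpidd --worker-threads 2    (broker pinned to core 0; 1 core + 1 IO thread)
# taskset -c 1 your_send_client            (sender pinned to core 1; placeholder client name)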
Throughput (Perftest)

For throughput, perftest is used to drive the broker for this
benchmark. This harness is able to start up multiple producers and
consumers in balanced (n:n) or unbalanced (x:y) configurations.

What the test does:

* creates a control queue
* starts x:y producers and consumers
* waits for all the processes to signal they are ready
* the controller records a timestamp
* producers reliably enqueue messages onto the broker as fast as they can
* consumers reliably dequeue messages from the broker as fast as they can
* once the last message, which is specially marked, is received, the
  controller is signalled
* the controller waits for all complete signals, records a timestamp and
  calculates the rate

The throughput is then calculated as the total number of messages
reliably transferred divided by the time taken to transfer them.

Latency (Latencytest)

For latency, latencytest is used to drive the broker for this
benchmark. This harness produces messages at a specified rate, or a
specified number of messages, each of which is timestamped, sent to the
broker and looped back to the client node. The client reports the
minimum, maximum and average round-trip time for each reporting
interval when a rate is used, or for all the messages sent when a count
is used.

Tuning & Parameter Settings

For the testing in this paper the systems were not used for any other
purpose. The configuration and tuning detailed here should therefore be
reviewed before being applied where other applications run alongside
MRG Messaging.

Processes

For the testing performed, the following services were disabled (unless
specified otherwise):

    SELinux         cpuspeed        irqbalance      haldaemon
    yum-updatesd    smartd          setroubleshoot  sendmail
    rpcgssd         rpcidmapd       rpcsvcgssd      rhnsd
    pcscd           mdmonitor       mcstrans        kdump
    isdn            iptables        ip6tables       hplip
    hidd            gpm             cups            bluetooth
    avahi-daemon    restorecond     auditd

SysCtl

The following kernel parameters were added to /etc/sysctl.conf:

net.ipv4.conf.default.arp_filter = 1
net.ipv4.conf.all.arp_filter     = 1
    Only respond to ARP requests on the matching interface.

net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
    Maximum receive/send socket buffer size, in bytes.

net.core.rmem_default = 262144
net.core.wmem_default = 262144
    Default setting of the socket receive/send buffer, in bytes.

net.ipv4.tcp_rmem = 65536 4194304 8388608
net.ipv4.tcp_wmem = 65536 4194304 8388608
    Vector of 3 integers: min, default, max.
    min     - minimal size of the receive/send buffer used by TCP sockets
    default - default size of the receive/send buffer used by TCP sockets
    max     - maximal size of the receive/send buffer allowed for
              automatically selected buffers for a TCP socket

net.core.netdev_max_backlog = 10000
    Maximum number of packets queued on the input side when the
    interface receives packets faster than the kernel can process them.
    Applies to non-NAPI devices only.

net.ipv4.tcp_window_scaling = 0
    Enable window scaling as defined in RFC1323.

net.ipv4.tcp_mem = 262144 4194304 8388608
    Vector of 3 integers: low, pressure, high.
    low      - below this number of pages TCP is not bothered about its
               memory appetite
    pressure - when the amount of memory allocated by TCP exceeds this
               number of pages, TCP moderates its memory consumption and
               enters memory pressure mode, which is exited when memory
               consumption falls under "low"
    high     - number of pages allowed for queueing by all TCP sockets

Table 1
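The Table 1 entries can also be applied to a running system with sysctl
before being made permanent in /etc/sysctl.conf; for example (the
values are simply the ones from the table):

# sysctl -w net.core.rmem_max=8388608
# sysctl -w net.core.wmem_max=8388608
# sysctl -w net.ipv4.tcp_rmem="65536 4194304 8388608"
# sysctl -p        (reload all entries listed in /etc/sysctl.conf)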
ethtool

Some of the options ethtool allows the operator to change relate to
coalesce and offload settings. However, during experimentation only
changing the ring settings had a noticeable effect for throughput
testing.

# ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             256
RX Mini:        0
RX Jumbo:       0
TX:             256

# ethtool -G eth1 rx 2048 tx 2048

# ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             2048
RX Mini:        0
RX Jumbo:       0
TX:             2048

ifconfig

ifconfig was used to increase the maximum transfer unit (MTU) to
support jumbo frames, and to increase txqueuelen, for the throughput
tests where these changes had a noticeable effect.

# ifconfig eth1
eth1    Link encap:Ethernet  HWaddr 00:18:71:EC:02:80
        inet addr:192.168.15.96  Bcast:192.168.15.255  Mask:255.255.255.0
        inet6 addr: fe80::218:71ff:feec:280/64 Scope:Link
        UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
        RX packets:8 errors:0 dropped:0 overruns:0 frame:0
        TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
        collisions:0 txqueuelen:1000
        RX bytes:480 (480.0 b)  TX bytes:594 (594.0 b)
        Memory:fdee0000-fdf00000

# ifconfig eth1 mtu 9000 txqueuelen 2000

# ifconfig eth1
eth1    Link encap:Ethernet  HWaddr 00:18:71:EC:02:80
        inet addr:192.168.15.96  Bcast:192.168.15.255  Mask:255.255.255.0
        inet6 addr: fe80::218:71ff:feec:280/64 Scope:Link
        UP BROADCAST MULTICAST  MTU:9000  Metric:1
        RX packets:8 errors:0 dropped:0 overruns:0 frame:0
        TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
        collisions:0 txqueuelen:2000
        RX bytes:480 (480.0 b)  TX bytes:594 (594.0 b)
        Memory:fdee0000-fdf00000

CPU affinity

For latency testing, all interrupts from the cores of one CPU socket
were reassigned to other cores. The interrupts for the interconnect
under test were assigned to cores of this vacated socket. The processes
related to the interconnect (e.g. ib_mad, ipoib) were then scheduled to
run on the vacated cores. The Qpid daemon was also scheduled to run on
these, or a subset of, the vacated cores. How latencytest was scheduled
was determined by the results of experiments limiting, or not limiting,
the latencytest process to certain cores. Experiments with perftest
showed that the best performance was usually achieved with the affinity
settings left as they are after a boot, not manipulated.

Interrupts can be directed to be handled by specific cores.
/proc/interrupts can be queried to identify the interrupts for devices
and the number of times each CPU/core has handled each interrupt. For
each interrupt, a file named /proc/irq/<irq#>/smp_affinity contains a
hexadecimal mask that controls which cores may respond to that
interrupt. The contents of these files can be queried or set.

Processes can be restricted to run on a given set of CPUs/cores.
taskset can be used to define the list of CPUs/cores that a process may
be scheduled to execute on. The MRG Realtime product includes an
application, tuna, that allows easy setting of the affinity of
interrupts and processes, through a GUI or the command line.

AMQP parameters

Qpid parameters can be specified on the command line, through
environment variables, or through the Qpid configuration file. The
tests were run with the following qpidd options:

--auth no             turn off connection authentication; makes setting
                      up the test environment easier
--mgmt-enable no      disable the collection of management data
--tcp-nodelay         disable the batching of packets
--worker-threads <#>  set the number of IO worker threads to <#>. This
                      was only used for the latency tests, where the
                      range used was between 1 and one more than the
                      number of cores in a socket. The default, which
                      was used for throughput, is one more than the
                      total number of active cores.

Table 2
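Put together, the broker command line for these runs therefore looks
roughly like the sketch below (the worker-thread value is just the
two-cores-plus-one case from earlier, not a recommendation; for the
latency runs qpidd would additionally be pinned with taskset as
described under CPU affinity):

# qpidd --auth no --mgmt-enable no --tcp-nodelay --worker-threads 3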
Table 3 details the options that were specified for perftest. For all
testing in this paper a count of 200000 was used. Experimentation was
used to determine whether setting tcp-nodelay was beneficial. For each
size reported, npubs and nsubs were set equally, from 1 to 8 by powers
of 2, while qt was set between 1 and 16, also by powers of 2. The
highest value for each size is reported.

--nsubs <#>, --npubs <#>  number of publishers/subscribers per client
--count <#>               number of messages sent per pub per qt, so
                          total messages = count * qt * (npubs + nsubs)
--qt <#>                  number of queues being used
--size <#>                message size
--tcp-nodelay             disable the batching of packets
--protocol                used to specify RDMA; the default is TCP

Table 3

The parameters that were used for latencytest are listed in Table 4. A
rate of 10000 messages per second was chosen since all the tested
interconnects would be able to maintain it. When specified,
max-frame-size was set to 120 more than the message size. When a
max-frame-size was specified, bounds-multiplier was set to 1.

--rate <#>                target message rate
--size <#>                message size
--max-frame-size <#>      the maximum frame size to request; only
                          specified for Ethernet interconnects
--bounds-multiplier <#>   bound size of the write queue (as a multiple
                          of the max frame size); only specified for
                          Ethernet interconnects
--tcp-nodelay             disable the batching of packets
--protocol                used to specify RDMA; the default is TCP

Table 4

(Sample invocations using these options are sketched at the end of this
mail.)

ft420 wrote:
> exchange used: fanout
> we are running broker on 2 core machine. fanout send client is also
> running on the same windows machine.
> there are 3 recv applications running on three separate machines.
>
> we were trying with --worker-threads 9 which gives poor performance
> compared to without --worker-threads option
> now we have taken --worker-threads 2 as no of processors on the machine
> where broker is running is 2. in this case how many threads exactly has
> to be used so as to improve performance
>
> Thanks
>
>
> Gordon Sim wrote:
>
>> ft420 wrote:
>>
>>> hi,
>>>
>>> without --worker-thread option pidstat command shows that there are by
>>> default 6 threads created
>>> with --worker-thread 6 option pidstat command shows that there are 9
>>> i.e. default 6 + 3 threads created.
>>>
>> Fyi: the extra three threads are timer threads for various different
>> tasks.
>>
>>> As per documentation worker threads option is used to improve
>>> performance.
>>> I checked with --worker-thread 10 and without --worker-thread.
>>> direct_producer sends 100000 messages; put time increases with
>>> --worker-thread 10 as compared to without the --worker-thread option.
>>>
>> Running more threads than there are processors will not improve any
>> real parallelism. There is also no real value in using more threads
>> than you have active connections (so in a test with just one producer
>> and one consumer connection you won't see any benefit from having more
>> than 2 worker threads).
>>
>> ---------------------------------------------------------------------
>> Apache Qpid - AMQP Messaging Implementation
>> Project: http://qpid.apache.org
>> Use/Interact: mailto:users-subscribe@qpid.apache.org
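For completeness, the Table 3 and Table 4 options translate into
invocations roughly like the ones below. The particular size, qt and
frame-size values are only placeholders chosen to match the rules above
(max-frame-size = message size + 120), not the full matrix from the
report:

# perftest --npubs 4 --nsubs 4 --qt 4 --count 200000 --size 1024 --tcp-nodelay

# latencytest --rate 10000 --size 1024 --max-frame-size 1144 --bounds-multiplier 1 --tcp-nodelay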