Subject: Re: Listeners buckets and duplication w/ and w/o SO_REUSEPORT on trunk
From: Yann Ylavic <ylavic.dev@gmail.com>
To: httpd <dev@httpd.apache.org>
Date: Fri, 7 Nov 2014 16:48:39 +0100

Hi Yingqi,

thanks for sharing your results.

On Thu, Nov 6, 2014 at 9:12 PM, Lu, Yingqi wrote:
> I do not see any documents regarding this new configurable flag
> ListenCoresBucketsRatio (maybe I missed it).

Will do it when possible, good point.

> Regarding how to make small systems take advantage of this patch, I
> actually did some testing on systems with fewer cores. The data show
> that when the system has fewer than 16 cores, more than 1 bucket does
> not bring any throughput or response time benefit. The patch is mainly
> used for big systems, to resolve the scalability issue. That is the
> reason why we previously hard-coded the ratio to 8 (impacting only
> systems with 16 cores or more).
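For the archives, a minimal sketch of what such a cores-to-buckets ratio
could look like; this is illustrative only (the function name, the
sysconf() probe and the 2*ratio threshold are assumptions, not the
actual listen.c code):

    /* Illustrative only: map online CPU cores to a number of listener
     * buckets given a cores-per-bucket ratio (e.g. 8), so that systems
     * with fewer than 2*ratio cores keep a single bucket. */
    #include <unistd.h>

    static int guess_num_buckets(long cores_per_bucket)
    {
        long cores = sysconf(_SC_NPROCESSORS_ONLN);
        int buckets = 1;

        if (cores_per_bucket > 0 && cores >= 2 * cores_per_bucket) {
            buckets = (int)(cores / cores_per_bucket); /* 16 -> 2, 32 -> 4, ... */
        }
        return buckets;
    }

With the default ratio of 8 this gives a single bucket below 16 cores,
which matches the behaviour described above.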
> The accept_mutex is not much of a bottleneck anymore with the current
> patch implementation. The current implementation already cuts 1 big
> mutex into multiple smaller mutexes in the multiple-listen-statements
> case (each bucket has its dedicated accept_mutex). To prove this, our
> data show performance parity between 1 listen statement (Listen 80, no
> accept_mutex) and 2 listen statements (Listen 192.168.1.1:80, Listen
> 192.168.1.2:80, with accept_mutex) with the current trunk version.
> Compared against trunk without the SO_REUSEPORT patch, we see a 28%
> performance gain in the 1-listen-statement case and a 69% gain in the
> 2-listen-statements case.

With the current implementation and a reasonable number of servers
(children) started, this is surely true, your numbers prove it.
However, the fewer buckets (CPU cores), the more contention on each
bucket (i.e. listeners waiting on the same socket(s)/mutex). So the
results with fewer cores are quite expected IMHO. But we can't remove
the accept mutex, since there will always be more servers than buckets.

> Regarding the approach where each child has its own listen socket, I
> did some testing with the current trunk version, increasing the number
> of buckets to be equal to a reasonable ServerLimit (this avoids changes
> in the number of child processes). I also verified that MaxClients and
> ThreadsPerChild were set properly. I used a single listen statement so
> that the accept_mutex was disabled. Compared against the current
> approach, this has ~25% less throughput with significantly higher
> response time.
>
> In addition, implementing a separate listen socket for each child
> performs worse and has connection loss/timeout issues with the current
> Linux kernel. Below is more information/data we collected with the
> "each child process has its own listen socket" approach:
> 1. During the run, we noticed tons of "read timed out" errors. These
> errors not only happen when the system is highly utilized, they even
> happen when the system is only 10% utilized. The response time was
> high.
> 2. Compared to the current trunk implementation, we found the "each
> child has its own listen socket" approach results in significantly
> higher (up to 10X) response time at different CPU utilization levels.
> At peak performance it has 20+% less throughput, with tons of
> "connection reset" errors in addition to the "read timed out" errors.
> The current trunk implementation does not have these errors.
> 3. During graceful restart, there are tons of connection losses.

Did you also set StartServers = ServerLimit? One bucket per child
implies that all the children are up to receive connections, or the
system may distribute connections to buckets still waiting for a child
to handle them. Linux may distribute the connections based on the
listen()ing sockets, not the ones currently being accept()ed by some
child.

I don't know your configuration regarding ServerLimit, or more
accurately the number of children really started during the steady
state of the stress test: let that number be S. I suppose that
S >= num_buckets in your tests with the current implementation, so
there is always at least one child to accept() connections on a
bucket, and this cannot happen.
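To illustrate the kernel side of this, here is a standalone sketch (not
httpd code; the socket details and backlog value are assumptions): with
SO_REUSEPORT, each bucket is simply an independent listening socket
bound to the same port, and the kernel hashes incoming connections
across all such sockets whether or not somebody is currently accept()ing
on each of them. A connection hashed to a socket whose child is not up
yet just sits in that socket's accept queue, which is consistent with
the "read timed out" symptom described above.

    /* Standalone illustration (not httpd code): one "bucket" is one
     * listening socket with SO_REUSEPORT set; N such sockets can be
     * bound to the same port and the kernel spreads connections across
     * their accept queues (Linux >= 3.9). */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #ifndef SO_REUSEPORT
    #define SO_REUSEPORT 15  /* Linux value */
    #endif

    static int make_bucket_listener(unsigned short port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        struct sockaddr_in sa;

        if (fd < 0)
            return -1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_ANY);
        sa.sin_port = htons(port);

        if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0
            || listen(fd, 511) < 0) {  /* backlog value is arbitrary here */
            close(fd);
            return -1;
        }
        return fd;  /* this socket's accept queue is one bucket */
    }

Each child then accept()s on its own such fd, which is why all S
children need to be up before the kernel starts spreading connections
over S queues.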
I expect that with one bucket per child (listen()ed in the parent
process), any number of cores, no accept mutex, and
StartServers = ServerLimit = S, the system would distribute the
connections evenly across all the children, without any "read timeout"
or graceful restart issue. Otherwise there is a(nother) kernel bug not
worked around by the current implementation, and the same thing may
happen when (S / num_buckets) reaches some limit...

> Based on the above findings, I think we may want to keep the current
> approach. It is a clean, working and better performing one :-)

My point is not (at all) to replace the current approach, but maybe to
have another ListenBuckets* directive for systems with any number of
cores. This would not change the current ListenCoresBucketsRatio
behaviour, just looking at another way to configure/exploit listener
buckets ;)

Regards,
Yann.