httpd-dev mailing list archives

From "Lu, Yingqi" <yingqi...@intel.com>
Subject RE: Listeners buckets and duplication w/ and w/o SO_REUSEPORT on trunk
Date Fri, 07 Nov 2014 16:52:43 GMT
Hi Yann,

Thanks for your quick email.

Yes, with the current implementation the accept mutex is not removed, just cut into smaller
ones. My point was that on a smaller system the hardware resources are smaller too, so the
maximum traffic it can drive is lower than on the big systems. In that sense, the contention
per child process/bucket may not increase much compared to a big system. Running at peak
performance, the total number of child processes should scale with the size of the system if
there are no other hardware resource limitations. The child-processes-per-bucket ratio should
then stay roughly the same regardless of system size, as long as we use a reasonable
ListenCoresBucketsRatio.
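
To illustrate the ratio idea, the bucket count scales roughly as follows (a minimal sketch with
hypothetical names, not the actual server/listen.c code):

/* Sketch only (hypothetical helper, not the real server/listen.c logic):
 * the number of listener buckets grows with the number of online cores
 * divided by ListenCoresBucketsRatio, so small machines keep a single
 * bucket and each bucket gets its own, smaller accept mutex. */
static int num_buckets(int online_cores, int cores_buckets_ratio)
{
    int n = 1;
    if (cores_buckets_ratio > 0 && online_cores > cores_buckets_ratio)
        n = online_cores / cores_buckets_ratio;
    return n;   /* e.g. ratio 8: 8 cores -> 1 bucket, 32 cores -> 4 buckets */
}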

Regarding the "timeout" issue, I think I did not describe it clearly in my last email. Testing
the trunk version with ServerLimit = Number_buckets = StartServers, I did not see any connection
timeouts or connection losses. I only saw performance regressions.

The "timeout or connection losses" issues only occur when I tested the approach that create
the listen socket inside child process. In this case, master process does not control any
listen sockets any more, but let each child do it on its own. If I remember correctly, I think
that was your quick prototype a while back after I posted the original patch. In the original
discussion thread, I mentioned the connection issues and the performance degradation as well.
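
To make it concrete, that per-child approach is essentially the following (an illustrative
sketch only, not the actual prototype code; the port and the StartServers value are made up):

#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

static int make_child_listener(int port)
{
    int on = 1;
    struct sockaddr_in sa;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_ANY);
    sa.sin_port = htons(port);

    /* every child binds its own socket to the same port via SO_REUSEPORT */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on));
    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0 || listen(fd, 511) < 0)
        return -1;
    return fd;
}

static void child_main(int port)
{
    int lfd = make_child_listener(port);
    for (;;) {
        int c = accept(lfd, NULL, NULL);   /* no accept mutex at all */
        if (c >= 0) {
            /* ... handle the connection ... */
            close(c);
        }
    }
}

int main(void)
{
    for (int i = 0; i < 3; i++) {          /* say StartServers = 3 */
        if (fork() == 0)
            child_main(8080);              /* never returns */
    }
    for (;;)
        pause();                           /* parent only supervises */
}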


Again, thank you very much for your help!

Yingqi


-----Original Message-----
From: Yann Ylavic [mailto:ylavic.dev@gmail.com] 
Sent: Friday, November 07, 2014 7:49 AM
To: httpd
Subject: Re: Listeners buckets and duplication w/ and w/o SO_REUSEPORT on trunk

Hi Yingqi,

thanks for sharing your results.

On Thu, Nov 6, 2014 at 9:12 PM, Lu, Yingqi <yingqi.lu@intel.com> wrote:
> I do not see any documentation regarding this new configurable flag
> ListenCoresBucketsRatio (maybe I missed it)

Will do it when possible, good point.

> Regarding how to make small systems take advantage of this patch, I actually did some testing
> on systems with fewer cores. The data show that when a system has fewer than 16 cores, more
> than 1 bucket does not bring any throughput or response time benefit. The patch is mainly
> intended for big systems, to resolve the scalability issue. That is why we previously hard
> coded the ratio to 8 (impacting only systems with 16 cores or more).
>
> The accept_mutex is not much of a bottleneck anymore with the current patch implementation.
> The current implementation already cuts 1 big mutex into multiple smaller mutexes in the
> multiple Listen statements case (each bucket has its dedicated accept_mutex). To prove this,
> our data show performance parity between 1 Listen statement (Listen 80, no accept_mutex) and
> 2 Listen statements (Listen 192.168.1.1:80, Listen 192.168.1.2:80, with accept_mutex) on the
> current trunk version. Compared against trunk without the SO_REUSEPORT patch, we see a 28%
> performance gain in the 1 Listen statement case and a 69% gain in the 2 Listen statements case.

With the current implementation and a reasonable number of servers
(children) started, this is surely true, your numbers prove it.
However, the fewer buckets (CPU cores), the more contention on each bucket (i.e. listeners
waiting on the same socket(s)/mutex).
So the results with fewer cores are quite expected IMHO.

But we can't remove the accept mutex since there will always be more servers than the number
of buckets.
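
For clarity, the per-bucket serialization amounts to something like the sketch below (simplified,
with a pthread mutex for illustration only; the real MPMs use a cross-process APR mutex and poll
the bucket's sockets):

#include <pthread.h>

struct bucket {
    pthread_mutex_t accept_mutex;  /* one mutex per bucket, not one global one */
    int listen_fd;                 /* this bucket's (duplicated) listener */
};

static void *listener_loop(void *arg)
{
    struct bucket *b = arg;        /* a child only knows its own bucket, so it
                                      competes only with that bucket's children */
    for (;;) {
        pthread_mutex_lock(&b->accept_mutex);
        /* accept() only on this bucket's socket ... */
        pthread_mutex_unlock(&b->accept_mutex);
        /* ... hand the accepted connection to a worker ... */
    }
    return NULL;
}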

>
> Regarding the approach where each child has its own listen socket, I did some testing with
> the current trunk version, increasing the number of buckets to equal a reasonable ServerLimit
> (this avoids changes in the number of child processes). I also verified that MaxClients and
> ThreadsPerChild were set properly. I used a single Listen statement so that the accept_mutex
> was disabled. Compared against the current approach, this has ~25% less throughput with
> significantly higher response time.
>
> In addition to this, implementing a listen socket separately for each child performs worse
> and also shows connection loss/timeout issues with the current Linux kernel. Below is more
> information/data we collected with the "each child process has its own listen socket" approach:
> 1. During the run, we noticed tons of "read timed out" errors. These errors not only happen
> when the system is highly utilized; they even happen when the system is only 10% utilized.
> The response time was high.
> 2. Compared to the current trunk implementation, we found the "each child has its own listen
> socket" approach results in significantly higher (up to 10X) response time at different CPU
> utilization levels. At peak performance, it has 20+% less throughput, with tons of "connection
> reset" errors in addition to "read timed out" errors. The current trunk implementation does
> not have these errors.
> 3. During graceful restart, there are tons of connection losses.

Did you also set StartServers = ServerLimit?
One bucket per child implies that all the children are up to receive connections, otherwise the
system may distribute connections to buckets still waiting for a child to handle them.
Linux may distribute the connections based on the listen()ing sockets, not the ones currently
being accept()ed by some child.
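
A standalone way to see that behaviour (a sketch assuming Linux with SO_REUSEPORT, not httpd
code): bind several SO_REUSEPORT sockets to the same port but accept() on only one of them, and
the connections hashed to the other sockets hang until the client gives up.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define NUM_LISTENERS 4   /* stands in for "one bucket per child" */

int main(void)
{
    int fds[NUM_LISTENERS];
    struct sockaddr_in sa;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_ANY);
    sa.sin_port = htons(8080);

    for (int i = 0; i < NUM_LISTENERS; i++) {
        int on = 1;
        fds[i] = socket(AF_INET, SOCK_STREAM, 0);
        setsockopt(fds[i], SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on));
        if (bind(fds[i], (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
            listen(fds[i], 128) < 0) {
            perror("bind/listen");
            return 1;
        }
    }

    /* Only listener 0 ever accepts, yet the kernel keeps spreading new
     * connections over all four listen queues, so most clients hang and
     * eventually report "read timed out". */
    for (;;) {
        int c = accept(fds[0], NULL, NULL);
        if (c >= 0)
            close(c);
    }
}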

I don't know your configuration regarding ServerLimit, or more accurately the number of children
really started during the steady state of the stress test: let that number be S.

I suppose that S >= num_buckets in your tests with the current implementation, so there is
always at least one child to accept() connections on each bucket, and this cannot happen.

I expect that with one bucket per child (listen()ed in the parent process), any number of cores,
no accept mutex, and StartServers = ServerLimit = S, the system would distribute the connections
evenly across all the children, without any "read timeout" or graceful restart issue.
Otherwise there is a(nother) kernel bug not worked around by the current implementation, and
the same thing may happen when (S / num_buckets) reaches some limit...

>
> Based on the above findings, I think we may want to keep the current 
> approach. It is a clean, working and better performing one :-)

My point is not (at all) to replace the current approach, but maybe to have another ListenBuckets*
directive for systems with any number of cores. This would not change the current
ListenCoresBucketsRatio behaviour; it is just another way to configure/exploit listener buckets ;)

Regards,
Yann.