Subject: Re: Bug that spans tomcat and tomcat-native
From: Mark Thomas
To: Tomcat Developers List
Date: Mon, 27 Jun 2016 09:32:58 +0100

Thanks. This is all really helpful.

I need to spend some time digesting all of this information, reviewing
the OpenSSL code and trying to figure out what might be going on. One of
the things I am trying to do is understand why the proposed patch fixes
the issue and whether an alternative approach is possible. Ideally, I'd
like to fix this entirely in tc-native so we don't have to cross the
Java-native boundary an additional time, which is relatively slow.

Thanks for all your help tracking this down so far.

Mark

On 25/06/2016 00:51, therealneworld@gmail.com wrote:
> On Fri, Jun 24, 2016 at 5:31 PM, Mark Thomas wrote:
>> On 24/06/2016 21:52, therealneworld@gmail.com wrote:
>>
>>>> I'm wondering if this is a problem that builds up over time. If I
>>>> understood your previous posts correctly, running the big tests
>>>> immediately gave ~700MB/s, whereas running the small tests and then
>>>> the big tests resulted in ~350MB/s during the big tests. Are you able
>>>> to experiment with this a little bit? For example, if you do big
>>>> tests, 1M (~20%) small tests, big tests, 1M small tests, big tests,
>>>> etc., what is the data rate for the big tests after 0, 1M, 2M, 3M, 4M
>>>> and 5M little tests?
>>>
>>> Sure, I can try that. For the in-between tests, do you want me to run
>>> those for a set amount of time or for a set number of files? For
>>> example, each smaller batch for 20 min, then 10 min of large, and then
>>> the next smaller size?
>>
>> I was thinking a set number of files.
>>
>> It would also be useful to know how many threads the executor has
>> created at each point as well. (JMX should tell you that. You might
>> need to adjust the executor so it doesn't stop idle threads.)
>
> I saw your message about not stopping idle threads after I had already
> started things.
>
> 1st 100M test:
>   851348MB/s
>   Executor:
>     largestPoolSize: 25
>     poolSize: 25
> 1st 4k test:
>   Executor:
>     largestPoolSize: 401
>     poolSize: 401
> 2nd 100M test:
>   460147MB/s
>   Executor:
>     largestPoolSize: 401
>     poolSize: 79
> 2nd 4k test:
>   Executor:
>     largestPoolSize: 414
>     poolSize: 414
> 3rd 100M test:
>   429127MB/s
>   Executor:
>     largestPoolSize: 414
>     poolSize: 80
> 3rd 4k test:
>   Executor:
>     largestPoolSize: 414
>     poolSize: 401
> 4th 100M test:
>   437918MB/s
>   Executor:
>     largestPoolSize: 414
>     poolSize: 86
> 4th 4k test:
>   Executor:
>     largestPoolSize: 414
>     poolSize: 401
> 5th 100M test:
>   464837MB/s
>   Executor:
>     largestPoolSize: 414
>     poolSize: 87
>
> It looks like the problem occurs right after the first set of 4k puts
> and doesn't get any worse, so whatever causes the issue happens early.
> This is getting stranger, and I really cannot explain why calling
> ERR_remove_thread_state reliably improves performance.
>
>> Going back to your original description, you said you saw blocking
>> during the call to ERR_clear_err(). Did you mean ERR_clear_error()?
>> Either way, could you provide the full stack trace of an example
>> blocked thread? And, ideally, the stack trace of the thread currently
>> holding the lock? I'm still trying to understand what is going on here
>> since, based on my understanding of the code so far, the HashMap is
>> bounded (to the number of threads) and should reach that limit fairly
>> quickly.
>
> Sorry, yes, I did mean ERR_clear_error().
>
> #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> #1  0x00007f0f61ab989d in __GI___pthread_mutex_lock (mutex=0x2b49c58)
>     at ../nptl/pthread_mutex_lock.c:80
> #2  0x00007f0f3205f183 in int_thread_get (create=0) at err.c:446
> #3  0x00007f0f3205f68d in int_thread_get_item (d=0x7f0ca89c7ce0) at err.c:491
> #4  0x00007f0f32060094 in ERR_get_state () at err.c:1014
> #5  0x00007f0f320602cf in ERR_clear_error () at err.c:747
> #6  0x00007f0f325f3579 in ssl_socket_recv (sock=0x7f0dcc391980,
>     buf=0x7f0eec067820
>     "lock->199808-Source_filename->rhino_perf_https_lt_100g_a-Loop->1-Count->11089487-11089488-11089489-11089490-11089491-11089492-11089493-11089494-11089495-11089496-11089497-11089498-11089499-11089500-11"...,
>     len=0x7f0ca89c7ff0) at src/sslnetwork.c:401
> #7  0x00007f0f325ece99 in Java_org_apache_tomcat_jni_Socket_recvbb
>     (e=, o=, sock=, offset=, len=) at src/network.c:957
>
> I tried getting more data, but the JVM tends to dump core when gdb is
> attached, or it goes too slowly to actually cause the lock contention. I
> can reliably see a thread waiting on this lock if I attach to a single
> thread, randomly interrupt it, and look at the backtrace. When I look at
> the mutex it has a different owner each time, so different threads are
> getting the lock.
>
> I will play with this a bit more on Monday.
>
> -nate
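
For illustration only, a minimal sketch of the recv path that shows up in
frames #5/#6 of the backtrace above, assuming OpenSSL 1.0.x. This is not
the actual tc-native code and not the proposed patch; the function names
(sketch_ssl_recv, sketch_thread_cleanup) are made up, and only the
OpenSSL calls are real. It is meant to make concrete why every receive
touches the global error-state lock, and what doing the cleanup on the
native side (without an extra Java->native crossing) could look like.

    #include <openssl/err.h>
    #include <openssl/ssl.h>

    /* Simplified stand-in for a tc-native style SSL recv wrapper. */
    static int sketch_ssl_recv(SSL *ssl, char *buf, int len)
    {
        /* ERR_clear_error() -> ERR_get_state() -> int_thread_get() takes
         * the global err.c lock -- the mutex the blocked threads in the
         * backtrace are waiting on. */
        ERR_clear_error();

        int n = SSL_read(ssl, buf, len);
        if (n <= 0) {
            /* Real code would handle SSL_ERROR_WANT_READ/WANT_WRITE,
             * clean shutdown and fatal errors here. */
            int err = SSL_get_error(ssl, n);
            (void) err;
        }
        return n;
    }

    /* Calling this from native code when a worker thread is done would
     * avoid the extra JNI crossing; when and where to call it is exactly
     * the open question in this thread. */
    static void sketch_thread_cleanup(void)
    {
        ERR_remove_thread_state(NULL);  /* OpenSSL 1.0.x; NULL = current thread */
    }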
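The executor numbers above show the pool growing to ~400 threads during
the 4k runs and shrinking back to ~80 afterwards. If pool threads exit
without removing their entry from OpenSSL's per-thread error-state hash,
entries for dead threads accumulate, which would be one (speculative)
reason ERR_remove_thread_state helps. Below is one possible native-side
pattern for tying that cleanup to thread exit via a pthread TLS
destructor. Again, this is a sketch under that assumption, not something
tc-native does today; all names here are hypothetical.

    #include <pthread.h>
    #include <openssl/err.h>

    static pthread_key_t  err_cleanup_key;
    static pthread_once_t err_cleanup_once = PTHREAD_ONCE_INIT;

    /* Runs automatically when a thread that registered itself exits. */
    static void err_cleanup_destructor(void *unused)
    {
        (void) unused;
        /* Drop this thread's entry from OpenSSL's err.c state hash. */
        ERR_remove_thread_state(NULL);
    }

    static void err_cleanup_key_init(void)
    {
        (void) pthread_key_create(&err_cleanup_key, err_cleanup_destructor);
    }

    /* Call once from each worker thread (e.g. on its first SSL call);
     * the non-NULL value only exists so the destructor fires at exit. */
    static void err_cleanup_register_current_thread(void)
    {
        (void) pthread_once(&err_cleanup_once, err_cleanup_key_init);
        (void) pthread_setspecific(err_cleanup_key, (void *) 1);
    }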