Subject: Re: Bug that spans tomcat and tomcat-native
From: Mark Thomas
To: Tomcat Developers List
Date: Mon, 27 Jun 2016 09:32:58 +0100

Thanks. This is all really helpful.

I need to spend some time digesting all of this information, reviewing
the OpenSSL code and trying to figure out what might be going on. One of
the things I am trying to do is understand why the proposed patch fixes
the issue and whether an alternative approach is possible. Ideally, I'd
like to fix this entirely in tc-native so we don't have to cross the
Java-native boundary an additional time, which is relatively slow.

Thanks for all your help tracking this down so far.

Mark

On 25/06/2016 00:51, therealneworld@gmail.com wrote:
> On Fri, Jun 24, 2016 at 5:31 PM, Mark Thomas wrote:
>> On 24/06/2016 21:52, therealneworld@gmail.com wrote:
>>
>>>> I'm wondering if this is a problem that builds up over time. If I
>>>> understood your previous posts correctly, running the big tests
>>>> immediately gave ~700MB/s, whereas running the small tests and then
>>>> the big tests resulted in ~350MB/s during the big tests. Are you able
>>>> to experiment with this a little bit? For example, if you do big
>>>> tests, 1M (~20%) small tests, big tests, 1M small tests, big tests,
>>>> etc., what is the data rate for the big tests after 0, 1M, 2M, 3M, 4M
>>>> and 5M little tests?
>>>
>>> Sure, I can try that. For the in-between tests, do you want me to run
>>> those for a set amount of time or for a set number of files? For
>>> example, each smaller batch for 20 min, then 10 min of large, and then
>>> the next smaller size?
>>
>> I was thinking a set number of files.
>>
>> It would also be useful to know how many threads the executor has
>> created at each point as well. (JMX should tell you that. You might
>> need to adjust the executor so it doesn't stop idle threads.)
>
> I saw your message about not stopping idle threads after I had already
> started things.
>
> 1st 100M test:
>   851348MB/s
>   Executor:
>     largestPoolSize: 25
>     poolSize: 25
> 1st 4k test:
>   Executor:
>     largestPoolSize: 401
>     poolSize: 401
> 2nd 100M test:
>   460147MB/s
>   Executor:
>     largestPoolSize: 401
>     poolSize: 79
> 2nd 4k test:
>   Executor:
>     largestPoolSize: 414
>     poolSize: 414
> 3rd 100M test:
>   429127MB/s
>   Executor:
>     largestPoolSize: 414
>     poolSize: 80
> 3rd 4k test:
>   Executor:
>     largestPoolSize: 414
>     poolSize: 401
> 4th 100M test:
>   437918MB/s
>   Executor:
>     largestPoolSize: 414
>     poolSize: 86
> 4th 4k test:
>   Executor:
>     largestPoolSize: 414
>     poolSize: 401
> 5th 100M test:
>   464837MB/s
>   Executor:
>     largestPoolSize: 414
>     poolSize: 87
>
> It looks like the problem occurs right after the first set of 4k puts
> and doesn't get any worse, so whatever causes the issue happens early.
> This is getting stranger, and I really cannot explain why calling
> ERR_remove_thread_state reliably improves performance.
>
>> Going back to your original description, you said you saw blocking
>> during the call to ERR_clear_err(). Did you mean ERR_clear_error()?
>> Either way, could you provide the full stack trace of an example
>> blocked thread? And, ideally, the stack trace of the thread currently
>> holding the lock? I'm still trying to understand what is going on here
>> since, based on my understanding of the code so far, the HashMap is
>> bounded (to the number of threads) and should reach that limit fairly
>> quickly.
>
> Sorry, yes, I did mean ERR_clear_error().
>
> #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> #1  0x00007f0f61ab989d in __GI___pthread_mutex_lock (mutex=0x2b49c58)
>     at ../nptl/pthread_mutex_lock.c:80
> #2  0x00007f0f3205f183 in int_thread_get (create=0) at err.c:446
> #3  0x00007f0f3205f68d in int_thread_get_item (d=0x7f0ca89c7ce0) at err.c:491
> #4  0x00007f0f32060094 in ERR_get_state () at err.c:1014
> #5  0x00007f0f320602cf in ERR_clear_error () at err.c:747
> #6  0x00007f0f325f3579 in ssl_socket_recv (sock=0x7f0dcc391980,
>     buf=0x7f0eec067820
>     "lock->199808-Source_filename->rhino_perf_https_lt_100g_a-Loop->1-Count->11089487-11089488-11089489-11089490-11089491-11089492-11089493-11089494-11089495-11089496-11089497-11089498-11089499-11089500-11"...,
>     len=0x7f0ca89c7ff0) at src/sslnetwork.c:401
> #7  0x00007f0f325ece99 in Java_org_apache_tomcat_jni_Socket_recvbb
>     (e=, o=, sock=, offset=, len=) at src/network.c:957
>
> I tried getting more data, but the JVM tends to dump core when gdb is
> attached, or it goes too slowly to actually cause the lock contention. I
> can reliably see a thread waiting on this lock if I attach to a single
> thread, randomly interrupt it, and look at the backtrace. When I look at
> the mutex it has a different owner each time, so different threads are
> getting the lock.
>
> I will play with this a bit more on Monday.
>
> -nate
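
For illustration only, a minimal sketch of the recv path that shows up in
frames #5/#6 of the backtrace above, assuming OpenSSL 1.0.x. This is not
the actual tc-native code and not the proposed patch; the function names
(sketch_ssl_recv, sketch_thread_cleanup) are made up, and only the
OpenSSL calls are real. It is meant to make concrete why every receive
touches the global error-state lock, and what doing the cleanup on the
native side (without an extra Java->native crossing) could look like.

    #include <openssl/err.h>
    #include <openssl/ssl.h>

    /* Simplified stand-in for a tc-native style SSL recv wrapper. */
    static int sketch_ssl_recv(SSL *ssl, char *buf, int len)
    {
        /* ERR_clear_error() -> ERR_get_state() -> int_thread_get() takes
         * the global err.c lock -- the mutex the blocked threads in the
         * backtrace are waiting on. */
        ERR_clear_error();

        int n = SSL_read(ssl, buf, len);
        if (n <= 0) {
            /* Real code would handle SSL_ERROR_WANT_READ/WANT_WRITE,
             * clean shutdown and fatal errors here. */
            int err = SSL_get_error(ssl, n);
            (void) err;
        }
        return n;
    }

    /* Calling this from native code when a worker thread is done would
     * avoid the extra JNI crossing; when and where to call it is exactly
     * the open question in this thread. */
    static void sketch_thread_cleanup(void)
    {
        ERR_remove_thread_state(NULL);  /* OpenSSL 1.0.x; NULL = current thread */
    }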
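The executor numbers above show the pool growing to ~400 threads during
the 4k runs and shrinking back to ~80 afterwards. If pool threads exit
without removing their entry from OpenSSL's per-thread error-state hash,
entries for dead threads accumulate, which would be one (speculative)
reason ERR_remove_thread_state helps. Below is one possible native-side
pattern for tying that cleanup to thread exit via a pthread TLS
destructor. Again, this is a sketch under that assumption, not something
tc-native does today; all names here are hypothetical.

    #include <pthread.h>
    #include <openssl/err.h>

    static pthread_key_t  err_cleanup_key;
    static pthread_once_t err_cleanup_once = PTHREAD_ONCE_INIT;

    /* Runs automatically when a thread that registered itself exits. */
    static void err_cleanup_destructor(void *unused)
    {
        (void) unused;
        /* Drop this thread's entry from OpenSSL's err.c state hash. */
        ERR_remove_thread_state(NULL);
    }

    static void err_cleanup_key_init(void)
    {
        (void) pthread_key_create(&err_cleanup_key, err_cleanup_destructor);
    }

    /* Call once from each worker thread (e.g. on its first SSL call);
     * the non-NULL value only exists so the destructor fires at exit. */
    static void err_cleanup_register_current_thread(void)
    {
        (void) pthread_once(&err_cleanup_once, err_cleanup_key_init);
        (void) pthread_setspecific(err_cleanup_key, (void *) 1);
    }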