tomcat-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Christopher Schultz <ch...@christopherschultz.net>
Subject Re: Recent tcnative null-dereference with 8.0.0-RC3 and 7.0.45 [tcnative-1.dll+0x7e23]
Date Thu, 03 Oct 2013 16:34:13 GMT
Mark,

On 10/3/13 11:42 AM, Mark Thomas wrote:
> On 03/10/2013 16:12, Christopher Schultz wrote:
>> On 10/3/13 9:18 AM, Mark Thomas wrote:
> 
>>> Having been responsible for introducing and fixing a number of
>>> these issues I disagree. It is usually fairly easy to tell what
>>> the problem is from the crash report (double close on a socket,
>>> invalid socket in the poller, etc). What is far more difficult is
>>> figuring out how things got into the known invalid state in the
>>> first place. No amount of debug information at the point of the
>>> crash is going to help with that.
>>
>> Agreed. Given that you have been researching these issues (and
>> fixing them -- thanks!) you have a unique perspective: even though
>> it's easy for you to determine the cause of a crash when you can
>> reproduce it, how about getting a crash report with little other
>> context? Would it help you to have more information or is the crash
>> report sufficient?
> 
> If I can't reproduce it then the fall-back is manual code review to
> try and figure out a code path (perhaps with some hints from the OP's
> report) that would result in the observed issue. As long as there is
> enough info in the crash report (there typically is) to figure out
> what is null / invalid then all is good. I haven't yet felt the need
> for further context.

Okay.

>>> A hard to reproduce bug in APR that triggers a crash is no more
>>> or less difficult to work with than a hard to reproduce bug in
>>> NIO that triggers an unexpected socket close.
>>
>> In those two cases -- NIO and APR -- is the NIO case any less 
>> catastrophic?
> 
> Sometimes yes, sometimes no.
>
>> I've been arguing that we should stop the JVM from crashing, but
>> locking-up the NIO connector and rendering the server inert is
>> roughly the same outcome. Is there actually any utility in stopping
>> the JVM from coming down? I guess you could get a thread dump from
>> an otherwise hung Tomcat, but probably not much else.
> 
> With a lock-up of infinite loop you can find out where you are (as
> with APR) but the how you got there is what you really want.
> 
> For an error that would otherwise result in a socket closure then a
> JVM crash is bad.

So there is at least some utility in preventing JVM crashes. The
question is whether or not avoiding a crash can meaningfully recover
that particular socket, poller, etc. If not, crashing is probably not
much better than the alternative.

> On the other hand, a JVM crash is a very strong motivation to fix an
> issue.

For *you*, or for the user? Certainly it is for the user, but given the
number of unfixed crash reports in Bugzilla, it doesn't seem like
tcnative-crash=quick-fix from the team. I've been trying to investigate
them whenever possible but it's hard for me to get more information and,
frankly, I don't know anything about APR socket management itself so my
time is only of limited use.

>>> I can think of several things that would be more useful: - Better
>>> Javadoc for the native methods. I can think of a number of times
>>> where better docs would have saved me a fair amount of time 
>>> debugging unexpected behaviour.
>>
>> Do you mean more documentation about how the method works, or even
>> just a simple description of what happens *at all*?
> 
> Some examples might help:
> - documenting the return codes for the non-blocking reads would have
> highlighted the problem that required the 1.1.28 release

Ok.

> - documenting the return codes for removing sockets from the poller
> would have highlighted the problem that is going to require the 1.1.29
> release

Ok.

> - documenting some of the constraints around using SSL would have
> saved me some time when getting SSL and WebSocket working

Can you be more specific without just writing the documentation
yourself? I'm hoping to help, but I'm not sure what SSL constraints you
are alluding to.

> As I have worked more with APR/native I have discovered various things
> that it would have been a great help to have in the docs. I've tried
> to remember to document them when I have found them but I've probably
> missed some.
> 
>>> - Something to turn an APR error response code into meaningful
>>> test.
>>
>> Can you explain this in a little more detail? For example, an APR
>> error code might be "bad socket", but as you say, the circumstances
>> of the bug are more important than the error code. How could such a
>> code be turned into a test case?
> 
> In the last few days I have come across the following error codes:
> -70014

I'm sure you know already that this is "End of file found"

> -730053

This one isn't valid, as far as I can tell (the error string is
"Unrecognized resolver error"). 70053 is "Error string not specified
yet" so I'm not sure what that one is.

The defined error codes for APR only go up to 70025 and even 70025 says
"Error string not specified".

> I can look them up to figure what they mean but it would save some
> time if the error report included a text version.

tcnative doesn't have an error log, so where could those strings go? Or
were you thinking of having a bridge from Java code into apr_strerr?

What about a program like perror that understands APR error codes? I've
written a simple one that could be helpful, but you probably did that
yourself already.

Anywhere more information can be added, I'm happy to help.

>>> - Refactoring the connectors so all socket access goes through
>>> the SocketWrapper so there is a much smaller set of code to
>>> validate.
>>
>> I'm guessing you are tackling this task slowly over time.
> 
> I am moving slowly in this direction. My ultimate aim is to have the
> connector type specific code only in the Endpoint and the
> SocketWrapper. No idea if that is possible. It is a longer term goal
> for Tomcat 9+ at this point.
> 
> At the very least whenever I add functionality to the connectors (e.g.
> non-blocking IO) I do enough refactoring that I only have to add the
> new stuff once.

Sounds good. Having unified code with only certain aspects separated
into BIO, NIO, and APR will certainly help folks like me understand the
"true" conceptual relationship between all of these components and make
it easier to actually help work on that part of the code.

-chris


Mime
View raw message