tomcat-dev mailing list archives

From Christopher Schultz <ch...@christopherschultz.net>
Subject Re: Recent tcnative null-dereference with 8.0.0-RC3 and 7.0.45 [tcnative-1.dll+0x7e23]
Date Thu, 03 Oct 2013 15:12:11 GMT
Mark,

On 10/3/13 9:18 AM, Mark Thomas wrote:
> On 03/10/2013 13:51, Christopher Schultz wrote:
>> Sebb,
>>
>> On 10/3/13 8:06 AM, sebb wrote:
>>> The problem is that bugs that reveal themselves as JVM crashes
>>> are much harder to debug.
>>
>> +1
>>
>> This is exactly the point I was arguing. When we get a JVM crash
>> report, the stack dump could be completely different depending upon
>> which architecture, kernel, compiler, optimization flags, etc. that
>> were used when compiling and/or running the library. Converting
>> SIGSEGV into an exception gives us a lot of freedom to publish tons
>> of useful information when reporting errors to the user.
>>
>> I'd rather get a report from a user that says something like "here
>> is my stack trace and error message, complete with name of variable
>> that was NULL and line of source code" rather than "here's my JVM
>> crash report, sorry I didn't get a core file, I'm having trouble
>> reproducing the error" (which is essentially what all
>> currently-open JVM-crash error reports look like in BZ for
>> tcnative).
> 
> Having been responsible for introducing and fixing a number of these
> issues I disagree. It is usually fairly easy to tell what the problem
> is from the crash report (double close on a socket, invalid socket in
> the poller, etc). What is far more difficult is figuring out how
> things got into the known invalid state in the first place. No amount
> of debug information at the point of the crash is going to help with that.

Agreed. Given that you have been researching these issues (and fixing
them -- thanks!) you have a unique perspective: even though it's easy
for you to determine the cause of a crash when you can reproduce it, how
about getting a crash report with little other context? Would it help
you to have more information or is the crash report sufficient?

> A hard to reproduce bug in APR that triggers a crash is no more or
> less difficult to work with than a hard to reproduce bug in NIO that
> triggers an unexpected socket close.

In those two cases -- NIO and APR -- is the NIO case any less
catastrophic? I've been arguing that we should stop the JVM from
crashing, but locking up the NIO connector and rendering the server
inert is roughly the same outcome. Is there actually any utility in
stopping the JVM from coming down? I guess you could get a thread dump
from an otherwise hung Tomcat, but probably not much else.

> You may have noticed that I have slowly been adding debug code to the
> low level connector code, primarily in the Endpoints and the Protocol
> implementations. All of this debug code has proven useful in tracking
> down the type of bug that triggers a crash with APR.
> 
> Additional validity checks in the native code provide for a more
> graceful failure mode but offer little other benefit as the useful
> information is more focused on how the current state was achieved
> rather than what the current state is.

Agreed. Remember, I was just trying to stop crashes. We have a load of
crash reports in various versions of tcnative and if the problem
actually isn't in tcnative, it would be nice to get those reported
against the components that actually have bugs.
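
To make "just trying to stop crashes" concrete, here is a rough sketch
of a Java-side check that rejects a NULL (0) native handle before it
can reach native code. Everything in it is an assumption made for
illustration: NativeHandleGuard and requireValid are hypothetical
names, not existing tcnative or Tomcat code.

```java
// Hypothetical sketch: none of these names exist in tcnative or Tomcat.
final class NativeHandleGuard {

    private NativeHandleGuard() {
        // Static helper only; not meant to be instantiated.
    }

    /**
     * Returns the handle unchanged if it is non-zero. Throws an
     * exception naming the offending variable if the handle is 0
     * (NULL), instead of letting native code dereference it.
     */
    static long requireValid(long handle, String name) {
        if (handle == 0L) {
            throw new IllegalStateException(
                    "Native handle '" + name + "' is NULL (0); "
                    + "refusing to call into native code");
        }
        return handle;
    }
}
```

A caller would wrap the pointer on its way into a native method, e.g.
requireValid(socket, "socket"). As you point out, this only reports
what the state is, not how it got that way.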

> I'm -0 on adding additional checks to the native code.

Noted.

> I can think of several things that would be more useful:
> - Better Javadoc for the native methods. I can think of a number of
> times where better docs would have saved me a fair amount of time
> debugging unexpected behaviour.

Do you mean more documentation about how the method works, or even just
a simple description of what happens *at all*?

> - Something to turn an APR error response code into meaningful test.

Can you explain this in a little more detail? For example, an APR error
code might be "bad socket", but as you say, the circumstances of the bug
are more important than the error code. How could such a code be turned
into a test case?

> - Refactoring the connectors so all socket access goes through the
>   SocketWrapper so there is a much smaller set of code to validate.

I'm guessing you are tackling this task slowly over time.
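
For what it's worth, this is roughly how I picture the benefit of a
single access point. The sketch below is NOT Tomcat's SocketWrapper;
GuardedSocket and its methods are hypothetical names. It remembers
where the first close happened, so a double close or a use-after-close
reports how the socket got into that state rather than only that it
is invalid.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; this is not Tomcat's SocketWrapper.
final class GuardedSocket {

    private final long handle;

    // Stack trace of the first close, or null while the socket is open.
    private final AtomicReference<Throwable> closedAt =
            new AtomicReference<>();

    GuardedSocket(long handle) {
        this.handle = handle;
    }

    /** Returns the native handle, or fails if the socket was closed. */
    long handleForUse() {
        Throwable firstClose = closedAt.get();
        if (firstClose != null) {
            throw new IllegalStateException(
                    "Socket " + handle + " used after close", firstClose);
        }
        return handle;
    }

    /** Marks the socket closed; a second close fails with the first as cause. */
    void close() {
        Throwable here = new Throwable("first close of socket " + handle);
        if (!closedAt.compareAndSet(null, here)) {
            throw new IllegalStateException(
                    "Double close of socket " + handle, closedAt.get());
        }
        // A real implementation would release the native socket here.
    }
}
```

The check-then-use in handleForUse() is still racy against a
concurrent close(), so this narrows the set of code to validate but
does not remove the need for correct synchronization.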

-chris

