db-derby-dev mailing list archives

From Bryan Pendleton <bpendle...@amberpoint.com>
Subject Re: [jira] Updated: (DERBY-491) Protocol exception when Network Server tries to return ~32K of data or greater in a result set for a Java stored procedure.
Date Thu, 26 Jan 2006 19:04:04 GMT

Hi Army,

I'm really glad that you noticed that the behavior in the regression test
didn't match the comments, because it gave me a chance to go back and really
cement my understanding of this bug fix.

I will update the fix description in the JIRA entry, and I'll also be posting
a new patch with an updated set of comments in the regression test, but I also
wanted to send out this description of the bug fix on the mailing list. This
is going to be another rather long message, but I'm hoping that at least a
couple of you will read it all the way through and tell me if you spot any
mistakes.

 > My guess is that the fix for DERBY-125 and/or DERBY-170 have changed the
 > symptoms of DERBY-491 from "protocol exception" to "hang"

It's actually DERBY-614 that did this, but you were exactly right in your guess.
The details, though, are pretty interesting, so read on :)

Basically, I got myself confused by working on multiple related bug fixes, and
as a result I put some inaccurate comments into the regression test, but I
believe the bottom line is this:
  - DERBY-492 is a duplicate of DERBY-491
  - the repro scripts for DERBY-491 and DERBY-492 both provoke multiple bugs
  - DERBY-491's script provokes DERBY-614, then DERBY-491
  - DERBY-492's script provokes DERBY-125, then DERBY-614, then DERBY-491
  - depending on the combination of fixes that you have, either repro script
    may get either a protocol violation exception, or a hang
  - once the other fixes (614 and 125) have been made, the most likely symptom
    of DERBY-491 is a hang.

Why are the above statements true? Here's why:

First, a quick summary of the bug fixes in question:
  - DERBY-125: When a large DDM object is segmented across multiple DSS
    continuation blocks, the last byte in each segment is corrupted. The last
    byte in the very first segment is not corrupted. Also, the DSS continuation
    headers for all segments except the last one are set to 0x8000, instead
    of 0xFFFF.
  - DERBY-614: When a QRYDTA message is split according to the LMTBLKPRC
    protocol, the server assumes that the client will always reply with a
    CNTQRY for that result set as its next message, when in fact there are
    many other possible messages the client could send next. Furthermore, when
    splitting a QRYDTA, the server finalizes the chain at that point and awaits
    the CNTQRY, even if it should have moved on to other processing.
  - DERBY-491: When implementing the LMTBLKPRC protocol, the server incorrectly
    uses physical offsets into the DDMWriter buffer as if they were logical
    offsets into the current DSS block. If the current QRYDTA block was not
    the first-and-only DSS block in the DDMWriter buffer, the server would:
    - incorrectly conclude it needed to be split when it didn't
    - extract the wrong data into the Result Set lookaside buffer when splitting
    - and truncate all data in the DDMWriter buffer beyond physical offset 32767
      (see the code sketch right after this list)
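
To make the DERBY-491 item concrete, here is a rough sketch of that
physical-versus-logical distinction. The class and method names below are
made up for illustration; this is not the real DDMWriter API, just the shape
of the bookkeeping involved:

    // Illustrative sketch only: these names are hypothetical and do not
    // exist in the Derby code base.
    public class OffsetSketch {
        private final byte[] buffer = new byte[64 * 1024];
        private int dssStartOffset;  // physical offset where the current DSS begins
        private int writeOffset;     // physical offset of the next byte to write

        // What the pre-fix code effectively used: a physical offset, i.e. the
        // total number of unsent bytes sitting in the buffer.
        int physicalOffset() {
            return writeOffset;
        }

        // What the LMTBLKPRC check actually needs: how far into the *current*
        // DSS we are, independent of earlier DSS blocks still in the buffer.
        int logicalOffsetInCurrentDss() {
            return writeOffset - dssStartOffset;
        }

        // The "should this QRYDTA be split?" decision has to use the logical
        // offset; using the physical offset splits too eagerly whenever other
        // DSS blocks precede the QRYDTA in the buffer.
        boolean needsSplit(int maxDssLength) {
            return logicalOffsetInCurrentDss() > maxDssLength;
        }
    }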

What the repro scripts for DERBY-491 and DERBY-492 do is to provoke the
LMTBLKPRC protocol to fire when the QRYDTA block is not the first-and-only
block in the DDMWriter buffer:

  - The 491 repro script returns multiple result sets, each with a QRYDTA
    block which exceeds 32K in size. The multiple result sets in the response
    are all chained together.
  - The 492 repro script returns a single result set with a single QRYDTA that
    is chained to and preceded by an extremely long DDM object which sends
    column description information for all 700+ columns returned by the proc.

Although, conceptually, DERBY-491 is a single bug which can be described
simply, it manifests itself in 3 separate routines of the
DRDAConnThread/DDMWriter interface, each with its own detailed symptoms,
as follows:
  - When DRDAConnThread called DDMWriter.getOffset, it was attempting to ask
    the question: "is the current DSS block longer than 32K?". Because it was
    actually using a physical offset, the code was really computing the
    answer to a different question: "is the current number of unsent bytes in
    the buffer longer than 32K?". So, DRDAConnThread was being tricked into thinking
    that it needed to split the QRYDTA block when in fact it may not have
    needed to split the QRYDTA block at all.
  - When DRDAConnThread called DDMWriter.copyDataToEnd, it was attempting
    to fetch the contents of the current DSS block which were beyond the
    32K limit of data in a single DSS block. However, this was only correct
    when the current DSS block began at offset 0 in the raw buffer. In any
    other situation, DRDAConnThread was fetching whatever data began at byte
    32767 in the raw buffer, which could be absolutely any data at all, and
    did not necessarily correspond to any DDM or DSS boundary in the buffer.
    For example, suppose that the raw buffer contained a 5K DSS, a 10K DSS,
    and a 20K DSS. When copyDataToEnd() ran, and returned the data starting at
    raw offset 32K, it was actually returning the data that was at logical
    offset 17K into that 3rd DSS, which doesn't make any sense at all for
    LMTBLKPRC processing (this arithmetic is worked through in a short sketch
    after this list).
  - And when DRDAConnThread called DDMWriter.setOffset, it would truncate
    the physical buffer at 32K. Using the example from the previous paragraph,
    this would leave the buffer with 3 chained DSS blocks: the 5K and 10K
    blocks were fine, but the 3rd block was a 17K block which was not at all
    what DRDAConnThread intended to do.
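
To put numbers on the 5K/10K/20K example above, here is a tiny worked
sketch. Again, nothing here is real Derby code; it just runs the arithmetic
from the copyDataToEnd discussion:

    // Hypothetical arithmetic sketch; the values mirror the example above.
    public class CopyDataToEndSketch {
        public static void main(String[] args) {
            int k = 1024;
            int[] dssLengths = { 5 * k, 10 * k, 20 * k }; // three chained DSS blocks
            int thirdDssStart = dssLengths[0] + dssLengths[1]; // 15K into the raw buffer
            int physicalCutoff = 32 * k - 1;                   // 32767, where the old code cut

            // The pre-fix code worked from the physical cutoff, which lands
            // roughly 17K into the 3rd DSS -- a point with no meaning for
            // LMTBLKPRC processing.
            System.out.println("cutoff is " + (physicalCutoff - thirdDssStart)
                    + " bytes into the 3rd DSS");

            // What the fix measures instead: the logical length of the current
            // DSS. Here it is 20K, which is under 32K, so no split (and no
            // truncation) was needed in the first place.
            System.out.println("current DSS length is " + dssLengths[2] + " bytes");
        }
    }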

So, now, let's look at how the 491 and 492 repro scripts behave under
various combinations of fixes. Given that the fixes for DERBY-125 and DERBY-614
have already been committed, some of these combinations are hard to create in
practice right now, but they arose during my development of these fixes and
that's what confused me and caused me to mis-describe the bug's symptoms:

First, consider the 491 repro script:

- If you have neither the 614 fix nor the 491 fix, then the 491 repro script
   gives a protocol exception in the client. What happens here is that the
   script intends to return 2 result sets, and it starts off a chained set of
   response messages with a RSLSTRM message indicating that there are 2 result
   sets, but then after it writes the first result set it sees that it needs
   to be split, so it splits the QRYDTA and *sends the chain right then* (this
   is the DERBY-614 bug at work). The client reads the first result set, finds
   the partial QRYDTA, and then goes to read the second result set. Instead of
   finding the second result set, it finds an end-of-chain indicator, and gives
   the protocol exception

      "actual code point -2 does not match expected code point 8709"

   in which -2 is the end-of-chain marker and 8709 (0x2205) is the OPNQRYRM
   code point for the second result set.

   This is the symptom that I was referring to with the "Depending on the
   details..." comments in lang/procedure.java, but that comment is not very
   accurate. I will revise the comment to be more precise.

- The same thing happens if you have the 491 fix, but not the 614 fix,
   because although the server doesn't corrupt the data when it splits it, it
   still doesn't send both result sets when it promised to, so the client still
   gets an end-of-chain when it expected the second result set. The details of
   the actual bytes that flow across the wire are slightly different, because
   now that you have the 491 fix, the QRYDTA is split at a true value of 32K,
   rather than being split at approximately 15K, which is what happened before the 491
   fix, but this just means that the client is able to process a little bit
   more data from the first QRYDTA before it gets the unexpected end-of-chain.

- If you have the 614 fix, but not the 491 fix, then what happens is that you
   get a hang (this is what Army noticed, and what triggered this long message).
   This is because, after the 614 fix, the server attempts to send both
   result sets, but due to the 491 bug the server truncates the message
   mid-block, so the last DSS block in the chain has a length set to 32K but
   only 20+K of data is actually sent, causing the client to patiently wait
   for the rest of the data (the client thinks this is just TCP network
   buffering at work, and that the data will eventually arrive). A sketch of
   what this looks like from the reading side follows this list.

   At the time that I was developing and testing the fix for 491, I did not have
   the 614 fix in my path, and so I didn't realize that this combination of
   fixes would have this behavior, and so the regression test comments don't
   reflect that properly.
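
For anyone who wants to picture why this turns into a hang rather than an
error, here is a minimal sketch of a length-prefixed read loop. This is not
the actual client code; it only assumes that the first two bytes of a DSS
header carry the DSS length, and that the client reads until it has that
many bytes:

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Hypothetical reader sketch, not the Derby client implementation.
    public class DssReadSketch {
        static byte[] readOneDss(InputStream in) throws IOException {
            DataInputStream din = new DataInputStream(in);
            // The server's DSS header claimed ~32K, but the 491 truncation
            // means only ~20K of it was ever written to the socket.
            int dssLength = din.readUnsignedShort();
            byte[] rest = new byte[dssLength - 2];
            // readFully blocks until all the promised bytes arrive. From the
            // client's point of view this is indistinguishable from ordinary
            // TCP buffering delay, so it just waits: the hang.
            din.readFully(rest);
            return rest;
        }
    }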

Second, consider the 492 repro script. There, both the 614 and the 125 fixes
are relevant, as is the 491 fix, so there are a number of cases:

- If you have none of the fixes, you get a hang. This is because the server
   builds up an enormous SQLCINRD message containing the column descriptions
   for the 700+ columns, then when it goes to write the QRYDTA, bug 491 kicks
   in and causes the server to truncate the buffer at the end of the first
   32K, which happens to be in the middle of the segmented SQLCINRD. The server
   then sends the first bits of the SQLCINRD, but the rest of it has all been
   lost, so the client hangs while waiting patiently for the rest of the data.

   Note that in this case the SQLCINRD was actually corrupted, because bug
   125 had kicked in, but then bug 491 caused us to throw away all those
   corrupted segments so we never noticed them. (A sketch of this kind of
   segmentation, and where an off-by-one would bite, follows this list.)

- If you have the 491 fix, *but not the 125 fix*, then you get a protocol
   exception on the client. In fact, what you get is exactly bug 125. The
   server prepares the enormous SQLCINRD message, then segments it, corrupting
   the last byte of each segment as it goes, but then it proceeds on to build
   the QRYDTA message and sends it correctly. The client then notices the
   corruption in the SQLCINRD message and reports

     only one of the VCM, VCS length can be greater than 0

   This is, in fact, exactly how I discovered the DERBY-125 off-by-one bug
   in the first place :)

- If you have the fixes for 491 and 125, but somehow you have had a regression
   and so you don't have the 614 fixes, then the 492 repro script turns out to
   reproduce bug 614, at least on the server side. By this I mean that the
   server would do its part to reproduce 614. However, since the symptoms of
   614 depend on the client sending some message *other* than a CNTQRY for the
   split QRYDTA's result set, the protocol exception does not occur, and instead
   what happens in this case is that the test just quietly passes. That's OK,
   because the 492 repro script is not intended to demonstrate 614.
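
Since the 125 segmentation keeps coming up here, one last sketch may help.
This is a hypothetical version of splitting a large DDM object (like that
enormous SQLCINRD) into continuation segments. It assumes the header
convention described earlier (high bit means another segment follows, low
15 bits carry the segment length, so a full intermediate segment gets
0x8000 | 0x7FFF == 0xFFFF); the real DDMWriter logic differs in its details:

    import java.io.ByteArrayOutputStream;

    // Hypothetical segmentation sketch, not the real DDMWriter code.
    public class SegmentationSketch {
        static final int MAX_SEGMENT = 0x7FFF; // 32767-byte payload per segment

        static byte[] segment(byte[] ddmObject) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int pos = 0;
            while (pos < ddmObject.length) {
                int remaining = ddmObject.length - pos;
                int segLen = Math.min(remaining, MAX_SEGMENT);
                boolean moreFollow = remaining > MAX_SEGMENT;

                // Continuation header: writing a bare 0x8000 here (length bits
                // dropped) instead of 0xFFFF is one half of the 125 symptom.
                int header = (moreFollow ? 0x8000 : 0x0000) | segLen;
                out.write((header >> 8) & 0xFF);
                out.write(header & 0xFF);

                // Copying exactly segLen bytes starting exactly at pos is what
                // an off-by-one breaks: copy from one byte off and the last
                // byte of every segment after the first ends up corrupted --
                // the other half of 125, and the source of the "VCM, VCS"
                // complaint the client reports.
                out.write(ddmObject, pos, segLen);
                pos += segLen;
            }
            return out.toByteArray();
        }
    }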

I think that's about all there is to say. If you take the current trunk, and
run the new 491 and 492 regression tests without the 491 fix, you get a hang,
and that's that. I still think that both regression tests are useful, because
they expose different aspects of the interaction between the various parts of
the DSS protocol implementation on the server side, and so I think it's useful
to put both scripts into the regression suite. Even though just one of those
tests would be adequate to catch an accidental regression of 491, there are
other types of regressions which might be caught by one of those tests and
not by the other.

I will update the regression test to contain better comments and repost the
patch, and I'll also update and repost the changes.html file so that it
captures most of this information.

Thanks for reading this!

