I'm not sure if this will fully resolve your issue, but change your TSContCall()s so they don't use TS_EVENT_TIMEOUT or TS_EVENT_IMMEDIATE: doing so causes the internal event count to go negative, a known bug (TS-935) that I'll be working on a fix for. Can you try changing these to a new event type and let me know if this helps?
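
Something along these lines (just a sketch; the event value and the
names here are arbitrary placeholders, nothing special):

#include <ts/ts.h>

/* Arbitrary plugin-private event code, kept well away from the built-in
   TSEvent values so it can't be confused with TS_EVENT_IMMEDIATE (1) or
   TS_EVENT_TIMEOUT (2). */
#define MY_PLUGIN_EVENT_WAKEUP ((TSEvent)60000)

static void
wake_peer(TSCont peer_contp, void *edata)
{
  /* was: TSContCall(peer_contp, TS_EVENT_IMMEDIATE, edata); */
  TSContCall(peer_contp, MY_PLUGIN_EVENT_WAKEUP, edata);
}

/* ...and in the peer continuation's handler, switch on
   MY_PLUGIN_EVENT_WAKEUP instead of TS_EVENT_IMMEDIATE/TIMEOUT. */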

Brian

On Fri, Feb 17, 2012 at 10:21 PM, Brian Geffon <briangeffon@gmail.com> wrote:
Just out of curiosity, are you by any chance doing a TSContCall() with a TS_EVENT_IMMEDIATE or TS_EVENT_TIMEOUT?

Brian

2012/2/17 Whitaker, James X. -ND <James.X.Whitaker.-ND@disney.com>

Hi All,

Sorry for being long and babbling, but it's getting late in the day and
this needs to get fixed.  There is a fair amount of information, and
it's hard to organize.

ATS with our plugin is crashing because of an unexpected event "2",
evidently created by something inside ATS that appears to be going
berserk, attempting to retry an event, or perhaps trying to demand that
our plug-in clean up after itself.  What I need to know is: what is the
meaning of this "event number 2", where in the ATS code is it being
scheduled, and what should we be doing?  How do I track down where it
is coming from?

I have run through about 100 hypotheses now about where the crash might
be coming from, and each time I prove myself wrong.  On the other hand,
I am learning a lot about ATS innards, making me wonder if I am soon to
be committer material...  I am amazed at how difficult it is to describe
this crash in a post; this is one of several attempts.

It will sometimes run for hours without incident.  Sometimes for maybe a
minute.

I have been assuming that we were doing something wrong in our plugin.
But for the life of me, I can't see what we are doing to contribute to
this, or what part of the ATS code could be objecting to what we are
doing.  The more I dig around, the more baffling the crash becomes...

Version is 3.0.1, with our plugin, running on RedHat EL5.

The stack trace in the core file shows EThread::execute calling into
EThread::process_event, which calls the callback function for
INKContInternal, which in turn calls the event handler we registered
with TSContCreate for a continuation used to assist the HttpSM/TSHttpTxn
class, with an event code of TS_EVENT_TIMEOUT.  We switch on the event
parameter of the callback, and the default switch case catches this
event #2 (TS_EVENT_TIMEOUT) and does a TSHttpTxnReenable on the
transaction.  The crash occurs when we attempt the obvious,
TSHttpTxnReenable on what appears to be a valid transaction pointer, and
a sanity check on the validity of the transaction fails.  ATS asserts
and shuts down.  m_deletable is 1, m_deleted is 0, so it is not the
result of an attempt to destroy the object.  Apparently something is
merely trying to give it a nudge and wake it up?  The object shows up
with a live magic number, but the HttpSM pointer coming in as arg3
appears to point into garbled memory (with a dead-flag magic number).
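
Roughly, the handler looks like this (a stripped-down sketch of the
shape, not our exact code; the names are illustrative):

#include <ts/ts.h>

static int
txn_handler(TSCont contp, TSEvent event, void *edata)
{
  TSHttpTxn txnp = (TSHttpTxn)edata;

  switch (event) {
  case TS_EVENT_HTTP_READ_REQUEST_HDR:
    /* inspect the request, add hooks for the later stages ... */
    TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
    break;

  /* ... cases for the other hooked stages ... */

  default:
    /* This is where the mystery event 2 (TS_EVENT_TIMEOUT) lands.
       edata looks like a valid transaction pointer, but the reenable
       trips the sanity check in InkAPI.cc and ATS asserts. */
    TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
    break;
  }
  return 0;
}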

In other words, TSConts intended for assisting HttpSM objects are
coming to life on their own somehow, or perhaps as an unintended side
effect of something we are doing?  In this case it is happening after a
transaction is over... or perhaps after an error?  Something put it on
the event queue, and I have no idea how it is getting there, or why.  I
haven't found the code that does it; it certainly does not appear to be
the product of anything in our plug-in.  And I haven't found
documentation of why TS_EVENT_TIMEOUT is there, either.

Our code uses TSHttpTxnReenable to bump the transaction along when the
callback is done processing each stage.  Nothing in our code puts this
TSCont, which is linked to our transaction state machine callback
function, directly on the event queue; it is only used in our code to
assist the HttpSM class, which hooks it with the apihooks.

I don't see this kind of behavior documented anywhere.

The best theory so far is that certain error conditions might cause us
to leak the TSConts that we use to assist the HttpSM/TSHttpTxn class,
and these leaked TSConts are being woken up and put on the event queue
long after the transactions disappear from memory and are cleaned up.
I can't figure out where they might be getting woken up, or why, or why
a once-in-a-blue-moon leaked TSCont that was long abandoned and was
never on the event queue would be of interest to anything or anyone.
The only evidence that they might be leaking is that we see them linked
to non-operating HttpSM objects in the core files.  There is at least
one spot in the code where the callback function issues a Reenable on
the transaction object into the error state without doing a destroy on
the TSCont.
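
If that theory is right, one way to plug the hole might be to route all
cleanup through a TS_HTTP_TXN_CLOSE_HOOK, so the continuation is
destroyed exactly once no matter which path the transaction takes.  A
sketch (this assumes the continuation is also hooked on
TS_HTTP_TXN_CLOSE_HOOK when it is created; request_is_acceptable() is a
hypothetical stand-in for whatever sends us down the error path):

#include <ts/ts.h>

static int request_is_acceptable(TSHttpTxn txnp) { (void)txnp; return 1; }

static int
txn_handler(TSCont contp, TSEvent event, void *edata)
{
  TSHttpTxn txnp = (TSHttpTxn)edata;

  switch (event) {
  case TS_EVENT_HTTP_READ_REQUEST_HDR:
    if (!request_is_acceptable(txnp)) {
      /* Error path: push the transaction into the error state, but do
         NOT destroy contp here; TXN_CLOSE still fires and remains the
         one place that cleans up. */
      TSHttpTxnReenable(txnp, TS_EVENT_HTTP_ERROR);
      break;
    }
    TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
    break;

  case TS_EVENT_HTTP_TXN_CLOSE:
    TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
    TSContDestroy(contp);   /* single, unconditional cleanup point */
    break;

  default:
    TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
    break;
  }
  return 0;
}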

Comments in the blacklist example say that TS_EVENT_TIMEOUT will show
up with a transaction pointer of NULL, which is not what we are seeing;
we are seeing a non-null pointer.  (Also, the blacklist example's
handling of the timeout looks wrong.  It appears to repeat the previous
processing step, rather than going to the next state that should follow
once the previous step has finished.  Why would it want to repeat?)

In some of the examples and test code, we see TS_EVENT_TIMEOUT and
handlers that have inconsistent ways of dealing with the event.  We have
not found documentation on it.

The knee-jerk reaction was to pull in the transaction sanity-check code
from InkAPI.cc by declaring a prototype for it at the top of our plugin
file and calling it to decide whether to call TSHttpTxnReenable.
However, the HttpSM objects wind up on freelists, which means the
memory gets re-allocated.  If we then get a call on a stale TSCont that
points into the freelist memory of HttpSMs, we would likely point into
a "live" one, in which case we would be operating on the WRONG HttpSM,
wreaking havoc.  How do we know whether we are pointing to the original
HttpSM object, and not a new one in its memory slot?

If we get a timeout, do we just destroy the continuation?  Is that any
kind of answer?  What if the transaction is still using it?  Do we ignore
the event?  What are the repercussions of ignoring the event?  Will we
wind up with connections that die in mid-stream?
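
One defensive idea we have been toying with (just a sketch; we don't do
this bookkeeping today): record which transaction the continuation was
created for in its own data slot, clear it when the transaction closes,
and refuse to touch edata on any event that doesn't match:

#include <ts/ts.h>

static int
txn_handler(TSCont contp, TSEvent event, void *edata)
{
  /* Set with TSContDataSet(contp, txnp) when the continuation is
     created at TXN_START, and cleared below when the transaction
     closes. */
  TSHttpTxn owner = (TSHttpTxn)TSContDataGet(contp);

  if (owner == NULL || owner != (TSHttpTxn)edata) {
    /* Stale or spurious wake-up: whatever edata points at, it is not
       the HttpSM this continuation was built for, so leave it alone. */
    return 0;
  }

  if (event == TS_EVENT_HTTP_TXN_CLOSE) {
    TSContDataSet(contp, NULL);        /* nothing may use it after this */
    TSHttpTxnReenable(owner, TS_EVENT_HTTP_CONTINUE);
    TSContDestroy(contp);
    return 0;
  }

  /* ... the per-stage handling from the earlier sketches ... */
  TSHttpTxnReenable(owner, TS_EVENT_HTTP_CONTINUE);
  return 0;
}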

In our plugin's main continuation callback handler, when we get a
TS_EVENT_HTTP_TXN_START we set up a transaction-assisting continuation
with TSContCreate and add a TS_HTTP_READ_REQUEST_HDR_HOOK into an event
handler/callback; when that event arrives at the start of transaction
processing, we add a series of hooks back into the callback to process
the various stages of the transaction.
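
In rough outline (a sketch of the setup; the TXN_CLOSE hook and the
TSContDataSet() call come from the ideas above, not from the plugin as
it stands today):

#include <ts/ts.h>

static int txn_handler(TSCont contp, TSEvent event, void *edata);  /* per-txn handler sketched above */

static int
global_handler(TSCont contp, TSEvent event, void *edata)
{
  TSHttpTxn txnp = (TSHttpTxn)edata;

  if (event == TS_EVENT_HTTP_TXN_START) {
    TSCont txn_contp = TSContCreate(txn_handler, TSMutexCreate());

    TSContDataSet(txn_contp, txnp);   /* remember which txn we belong to */
    TSHttpTxnHookAdd(txnp, TS_HTTP_READ_REQUEST_HDR_HOOK, txn_contp);
    TSHttpTxnHookAdd(txnp, TS_HTTP_TXN_CLOSE_HOOK, txn_contp);
    /* the remaining per-stage hooks get added from txn_handler once
       TS_EVENT_HTTP_READ_REQUEST_HDR arrives */
  }
  TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
  return 0;
}

void
TSPluginInit(int argc, const char *argv[])
{
  (void)argc;
  (void)argv;
  TSHttpHookAdd(TS_HTTP_TXN_START_HOOK, TSContCreate(global_handler, NULL));
}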

I have been digging around for a while now, trying to figure out what
line or lines of code could possibly be generating this event "2".  It
seems like knowing where that event code is generated is critical to
knowing what to do about the anomaly.  Looking at code in PluginVC.cc
and other places, it looks to my eyes like a general design principle
in ATS is to re-schedule an event but NOT preserve the original event
code when rescheduling.  In many places, the practice appears to be
that if the queued-up event cannot get its mutex lock, the event is
re-issued, but the event type is clobbered with the default event type,
EVENT_INTERVAL, used by the schedule_* methods of the event and event
processor (see the headers in iocore/eventsystem).  I suspect that this
is where the event ID of "2" (EVENT_INTERVAL) is coming from.
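
The same loss of information is visible from the plugin API side: when
a continuation can't get its lock and re-schedules itself, the retry
comes back as TS_EVENT_TIMEOUT rather than as the event it was
originally called with.  A sketch of that pattern (assuming the 3.0
signatures of TSMutexLockTry() and TSContSchedule(); the 10 ms retry
interval is arbitrary):

#include <ts/ts.h>

static int
retrying_handler(TSCont contp, TSEvent event, void *edata)
{
  TSMutex lock = (TSMutex)TSContDataGet(contp);  /* some shared lock */

  (void)event;
  (void)edata;

  if (TSMutexLockTry(lock) != TS_SUCCESS) {
    /* Can't run right now; try again in 10 ms.  When the scheduled
       event fires, this handler is called with TS_EVENT_TIMEOUT; the
       event code we were originally invoked with is gone. */
    TSContSchedule(contp, 10, TS_THREAD_POOL_DEFAULT);
    return 0;
  }

  /* ... do the real work under the lock ... */
  TSMutexUnlock(lock);
  return 0;
}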

I remember a server design in the past that would issue events like
this to warn that the timing was bad, to see if the callback wanted to
terminate a sequence of events with an error.  But that server allowed
RE-ISSUING the delayed event.  Re-enabling with TS_EVENT_HTTP_CONTINUE
will not let the server re-issue the clobbered event; the code that
generates the transaction events only goes one way.  The code in
proxy/HttpSM.cc, HttpSM::state_api_callout, seems to my eyes to only
allow an incrementing series of events.

...So if we miss the cue/event that tells us we need to handle some
critical step, because the event code was masked by a "TIMEOUT"
(whatever that is?), then it would be impossible for the code to follow
a normal path because the state information would be lost.  It is hard
for me to believe that the designers would deliberately destroy state
information like this (the HttpSM class, for example, maintains the
state that gets passed down, insulating the callback from the event
system), so something else must be going on.

Now, under high levels of load, we see a number of errors.  I am
wondering whether at least some of them are from housekeeping events on
stale TSConts being woken up by some "garbage collector" I have yet to
find in ATS, and whether these stale TSConts are doing damage because
they point into re-incarnated HttpSM memory slots.

Also, there is another "sister" crash showing up in the logs that we do
not have a core dump for yet, but we do have this:

FATAL: InkAPI.cc:968: failed assert `!"Plugin tries to use a continuation which is deleted"`
/usr/local/bin/traffic_server - STACK TRACE:
/usr/local/lib/trafficserver/libtsutil.so.3(ink_fatal_va+0x9d)[0x2b701f5f92ad]
/usr/local/lib/trafficserver/libtsutil.so.3(ink_fatal+0x88)[0x2b701f5f9408]
/usr/local/lib/trafficserver/libtsutil.so.3(_ink_assert+0x85)[0x2b701f5f7b95]
/usr/local/bin/traffic_server(_ZN15INKContInternal12handle_eventEiPv+0xc7)[0x4aa247]
/usr/local/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x22f)[0x6c138f]
/usr/local/bin/traffic_server(_ZN7EThread7executeEv+0x3db)[0x6c1abb]
/usr/local/bin/traffic_server[0x6c06ce]
/lib64/libpthread.so.0[0x344fc0673d]
/lib64/libc.so.6(clone+0x6d)[0x344f4d3d1d]

I suspect they may have the same cause.


What is the proper thing to do here?  I don't know if this is an ATS
bug, a plug-in bug, both, many, or what.  The leaky-continuations idea
is wild guess number 100 or thereabouts about what is causing this
crash.  If I ignore this and just return 0 on sight of event == 2, is
something really bad going to happen, like a housekeeping function
clogging the priority queues with leaked continuations while trying to
wake them up so their owner cleans them up?

Servers usually issue things like this for a reason.  It would be nice
to have that reason documented, so that people like me don't end up
running around on a wild goose chase.

Thanks,

James Whitaker