Mailing-List: contact users-help@trafficserver.apache.org; run by ezmlm
Precedence: bulk
Reply-To: users@trafficserver.apache.org
Received-SPF: pass (athena.apache.org: domain of
 James.X.Whitaker.-ND@disney.com designates 204.128.192.17 as permitted
 sender)
From: "Whitaker, James X. -ND" <James.X.Whitaker.-ND@disney.com>
To: "users@trafficserver.apache.org" <users@trafficserver.apache.org>
Date: Thu, 23 Feb 2012 19:26:35 -0800
Subject: Re: crashing on TS_EVENT_TIMEOUT (RE: TS-935)
Thread-Topic: crashing on TS_EVENT_TIMEOUT (RE: TS-935)
Thread-Index: AczypDkHAdL4+dnPR2+UIGe4LcNCdQ==
Message-ID: <CB6C14C8.C76E%james.x.whitaker.-nd@email.disney.com>
In-Reply-To: <CB693002.C527%james.x.whitaker.-nd@email.disney.com>
Accept-Language: en-US
Content-Language: en-US
user-agent: Microsoft-MacOutlook/14.12.0.110505
acceptlanguage: en-US
Content-Type: text/plain; charset="Windows-1252"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0

Brian (and everyone else),

I would prefer to see events of TS_EVENT_DESTROY,
TS_EVENT_PLUGIN_REQUESTED_TIMER_1, TS_EVENT_PLUGIN_REQUESTED_TIMER_2...
TS_EVENT_PLUGIN_REQUESTED_TIMER_N, and TS_EVENT_INTERNAL_ERROR=3D2...

WHY?  I am kind of desperate to figure out where these events are coming
from.  I need to be able to track their origins.  I am currently operating
on the assumption that those event's that lead to crashes are left over
from previous continuations that for some reason didn't clean up after
themselves=8A that got destroyed because their event count hit zero and tha=
t
left INKContInternal::destroy started events landing on the next
unsuspecting continuation that occupied the same memory slot.   Or that
land in freed up memory slots for destroyed continuations, resulting in
that second stack trace below.

The current "TS_EVENT_TIMEOUT" arriving into a plugin handler could be any
of those... Which makes debugging and tracing through the code line by
line trying to find the source of "TS_EVENT_TIMEOUT" events that make no
sense at all absolutely baffling.  TS_EVENT_TIMEOUT makes no sense to me
at all.

If you get a TS_EVENT_DESTROY into your callback, you KNOW something in
the event system is out of sync, because it comes from only one place: the
API functions queuing up a destroy on a continuation created for a plugin.

If you get a TIMEOUT#5 event, and that was never supposed to be generated
for THAT continuation, you know you have a recycled memory slot for your
InkContInternal on your hands accepting some other continuation's stale
events that didn't clean up after itself properly and you know the exact
source of the event.  You can write code to defend against the bug
cropping up, or you can try to go after eliminating it ever showing up by
looking for errant mutex locks and etc.

If you get a TS_EVENT_INTERNAL_ERROR, you know you just got an event that
an API function is never supposed to receive...  That's supposed to be an
internal event type generated directly by the schedule_* method default
values, and classes like HttpSM and PluginVC and etc are supposed to
buffer this event type from the plugin and put their own event type stamps
on things.

The check for event count on the INKContInternal should look for these new
events.

TS_EVENT_TIMEOUT should change to have the same value as TIMER#1, and have
TSContSchedule only do TIMER#1, and add a NEW TSContScheduleEvent for the
other timers and destroy events so as to be backwards compatible and C
language compatible (since C compilers don't always accept default
arguments).

Much sanity in debugging plugin events would that bring.

Although, existing code that handled TS_EVENT_TIMEOUT specifically
intended to pick up stray junk like other continuations destroy events and
errant type#2's generated from the bowels of ATS because of bugs would not
function very well.

Ideas?

James Whitaker

>
>------
>
>From:  Brian Geffon <briangeffon@gmail.com>
>Reply-To:  "users@trafficserver.apache.org"
><users@trafficserver.apache.org>
>Date:  Fri, 17 Feb 2012 22:28:31 -0800
>To:  "users@trafficserver.apache.org" <users@trafficserver.apache.org>
>Subject:  Re: crashing on TS_EVENT_TIMEOUT
>
>
>I'm not sure if this will fully resolve your issue but change your
>TSContCalls() to not use TS_EVENT_TIMEOUT or TS_EVENT_IMMEDIATE because
>doing so will cause the event count to become negative, this is a known
>bug (TS-935) that I'll be working on a fix for. Can you try change these
>to a new event type and let me know if this helps?
>
>Brian
>
>On Fri, Feb 17, 2012 at 10:21 PM, Brian Geffon <briangeffon@gmail.com>
>wrote:
>
>
>Just out of curiosity, are you by any chance doing a TSContCall() with a
>TS_EVENT_IMMEDIATE or TS_EVENT_TIMEOUT?
>Brian
>
>2012/2/17 Whitaker, James X. -ND <James.X.Whitaker.-ND@disney.com>
>
>Hi All,
>
>Sorry for being long, and babbling, but it's getting late in the day, and
>this is needs to get fixed.  There is a fair amount of information, and
>it's hard to organize it.
>
>ATS with our plugin is crashing...  Because of an unexpected event "2",
>evidently created by something inside ATS that appears to be going
>berserk, attempting to retry an event, or perhaps to try to demand that
>our plug-in clean up after itself.  What I need to know is: what is the
>meaning of of this "event number 2", where in ATS code is it being
>scheduled, and what should we be doing?  How do I track down where this is
>coming from?
>
>I have run through about 100 hypothesis now about where the crash might be
>coming from, and each time I prove myself wrong.  On the other hand, I am
>learning a lot about ATS innards, making me wonder if I am soon to be
>committer material...  I am amazed at how difficult it is to describe this
>crash in a post - this is one of several attempts.
>
>It will sometimes run for hours without incident.  Sometimes for maybe a
>minute.
>
>I have been assuming that we were doing something wrong in our plugin.
>But for the life of me, I can't seem to see what we are doing to
>contribute to this, or what part of ATS code could be objecting to what we
>are doing.  The more I dig around, the more the nature of the crash just
>gets more and more baffling=8A
>
>Version is 3.0.1, with our plugin, running on RedHat EL5.
>
>The stack trace in the core file consists of EThread::execute calling into
>EThread::process_event calling into the callback function for
>INKContInternal, which calls into the event handling function we
>registered with TSContCreate for a continuation used to assist the
>HttpSM/TSHttpTxn class, with an event code of TS_EVENT_TIMEOUT.  We switch
>on the event parameter of the callback, and the default switch case
>catches this event #2 (TS_EVENT_TIMEOUT) which does a TSHttpTxnReenable on
>the transaction. The crash occurs when a sanity check on the validity of
>the transaction fails when we attempt to do the obvious: TSHttpTxnReenable
>on what appears to be a valid transaction pointer.  ATS asserts and shuts
>down.  m_deletable is 1, m_deleted is 0.... So it is not the result of an
>attempt to destroy the object...  Apparently something is merely trying to
>give it a nudge and wake it up?  The object shows up with a live magic
>number, but the HttpSM pointer coming into arg3 appears to point into
>garbled memory (with a dead flag magic number).
>
>In otherwords, TSCont's intended for assisting HttpSM objects are coming
>to life on their own somehow, or perhaps as an unintended side effect of
>something we are doing?  In this case it is happening after a transaction
>is over.... Or perhaps after an error?  Something put it on the event
>cue... And I have no idea how it is getting there, or why.  I haven't
>found the code that does it... It certainly does not appear to be the
>product of anything in our plug-in...  and I haven't found documentation
>of why TS_EVENT_TIMEOUT is there, either.
>
>Our code uses TSHttpTxnReenable to bump the transaction along when the
>callback is done processing each stage.  Nothing in our code puts this
>TSCont linked to our transaction state machine callback function directly
>on the event cue, it is only used in our code to assist the HttpSM class,
>which hooks it with the apihooks.
>
>I don't see this kind of behavior documented anywhere.
>
>Best theory so far is that certain error conditions might cause us to leak
>TSCont's that we use to assist the HttpSM/TSHttpTxn class, and these leaky
>TSCont's are being woken up and put on the event Queue long after the
>Transactions disappear from memory and are cleaned up.  I can't figure out
>where they might be getting woken up, or why, or why a once in a blue-moon
>leaked TSCont that was long abandoned and was never on the event queue
>would be of interest to anything or anyone.  The only evidence that they
>might be leaking is that we see these linked to non-operating HttpSM
>objects in the core files.  There is at least one spot in the code where
>the callback function issues a Reenable on the transaction object into the
>error state without doing a destroy on the TSCont.
>
>Comments in the blacklist example say that the TS_EVENT_TIMEOUT will show
>up with a transaction pointer of NULL, which we are not seeing.  Instead=
=AD
>we are seeing a non-null pointer.  (Also, the blacklist example for how to
>handle the timeout looks wrong. It looks like the blacklist example
>repeats the previous processing step, rather than going to the 'next
>state' that should occur after the previous one finished.  Why would it
>want to repeat?)
>
>In some of the examples and test code, we see TS_EVENT_TIMEOUT and
>handlers that have inconsistent ways of dealing with the event.  We have
>not found documentation on it.
>
>The knee jerk reaction was to put in a prototype to pull in the
>transaction sanity check code from InkApi.cc by declaring a prototype to
>it at the top of our plugin file, and call it to decide whether to call
>the TxnReenable.  However, the HttpSM objects wind up on freelists, which
>means the memory gets re-allocated.  If we then get a call on a stale
>TSCont that points into the freelist memory of HttpSM's, we would likely
>point into a "live" one, in which case we would be operating on the WRONG
>HttpSM, wreaking havoc.  How do we know whether we are pointing to the
>original HttpSM object, and not a new one in it's memory slot?
>
>If we get a timeout, do we just destroy the continuation?  Is that any
>kind of answer?  What if the transaction is still using it?  Do we ignore
>the event?  What are the repercussions of ignoring the event?  Will we
>wind up with connections that die in mid-stream?
>
>In our plugin's main continuation callback handler, when we get a
>TS_EVENT_HTTP_TXT_START, we set up a transaction assisting continuation
>with TSContCreate, a hook for TS_HTTP_READ_REQUEST_HDR_HOOK into an event
>handler/callback, and at the start of the transaction processing when that
>event arrives, a series of hooks back into the callback to process various
>stages of the transaction.
>
>I have been digging around for a while now, trying to figure out what line
>or lines of code could possibly be generating this event "2".  It seems
>like knowing where that event code is being generated is critical to
>knowing what to do about the anomaly.  Looking at code from PluginVC.cc
>and other places, it looks to my eyes that a general design principal in
>ATS is to follow the practice of re-scheduling an event, but NOT
>preserving the original event code when rescheduling.  There are many
>places that it appears that the practice followed is if the queued up
>event somehow does not get a mutex lock, the event will be re-issued but
>the event type will be clobbered with the default event type of
>EVENT_INTERVAL that is found in the various header files in
>iocore/eventsystem for the schedule_* methods of the event and event
>processor.  I suspect that this type of thing is where the event ID of "2"
>(EVENT_INTERVAL) is coming from.
>
>I remember a server design in the past that would issue events like this
>to warn that the timing was bad, to see if the callback wanted to
>terminate a sequence of events with an error.  But that server allowed
>RE-ISSUING the delayed event.  The call to TS_EVENT_HTTP_CONTINUE will not
>allow the server to re-issue the clobbered event.  The code that generates
>the transaction events only goes one way -- the code in proxy/HttpSM.cc
>HttpSM::state_api_callout seems to my eyes to only allow an incrementing
>series of events.
>
>...So if we miss the cue/event that tells us that we need to handle some
>critical step because the event code was masked by a "TIMEOUT" (whatever
>that is?) then it would be impossible for the code to follow a normal path
>because the state information would be lost.  It is hard for me to believe
>that the designers would deliberately destroy state information like this
>(the HttpSM class, for example maintains the state that gets passed down,
>insulating the callback from the event system), so something else must be
>going on.
>
>Now, under high levels of load, we see a number of errors.  I am wondering
>if at least some of these errors are from housekeeping events on stale
>TSCont's being woken up by some "garbage collector" I have yet to find in
>ATS...  And these stale TSConts are doing damage because they point into
>re-incarnated HttpSM memory slots?
>
>Also, there is another "sister" crash showing up in the logs that we do
>not have a core dump for yet, but we do have this:
>
>FATAL: InkAPI.cc:968: failed assert `!"Plugin tries to use a continuation
>which is deleted"`
>/usr/local/bin/traffic_server - STACK TRACE:
>/usr/local/lib/trafficserver/libtsutil.so.3(ink_fatal_va+0x9d)[0x2b701f5f9
>2
>ad]
>/usr/local/lib/trafficserver/libtsutil.so.3(ink_fatal+0x88)[0x2b701f5f9408
>]
>/usr/local/lib/trafficserver/libtsutil.so.3(_ink_assert+0x85)[0x2b701f5f7b
>9
>5]
>/usr/local/bin/traffic_server(_ZN15INKContInternal12handle_eventEiPv+0xc7)
>[
>0x4aa247]
>/usr/local/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x22f)[0
>x
>6c138f]
>/usr/local/bin/traffic_server(_ZN7EThread7executeEv+0x3db)[0x6c1abb]
>/usr/local/bin/traffic_server[0x6c06ce]
>/lib64/libpthread.so.0[0x344fc0673d]
>/lib64/libc.so.6(clone+0x6d)[0x344f4d3d1d]
>
>I suspect they may have the same cause.
>
>
>What is the proper thing to do here?  I don't know if this is an ATS bug,
>a plug-in bug, both, many, or what it is.  The leaky continuations idea is
>wild guess number 100 or thereabouts of what is causing this crash.  If I
>ignore this, and just return 0 on sight of event=3D=3D2 is something reall=
y
>bad going to happen, like a house cleaning function clogging the priority
>queues with leaked continuations trying to wake them up to get the owner
>to clean them up?
>
>Server's usually issue things like this for a reason.  It would be nice to
>have that reason documented, so that people like myself don't find myself
>running around on a wild goose chase.
>
>Thanks,
>
>James Whitaker
>
>
>
>
>
>
>
>
>
>
>
>
>