Return-Path: X-Original-To: apmail-trafficserver-users-archive@www.apache.org Delivered-To: apmail-trafficserver-users-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 3C9939BA7 for ; Sat, 18 Feb 2012 06:29:08 +0000 (UTC) Received: (qmail 69589 invoked by uid 500); 18 Feb 2012 06:29:08 -0000 Delivered-To: apmail-trafficserver-users-archive@trafficserver.apache.org Received: (qmail 69191 invoked by uid 500); 18 Feb 2012 06:28:58 -0000 Mailing-List: contact users-help@trafficserver.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: users@trafficserver.apache.org Delivered-To: mailing list users@trafficserver.apache.org Received: (qmail 69162 invoked by uid 99); 18 Feb 2012 06:28:57 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 18 Feb 2012 06:28:57 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of briangeffon@gmail.com designates 74.125.82.170 as permitted sender) Received: from [74.125.82.170] (HELO mail-we0-f170.google.com) (74.125.82.170) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 18 Feb 2012 06:28:52 +0000 Received: by werm1 with SMTP id m1so2824978wer.29 for ; Fri, 17 Feb 2012 22:28:31 -0800 (PST) Received-SPF: pass (google.com: domain of briangeffon@gmail.com designates 10.180.92.165 as permitted sender) client-ip=10.180.92.165; Authentication-Results: mr.google.com; spf=pass (google.com: domain of briangeffon@gmail.com designates 10.180.92.165 as permitted sender) smtp.mail=briangeffon@gmail.com; dkim=pass header.i=briangeffon@gmail.com Received: from mr.google.com ([10.180.92.165]) by 10.180.92.165 with SMTP id cn5mr3114439wib.2.1329546511469 (num_hops = 1); Fri, 17 Feb 2012 22:28:31 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=5DsypYuYDdAWvgAvc6kVd9vQ7fgrtpgmI3Vpros2F0Y=; b=Py/NcVMYlSzYznf33i2PN2O5zqQKn3wCdOMrI0orDeCDlCGUsmfjcwlA899lrgipqI 3QwBS+T2bvB7oX3RKn2pU8nC7N0vQXRJS9kDFccCogNISinqUfal9CiozInmC439QuFt o0Nxxyk1hVB+nSyG03m5x98qOhAMguaz+tpIg= MIME-Version: 1.0 Received: by 10.180.92.165 with SMTP id cn5mr2684083wib.2.1329546511360; Fri, 17 Feb 2012 22:28:31 -0800 (PST) Received: by 10.223.157.205 with HTTP; Fri, 17 Feb 2012 22:28:31 -0800 (PST) In-Reply-To: References: Date: Fri, 17 Feb 2012 22:28:31 -0800 Message-ID: Subject: Re: crashing on TS_EVENT_TIMEOUT From: Brian Geffon To: users@trafficserver.apache.org Content-Type: multipart/alternative; boundary=f46d043c092685a3cd04b9372b05 --f46d043c092685a3cd04b9372b05 Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: quoted-printable I'm not sure if this will fully resolve your issue but change your TSContCalls() to not use TS_EVENT_TIMEOUT or TS_EVENT_IMMEDIATE because doing so will cause the event count to become negative, this is a known bug (TS-935) that I'll be working on a fix for. Can you try change these to a new event type and let me know if this helps? Brian On Fri, Feb 17, 2012 at 10:21 PM, Brian Geffon wrote= : > Just out of curiosity, are you by any chance doing a TSContCall() with a > TS_EVENT_IMMEDIATE or TS_EVENT_TIMEOUT? > > Brian > > 2012/2/17 Whitaker, James X. -ND > > Hi All, >> >> Sorry for being long, and babbling, but it's getting late in the day, an= d >> this is needs to get fixed. There is a fair amount of information, and >> it's hard to organize it. >> >> ATS with our plugin is crashing... Because of an unexpected event "2", >> evidently created by something inside ATS that appears to be going >> berserk, attempting to retry an event, or perhaps to try to demand that >> our plug-in clean up after itself. What I need to know is: what is the >> meaning of of this "event number 2", where in ATS code is it being >> scheduled, and what should we be doing? How do I track down where this = is >> coming from? >> >> I have run through about 100 hypothesis now about where the crash might = be >> coming from, and each time I prove myself wrong. On the other hand, I a= m >> learning a lot about ATS innards, making me wonder if I am soon to be >> committer material... I am amazed at how difficult it is to describe th= is >> crash in a post - this is one of several attempts. >> >> It will sometimes run for hours without incident. Sometimes for maybe a >> minute. >> >> I have been assuming that we were doing something wrong in our plugin. >> But for the life of me, I can't seem to see what we are doing to >> contribute to this, or what part of ATS code could be objecting to what = we >> are doing. The more I dig around, the more the nature of the crash just >> gets more and more baffling=85 >> >> Version is 3.0.1, with our plugin, running on RedHat EL5. >> >> The stack trace in the core file consists of EThread::execute calling in= to >> EThread::process_event calling into the callback function for >> INKContInternal, which calls into the event handling function we >> registered with TSContCreate for a continuation used to assist the >> HttpSM/TSHttpTxn class, with an event code of TS_EVENT_TIMEOUT. We swit= ch >> on the event parameter of the callback, and the default switch case >> catches this event #2 (TS_EVENT_TIMEOUT) which does a TSHttpTxnReenable = on >> the transaction. The crash occurs when a sanity check on the validity of >> the transaction fails when we attempt to do the obvious: TSHttpTxnReenab= le >> on what appears to be a valid transaction pointer. ATS asserts and shut= s >> down. m_deletable is 1, m_deleted is 0.... So it is not the result of a= n >> attempt to destroy the object... Apparently something is merely trying = to >> give it a nudge and wake it up? The object shows up with a live magic >> number, but the HttpSM pointer coming into arg3 appears to point into >> garbled memory (with a dead flag magic number). >> >> In otherwords, TSCont's intended for assisting HttpSM objects are coming >> to life on their own somehow, or perhaps as an unintended side effect of >> something we are doing? In this case it is happening after a transactio= n >> is over.... Or perhaps after an error? Something put it on the event >> cue... And I have no idea how it is getting there, or why. I haven't >> found the code that does it... It certainly does not appear to be the >> product of anything in our plug-in... and I haven't found documentation >> of why TS_EVENT_TIMEOUT is there, either. >> >> Our code uses TSHttpTxnReenable to bump the transaction along when the >> callback is done processing each stage. Nothing in our code puts this >> TSCont linked to our transaction state machine callback function directl= y >> on the event cue, it is only used in our code to assist the HttpSM class= , >> which hooks it with the apihooks. >> >> I don't see this kind of behavior documented anywhere. >> >> Best theory so far is that certain error conditions might cause us to le= ak >> TSCont's that we use to assist the HttpSM/TSHttpTxn class, and these lea= ky >> TSCont's are being woken up and put on the event Queue long after the >> Transactions disappear from memory and are cleaned up. I can't figure o= ut >> where they might be getting woken up, or why, or why a once in a blue-mo= on >> leaked TSCont that was long abandoned and was never on the event queue >> would be of interest to anything or anyone. The only evidence that they >> might be leaking is that we see these linked to non-operating HttpSM >> objects in the core files. There is at least one spot in the code where >> the callback function issues a Reenable on the transaction object into t= he >> error state without doing a destroy on the TSCont. >> >> Comments in the blacklist example say that the TS_EVENT_TIMEOUT will sho= w >> up with a transaction pointer of NULL, which we are not seeing. Instead= =AD >> we are seeing a non-null pointer. (Also, the blacklist example for how = to >> handle the timeout looks wrong. It looks like the blacklist example >> repeats the previous processing step, rather than going to the 'next >> state' that should occur after the previous one finished. Why would it >> want to repeat?) >> >> In some of the examples and test code, we see TS_EVENT_TIMEOUT and >> handlers that have inconsistent ways of dealing with the event. We have >> not found documentation on it. >> >> The knee jerk reaction was to put in a prototype to pull in the >> transaction sanity check code from InkApi.cc by declaring a prototype to >> it at the top of our plugin file, and call it to decide whether to call >> the TxnReenable. However, the HttpSM objects wind up on freelists, whic= h >> means the memory gets re-allocated. If we then get a call on a stale >> TSCont that points into the freelist memory of HttpSM's, we would likely >> point into a "live" one, in which case we would be operating on the WRON= G >> HttpSM, wreaking havoc. How do we know whether we are pointing to the >> original HttpSM object, and not a new one in it's memory slot? >> >> If we get a timeout, do we just destroy the continuation? Is that any >> kind of answer? What if the transaction is still using it? Do we ignor= e >> the event? What are the repercussions of ignoring the event? Will we >> wind up with connections that die in mid-stream? >> >> In our plugin's main continuation callback handler, when we get a >> TS_EVENT_HTTP_TXT_START, we set up a transaction assisting continuation >> with TSContCreate, a hook for TS_HTTP_READ_REQUEST_HDR_HOOK into an even= t >> handler/callback, and at the start of the transaction processing when th= at >> event arrives, a series of hooks back into the callback to process vario= us >> stages of the transaction. >> >> I have been digging around for a while now, trying to figure out what li= ne >> or lines of code could possibly be generating this event "2". It seems >> like knowing where that event code is being generated is critical to >> knowing what to do about the anomaly. Looking at code from PluginVC.cc >> and other places, it looks to my eyes that a general design principal in >> ATS is to follow the practice of re-scheduling an event, but NOT >> preserving the original event code when rescheduling. There are many >> places that it appears that the practice followed is if the queued up >> event somehow does not get a mutex lock, the event will be re-issued but >> the event type will be clobbered with the default event type of >> EVENT_INTERVAL that is found in the various header files in >> iocore/eventsystem for the schedule_* methods of the event and event >> processor. I suspect that this type of thing is where the event ID of "= 2" >> (EVENT_INTERVAL) is coming from. >> >> I remember a server design in the past that would issue events like this >> to warn that the timing was bad, to see if the callback wanted to >> terminate a sequence of events with an error. But that server allowed >> RE-ISSUING the delayed event. The call to TS_EVENT_HTTP_CONTINUE will n= ot >> allow the server to re-issue the clobbered event. The code that generat= es >> the transaction events only goes one way -- the code in proxy/HttpSM.cc >> HttpSM::state_api_callout seems to my eyes to only allow an incrementing >> series of events. >> >> ...So if we miss the cue/event that tells us that we need to handle some >> critical step because the event code was masked by a "TIMEOUT" (whatever >> that is?) then it would be impossible for the code to follow a normal pa= th >> because the state information would be lost. It is hard for me to belie= ve >> that the designers would deliberately destroy state information like thi= s >> (the HttpSM class, for example maintains the state that gets passed down= , >> insulating the callback from the event system), so something else must b= e >> going on. >> >> Now, under high levels of load, we see a number of errors. I am wonderi= ng >> if at least some of these errors are from housekeeping events on stale >> TSCont's being woken up by some "garbage collector" I have yet to find i= n >> ATS... And these stale TSConts are doing damage because they point into >> re-incarnated HttpSM memory slots? >> >> Also, there is another "sister" crash showing up in the logs that we do >> not have a core dump for yet, but we do have this: >> >> FATAL: InkAPI.cc:968: failed assert `!"Plugin tries to use a continuatio= n >> which is deleted"` >> /usr/local/bin/traffic_server - STACK TRACE: >> >> /usr/local/lib/trafficserver/libtsutil.so.3(ink_fatal_va+0x9d)[0x2b701f5= f92 >> ad] >> >> /usr/local/lib/trafficserver/libtsutil.so.3(ink_fatal+0x88)[0x2b701f5f94= 08] >> >> /usr/local/lib/trafficserver/libtsutil.so.3(_ink_assert+0x85)[0x2b701f5f= 7b9 >> 5] >> >> /usr/local/bin/traffic_server(_ZN15INKContInternal12handle_eventEiPv+0xc= 7)[ >> 0x4aa247] >> >> /usr/local/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x22f)= [0x >> 6c138f] >> /usr/local/bin/traffic_server(_ZN7EThread7executeEv+0x3db)[0x6c1abb] >> /usr/local/bin/traffic_server[0x6c06ce] >> /lib64/libpthread.so.0[0x344fc0673d] >> /lib64/libc.so.6(clone+0x6d)[0x344f4d3d1d] >> >> I suspect they may have the same cause. >> >> >> What is the proper thing to do here? I don't know if this is an ATS bug= , >> a plug-in bug, both, many, or what it is. The leaky continuations idea = is >> wild guess number 100 or thereabouts of what is causing this crash. If = I >> ignore this, and just return 0 on sight of event=3D=3D2 is something rea= lly >> bad going to happen, like a house cleaning function clogging the priorit= y >> queues with leaked continuations trying to wake them up to get the owner >> to clean them up? >> >> Server's usually issue things like this for a reason. It would be nice = to >> have that reason documented, so that people like myself don't find mysel= f >> running around on a wild goose chase. >> >> Thanks, >> >> James Whitaker >> >> > --f46d043c092685a3cd04b9372b05 Content-Type: text/html; charset=windows-1252 Content-Transfer-Encoding: quoted-printable
I'm not sure if this will fully resolve your issue but change your= TSContCalls() to not use TS_EVENT_TIMEOUT or TS_EVENT_IMMEDIATE because do= ing so will cause the event count to become negative, this is a known bug (= TS-935) that I'll be working on a fix for. Can you try change these to = a new event type and let me know if this helps?

Brian

On Fri, Feb 17, 2012 at 10:21 PM, Brian Geff= on <briangeffon@gmail.com> wrote:
Just out of curiosity, are you by any chance doing a TSContCall() with a TS= _EVENT_IMMEDIATE or TS_EVENT_TIMEOUT?

Brian

<= div class=3D"gmail_quote">2012/2/17 Whitaker, James X. -ND <= James.X.Whitaker.-ND@disney.com>

Hi All,

Sorry for being long, and babbling, but it's getting late in the day, a= nd
this is needs to get fixed. =A0There is a fair amount of information, and it's hard to organize it.

ATS with our plugin is crashing... =A0Because of an unexpected event "= 2",
evidently created by something inside ATS that appears to be going
berserk, attempting to retry an event, or perhaps to try to demand that
our plug-in clean up after itself. =A0What I need to know is: what is the meaning of of this "event number 2", where in ATS code is it bein= g
scheduled, and what should we be doing? =A0How do I track down where this i= s
coming from?

I have run through about 100 hypothesis now about where the crash might be<= br> coming from, and each time I prove myself wrong. =A0On the other hand, I am=
learning a lot about ATS innards, making me wonder if I am soon to be
committer material... =A0I am amazed at how difficult it is to describe thi= s
crash in a post - this is one of several attempts.

It will sometimes run for hours without incident. =A0Sometimes for maybe a<= br> minute.

I have been assuming that we were doing something wrong in our plugin.
But for the life of me, I can't seem to see what we are doing to
contribute to this, or what part of ATS code could be objecting to what we<= br> are doing. =A0The more I dig around, the more the nature of the crash just<= br> gets more and more baffling=85

Version is 3.0.1, with our plugin, running on RedHat EL5.

The stack trace in the core file consists of EThread::execute calling into<= br> EThread::process_event calling into the callback function for
INKContInternal, which calls into the event handling function we
registered with TSContCreate for a continuation used to assist the
HttpSM/TSHttpTxn class, with an event code of TS_EVENT_TIMEOUT. =A0We switc= h
on the event parameter of the callback, and the default switch case
catches this event #2 (TS_EVENT_TIMEOUT) which does a TSHttpTxnReenable on<= br> the transaction. The crash occurs when a sanity check on the validity of the transaction fails when we attempt to do the obvious: TSHttpTxnReenable<= br> on what appears to be a valid transaction pointer. =A0ATS asserts and shuts=
down. =A0m_deletable is 1, m_deleted is 0.... So it is not the result of an=
attempt to destroy the object... =A0Apparently something is merely trying t= o
give it a nudge and wake it up? =A0The object shows up with a live magic number, but the HttpSM pointer coming into arg3 appears to point into
garbled memory (with a dead flag magic number).

In otherwords, TSCont's intended for assisting HttpSM objects are comin= g
to life on their own somehow, or perhaps as an unintended side effect of something we are doing? =A0In this case it is happening after a transaction=
is over.... Or perhaps after an error? =A0Something put it on the event
cue... And I have no idea how it is getting there, or why. =A0I haven't=
found the code that does it... It certainly does not appear to be the
product of anything in our plug-in... =A0and I haven't found documentat= ion
of why TS_EVENT_TIMEOUT is there, either.

Our code uses TSHttpTxnReenable to bump the transaction along when the
callback is done processing each stage. =A0Nothing in our code puts this TSCont linked to our transaction state machine callback function directly on the event cue, it is only used in our code to assist the HttpSM class, which hooks it with the apihooks.

I don't see this kind of behavior documented anywhere.

Best theory so far is that certain error conditions might cause us to leak<= br> TSCont's that we use to assist the HttpSM/TSHttpTxn class, and these le= aky
TSCont's are being woken up and put on the event Queue long after the Transactions disappear from memory and are cleaned up. =A0I can't figur= e out
where they might be getting woken up, or why, or why a once in a blue-moon<= br> leaked TSCont that was long abandoned and was never on the event queue
would be of interest to anything or anyone. =A0The only evidence that they<= br> might be leaking is that we see these linked to non-operating HttpSM
objects in the core files. =A0There is at least one spot in the code where<= br> the callback function issues a Reenable on the transaction object into the<= br> error state without doing a destroy on the TSCont.

Comments in the blacklist example say that the TS_EVENT_TIMEOUT will show up with a transaction pointer of NULL, which we are not seeing. =A0Instead= =AD
we are seeing a non-null pointer. =A0(Also, the blacklist example for how t= o
handle the timeout looks wrong. It looks like the blacklist example
repeats the previous processing step, rather than going to the 'next state' that should occur after the previous one finished. =A0Why would = it
want to repeat?)

In some of the examples and test code, we see TS_EVENT_TIMEOUT and
handlers that have inconsistent ways of dealing with the event. =A0We have<= br> not found documentation on it.

The knee jerk reaction was to put in a prototype to pull in the
transaction sanity check code from InkApi.cc by declaring a prototype to it at the top of our plugin file, and call it to decide whether to call
the TxnReenable. =A0However, the HttpSM objects wind up on freelists, which=
means the memory gets re-allocated. =A0If we then get a call on a stale
TSCont that points into the freelist memory of HttpSM's, we would likel= y
point into a "live" one, in which case we would be operating on t= he WRONG
HttpSM, wreaking havoc. =A0How do we know whether we are pointing to the original HttpSM object, and not a new one in it's memory slot?

If we get a timeout, do we just destroy the continuation? =A0Is that any kind of answer? =A0What if the transaction is still using it? =A0Do we igno= re
the event? =A0What are the repercussions of ignoring the event? =A0Will we<= br> wind up with connections that die in mid-stream?

In our plugin's main continuation callback handler, when we get a
TS_EVENT_HTTP_TXT_START, we set up a transaction assisting continuation
with TSContCreate, a hook for TS_HTTP_READ_REQUEST_HDR_HOOK into an event handler/callback, and at the start of the transaction processing when that<= br> event arrives, a series of hooks back into the callback to process various<= br> stages of the transaction.

I have been digging around for a while now, trying to figure out what line<= br> or lines of code could possibly be generating this event "2". =A0= It seems
like knowing where that event code is being generated is critical to
knowing what to do about the anomaly. =A0Looking at code from PluginVC.cc and other places, it looks to my eyes that a general design principal in ATS is to follow the practice of re-scheduling an event, but NOT
preserving the original event code when rescheduling. =A0There are many
places that it appears that the practice followed is if the queued up
event somehow does not get a mutex lock, the event will be re-issued but the event type will be clobbered with the default event type of
EVENT_INTERVAL that is found in the various header files in
iocore/eventsystem for the schedule_* methods of the event and event
processor. =A0I suspect that this type of thing is where the event ID of &q= uot;2"
(EVENT_INTERVAL) is coming from.

I remember a server design in the past that would issue events like this to warn that the timing was bad, to see if the callback wanted to
terminate a sequence of events with an error. =A0But that server allowed RE-ISSUING the delayed event. =A0The call to TS_EVENT_HTTP_CONTINUE will no= t
allow the server to re-issue the clobbered event. =A0The code that generate= s
the transaction events only goes one way -- the code in proxy/HttpSM.cc
HttpSM::state_api_callout seems to my eyes to only allow an incrementing series of events.

...So if we miss the cue/event that tells us that we need to handle some critical step because the event code was masked by a "TIMEOUT" (w= hatever
that is?) then it would be impossible for the code to follow a normal path<= br> because the state information would be lost. =A0It is hard for me to believ= e
that the designers would deliberately destroy state information like this (the HttpSM class, for example maintains the state that gets passed down, insulating the callback from the event system), so something else must be going on.

Now, under high levels of load, we see a number of errors. =A0I am wonderin= g
if at least some of these errors are from housekeeping events on stale
TSCont's being woken up by some "garbage collector" I have ye= t to find in
ATS... =A0And these stale TSConts are doing damage because they point into<= br> re-incarnated HttpSM memory slots?

Also, there is another "sister" crash showing up in the logs that= we do
not have a core dump for yet, but we do have this:

FATAL: InkAPI.cc:968: failed assert `!"Plugin tries to use a continuat= ion
which is deleted"`
/usr/local/bin/traffic_server - STACK TRACE:
/usr/local/lib/trafficserver/libtsutil.so.3(ink_fatal_va+0x9d)[0x2b701f5f92=
ad]
/usr/local/lib/trafficserver/libtsutil.so.3(ink_fatal+0x88)[0x2b701f5f9408]=
/usr/local/lib/trafficserver/libtsutil.so.3(_ink_assert+0x85)[0x2b701f5f7b9=
5]
/usr/local/bin/traffic_server(_ZN15INKContInternal12handle_eventEiPv+0xc7)[=
0x4aa247]
/usr/local/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x22f)[0x=
6c138f]
/usr/local/bin/traffic_server(_ZN7EThread7executeEv+0x3db)[0x6c1abb]
/usr/local/bin/traffic_server[0x6c06ce]
/lib64/libpthread.so.0[0x344fc0673d]
/lib64/libc.so.6(clone+0x6d)[0x344f4d3d1d]

I suspect they may have the same cause.


What is the proper thing to do here? =A0I don't know if this is an ATS = bug,
a plug-in bug, both, many, or what it is. =A0The leaky continuations idea i= s
wild guess number 100 or thereabouts of what is causing this crash. =A0If I=
ignore this, and just return 0 on sight of event=3D=3D2 is something really=
bad going to happen, like a house cleaning function clogging the priority queues with leaked continuations trying to wake them up to get the owner to clean them up?

Server's usually issue things like this for a reason. =A0It would be ni= ce to
have that reason documented, so that people like myself don't find myse= lf
running around on a wild goose chase.

Thanks,

James Whitaker



--f46d043c092685a3cd04b9372b05--