trafficserver-users mailing list archives

From "Whitaker, James X. -ND" <James.X.Whitaker....@disney.com>
Subject unexplained long latencies in ATS
Date Thu, 26 Jan 2012 02:34:30 GMT
Hi,

Apologies for the length, but this is a strange one.

The one thing that might be suspicious: I have a VM for development.  My
development environment is VMWare on a mac (i7 MacBookPro), with CentOS
5.7 as the guest OS that hosts ATS.

I have scoured over every line of the ATS source code in the last few
days, to educate myself and to attempt to answer the following question:

Why is every single request going to ATS taking so long to complete?
Callgrind and cachegrind show approximately one millisecond of CPU time
for one particular example involving a cache hit, but wireshark is showing
response times that range from 13ms to 130ms or more for end-to-end
processing of cache hits, with an average of 50ms.  ATS is apparently
going to sleep for a minimum of 10 milliseconds, typically 50 to 80
milliseconds, and sometimes as long as 20 seconds (when under load) before
wireshark shows ATS issuing either a cached response or a request to a
service that is queued up by our plugin.

As near as I can tell, ATS is putting each and every request to sleep,
rather than just blasting it through.

There are very respectable throughput numbers, but the time for each
request is stunningly slow.

This is absolutely baffling.

There is no I/O pending, no logging, no peer caches active...  nothing
else happening.  No other polling processes, no other requests in the
pipeline other than those issued by my ruby script that sends them out one
at a time.  I start it up cold (cleared cache, rebooted machine, the
works) and have a 512MB ramcache configured.... with 5GB of RAM allocated
to the VM.  I shut off all logging, all peer caching... anything that can
hit the disks or introduce any kind of dependency, and tell it to serve up
a 100 byte chunk of cached data....

I tried recompiling with EAGER_SIGNALLING and other such possibilities,
and they ALL increased the latency/lag.

A week ago, I was seeing about 5% of requests completing in an acceptable
2.2 milliseconds, but in the last week, the fastest cache-hit time I have
seen on an ATS server being polled by my ruby script is 12.2 milliseconds.
The shortest time shown in wireshark for issuing an http request queued up
by our plugin is 36ms after the initial GET coming into ATS, but this
delay is usually around 130 milliseconds.  I would expect to see half a
millisecond.  When the request comes back over the wire from the http
request instigated by our plugin, ATS issues the expected reply to the
client in less than a millisecond with no apparent sleep state
intervening.

I will run back-to-back requests from my polling ruby script, and the CPU
usage level is so low it sometimes does not show up in "top".

The AVERAGE response time is 40ms for what I would expect to be a
ram-cached request the server should handle in 1ms.  I have thrown
hundreds of polling processes and threads at it, and its responses per
second and overall throughput are quite respectable....  there is just
this bizarre inexplicable latency where ATS seems to be going to sleep on
every single request before doing anything with it, for absolutely no
reason.

I tried disabling accept threads by setting it to zero in records.config.
I tried every combination of thread counts and ratios I could think of.
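For reference, the knobs I was toggling are records.config lines along
these lines (variable names from the ATS docs; exact defaults vary by
version, so treat the values as examples, not recommendations):

```
CONFIG proxy.config.accept_threads INT 0
CONFIG proxy.config.exec_thread.autoconfig INT 0
CONFIG proxy.config.exec_thread.limit INT 2
```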

Looking over iocore/eventsystem/UnixEThread.cc, I notice that it is
apparently dequeuing from one queue and putting events on another queue
with a TIME TO EXECUTE marker associated with each event, and this is a
priority queue sorted into bins.  Since the contents of these priority
queues are time based with nanosecond granularity, they change radically
every millisecond; trying to figure out what is in them at any point in
time is not proving very easy...  debuggers are not cooperating with me.
This code, and its nanosecond granularity, strikes me as really odd.

The only thing I can think of is that there is a clock granularity/jitter
problem between the clock samples.  ATS code is sampling the time of day
clock, causing the re-enqueued priority queue items to suddenly get an
execution time stamp that is supposed to be a hundred nanoseconds or so
in the past (the time I would estimate it takes to process the
re-enqueuing of the priority queue items).  I suspect this time stamp
might have jitter in it, causing items just pulled off the queue to go to
sleep waiting for their "time" to come around.  This might be a CentOS
problem, it might be a VMWare clock problem, it might be a Mac clock
problem...  it might be a VMWare-buffering-the-time-of-day-clock problem
(that seems the most likely to me)....  One thought that comes to mind is
that VMWare might be the cause of all the lag, in the form of buffered
clocks and only allowing "wake up" style signals every 50ms or something
strange like that.  If it was a clock buffering problem, I would expect
to see a wide distribution of delays starting at ZERO and ranging up to
whatever the clock granularity problem is.  However, every request gets a
minimum of a 10ms delay.

There are bunches of timings between 11.9 and 12.3ms, a bunch between
20.78 and 23.7, and then the next batch is at 31 to 34 ms, etc... with
clusters between X2.0 ms and X4.0 ms for each request, where X is a
random tens-place digit.  This suggests a 10ms granularity in a timer or
a reference clock.  Maybe VMWare or CentOS only supports 10ms timer
granularity when something doesn't get immediately pulled off the queue?

Another thought I had was turning off chunking mode...  maybe ATS thinks
we are not done forming our responses or queries, and is buffering
everything up until a timeout tells it to go ahead and dump "incompletely
formed stream data" as chunk-buffered I/O, in an "OK, X milliseconds have
passed, just dump your buffers because there is a delay getting the next
chunk out" paradigm?  Are these 10, 20, 30, 40, 50, 60, 70ms delays the
products of waiting for a chunk buffering timeout?  Turning off chunking
mode in records.config dropped the average latency from 42 milliseconds
to 36.  I would think it would either make the problem go away completely
or do nothing at all.  Mystifying.
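For anyone trying to reproduce: the chunking knob I toggled is the
records.config line below (variable name per the ATS docs; verify it
against your build):

```
CONFIG proxy.config.http.chunking_enabled INT 0
```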

There are some spots in the code where we call out to
TSContCall(contp, TS_EVENT_VCONN_EOS, data); to tell it that a READ is
done so that ATS wakes up and continues.  That reduced an annoying
latency from 80ms to 15ms.

I tried putting in:

    TSContCall(TSVIOContGet(mRequestVio), TS_EVENT_VCONN_WRITE_COMPLETE, mRequestVio);

after one call to:

    mRequestVio = TSVConnWrite(mVc, httpCont, mServerRequestBufferReader, content_length);

This didn't do anything at all to the latencies.  Maybe this is the wrong
call?

I'm rather stumped.  Does anyone there have ideas?  Is the answer "ATS
was not designed to run on VMs"?

Thank you for reading this far,

James Whitaker


