jmeter-user mailing list archives

From Gil Tene <...@azulsystems.com>
Subject Re: Coordinated Omission (CO) - possible strategies
Date Sat, 19 Oct 2013 07:56:29 GMT
To focus on the "how to deal with Coordinated Omission" part:

There are two main ways to deal with CO in your actual executed behavior:

1. Change the behavior to avoid CO to begin with.

2. Detect it and correct it.

There is a "detect it and report it" option too, but I don't think it is of any real use, as detection
without correction will just tell you your data can't be believed at all, but won't tell you
anything about what can be. Since CO can move percentile magnitudes and positions by literally
multiple orders of magnitude (I have multiple measured real-world production behaviors that
show this), "hoping it is not too bad" when you know it is there amounts to burying your
head in the sand.

Avoiding CO [option 1] is obviously preferable where possible. For example, in load generators this
can be achieved if everything the load generator does is made asynchronous, or by making sure
that any synchronous part will never attempt to send messages closer together in time than
the largest possible stall the system under test may ever experience (with some extra padding,
this means "no closer than 10 minutes apart").
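
As a rough illustration of why the asynchronous approach avoids the omission, here is a toy simulation. Everything in it is an assumption made for the sketch: the function name run_tester, the fixed sending interval, and a server model with no queueing in which a request is delayed only if it is sent while the server is frozen.

```python
def run_tester(duration, interval, service_time, stall_start, stall_end, synchronous):
    """Simulate a tester against a server frozen during [stall_start, stall_end).

    Returns the list of recorded latencies (same time unit as the inputs).
    Toy model: no queueing, and a request is delayed only if it is sent
    while the server is frozen.
    """
    latencies = []
    send_at = 0
    while send_at < duration:
        if stall_start <= send_at < stall_end:
            # The reply cannot arrive until the stall is over.
            done_at = stall_end + service_time
        else:
            done_at = send_at + service_time
        latencies.append(done_at - send_at)
        if synchronous:
            # A synchronous tester cannot send again until the reply arrives.
            send_at = max(send_at + interval, done_at)
        else:
            # An asynchronous tester keeps to its schedule regardless.
            send_at += interval
    return latencies


sync_run = run_tester(2000, 10, 1, 100, 1100, synchronous=True)
async_run = run_tester(2000, 10, 1, 100, 1100, synchronous=False)
print(len(sync_run), sum(1 for x in sync_run if x > 1))
print(len(async_run), sum(1 for x in async_run if x > 1))
```

In this model both testers see the same worst-case latency, but the synchronous one records it once and then silently skips every request it would have sent during the stall, while the asynchronous one records a slow sample for each of them.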

But avoiding CO in your actual measured results is unfortunately impractical for many systems.
For example, in systems where actual individual clients interact with the system using in-order transports
(like TCP) with actual inter-request time gaps that are shorter than the stalls that occur in
the system, CO will absolutely occur, both in the real world and in any tester that emulates
it.

Correcting CO [option 2] is what you have to do if CO exists in the data measured by actual-executed-stuff.
Correction inevitably amounts to "filling in the gaps" by projecting (without certainty or
actual knowledge) a modeled behavior onto those gaps and adding data points to the data set
that did not actually get measured, but "would have" had CO not stopped the measurements
from being taken at the right points. There are various ways to correct CO in such data sets,
and how well they do depends on how much we know about the behavior of the system around the
gaps and how much we know about the gaps themselves (e.g. knowing that an actual complete stall
occurred is very useful).
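
One simple way to "fill in the gaps" can be sketched as follows. This is only one possible model, a linear back-fill, and it assumes the tester intended to send one request every expected_interval and that a long latency means the requests scheduled during it were never sent. The names backfill_latencies, percentile and expected_interval are made up for this sketch.

```python
def backfill_latencies(measured, expected_interval):
    """Return the measured latencies plus synthetic samples for skipped requests.

    For each latency longer than the expected interval, add the latencies the
    skipped requests would have seen: latency - interval, latency - 2*interval,
    and so on, down to one interval.
    """
    if expected_interval <= 0:
        raise ValueError("expected_interval must be positive")
    corrected = []
    for latency in measured:
        corrected.append(latency)
        missing = latency - expected_interval
        while missing >= expected_interval:
            corrected.append(missing)
            missing -= expected_interval
    return corrected


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 0 and 100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]


measured = [1] * 99 + [10_000]  # milliseconds: 99 fast replies, one long stall
corrected = backfill_latencies(measured, expected_interval=10)
print(percentile(measured, 50), percentile(corrected, 50))
```

With these made-up numbers the median of the raw data is 1 ms, while the median of the back-filled data is 4510 ms, which shows how far a single uncorrected stall can move a percentile.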

I think JMeter falls squarely into the synchronous tester camp, and that's not going to change.
Given that many (most?) systems it measures use TCP as a transport and naturally exhibit system
stalls that are longer than inter-request times in actual use behaviors, I see eliminating
CO from JMeter's actual measured results as hopeless. Coordinated Omission in JMeter is just
part of life, and we have to deal with it. I therefore focus on the "how to correct" part
of the equation.

Having played with correction techniques, I can say that random operation sequences (not random
timing) are the hardest thing to deal with. Not necessarily impossible, but really hard. Random
timing, on the other hand, is easily dealt with for correction purposes, as projecting known,
non-random sequences of operations into the CO gaps can be done just as well based on averaged
timing data.

So Kirk, is the random behavior you need one of random timing, or random operation sequencing
(or both)?

Sent from my iPad

On Oct 18, 2013, at 10:48 PM, "Kirk Pepperdine" <kirk.pepperdine@gmail.com>
wrote:


On 2013-10-19, at 1:33 AM, Gil Tene <gil@azulsystems.com>
wrote:

I guess we look at human response back pressure in different ways. It's a question of whether
or not you consider the humans to be part of the system you are testing, and what you think
your stats are supposed to represent.

You've seen my presentations and so you know that I do believe that human and non-human actors
are definitively part of the system. They provide the dynamics for the system being tested.
A change in how that layer in my model works can and does make a huge difference in how the
other layers work to support the overall system.

Some people will take the "forgiving" approach, which considers client behavior
as part of the overall system behavior. In such an approach, if a human responded to slow
behavior by not asking any more questions for a while, that's simply what the overall system
did, and the stats reported should reflect only the actual attempts that actual humans would
have made, including their slowing down their requests in response to slow reaction times.

Sort of. I want to know that a user was inhibited from making forward progress because the
previous step in their workflow blew stated tolerances. In some cases I'd like to have that
user abandon. I'm not sure I'd call this forgiving, though I am looking to see what the overall
system can do to answer the question: is it good enough and, if not, why not?

I'm not going to suggest your view is incorrect. I think it's quite valid. I don't believe
the two views are orthogonal; there are elements of both in each. The question here,
in more practical terms, is: what needs to be done to reduce the level of CO that currently
occurs in JMeter, and how should we react to it? Throwing out entire datasets from runs seems
like an academic answer to a more practical question: will our application stand up
under load? From my point of view, the point is for JMeter to better answer that question.


A web site being completely down for 5 minutes an hour would generate a lot of human back
pressure response. It may even slow down request rates so much during the outage that 99%+
of the overall actual requests by end users during an hour that included such a 5 minute outage
would still be very good. Reporting on those (actual requests by humans) would be very different
from reporting on what would have happened without human back pressure. But it's easy to examine
which of the two reporting methods would be accepted by a reader of such reports.

But then that 5 minute outage is going to show up somewhere, and if you bury it in how you
report... that would seem to be a problem. This whole argument suggests that what you want
is a better regime for the treatment of the data. If that is what you're saying, we're in
complete agreement. The 5 minute pause should not be filtered out of the data!

IMHO, the first thing to do is eliminate or reduce the known sources of CO from JMeter. I'm
not sure that tackling the CTT is the best way to go. In fact I'd prefer a combination of
approaches that includes things like how jHiccup works with a GC STW detector. As you've mentioned
before, even with a fix to the threading model in JMeter, CO will still occur.
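
The general idea behind a stall detector of this kind can be sketched as below: repeatedly sleep for a short fixed interval and record how much longer than requested each wake-up took. This is only an illustration of the idea, not jHiccup's actual code. The name record_hiccups and the injectable clock and sleep parameters are assumptions made for the sketch (they also make it testable without real delays).

```python
import time


def record_hiccups(samples, interval_s=0.001, clock=time.perf_counter, sleep=time.sleep):
    """Sleep `interval_s` seconds `samples` times and return the oversleep of each.

    An oversleep well above normal scheduling noise suggests the whole process
    (and so any tester running in it) was stalled for about that long.
    """
    hiccups = []
    for _ in range(samples):
        start = clock()
        sleep(interval_s)
        elapsed = clock() - start
        # Anything beyond the requested interval counts as a hiccup.
        hiccups.append(max(0.0, elapsed - interval_s))
    return hiccups


if __name__ == "__main__":
    observed = record_hiccups(samples=1000)
    print(f"largest observed hiccup: {max(observed) * 1000:.3f} ms")
```

Stalls found this way could then be used to decide where gaps in the tester's own measurements need correcting, since they mark periods when the tester itself could not have been sending requests.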

Regards,
Kirk

