incubator-drill-dev mailing list archives

From Constantine Peresypkin <pconstant...@gmail.com>
Subject Re: Jeff Dean on fast response in an unreliable world
Date Wed, 12 Sep 2012 18:47:25 GMT
> The PowerDrill paper also mentions a variant of this where each query
> fragment is sent to two machines, and the results for that fragment are
> used from whatever machine responds first.


Sending each query or request twice would increase cluster load by 100%.
In his paper Jeff talks about sending the backup request only after a
certain delay; in his first example the delay is around 10 ms.
You can rather easily derive it from the data presented in the paper: 95% of
requests finish in 24 ms or less. This means that (if the distribution were
normal) there would be at least 5 stdev intervals inside the 95th percentile.
Thus 24 / 5 = 4.8 ms is the "normal" stdev interval. This makes it feasible
to delay at least 4.8 ms to see something meaningful, or better
2 * 4.8 = 9.6 ms, which reduces the number of backup requests to only about
5% of all requests.
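
For the record, the same arithmetic as a tiny Python snippet (the figures
are the ones quoted above, nothing new):

    # Back-of-envelope delay calculation from the figures above.
    p95_ms = 24.0           # 95% of requests finish within 24 ms (from the paper)
    sigma_ms = p95_ms / 5   # heuristic: ~5 stdev intervals inside the 95th percentile
    backup_delay_ms = 2 * sigma_ms
    print(sigma_ms, backup_delay_ms)   # 4.8 9.6 -- i.e. roughly the 10 ms delay
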
The most interesting thing comes after that: when they applied the 10 ms
delay to backup requests they got to a 14 ms average with a 4 ms stdev and
20 ms at the 95th percentile, which is almost exactly the theoretical result
of the calculations above. This means that almost certainly nobody can do
better than that, not even by sending every request twice up front.
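
To make the pattern concrete, here is a minimal sketch of such a delayed
backup request; fetch() and its latency model are made-up stand-ins for a
real RPC, not anything from the paper:

    import asyncio
    import random

    async def fetch(replica: int) -> str:
        # Stand-in for a real RPC; latency is simulated.
        await asyncio.sleep(random.expovariate(1 / 0.012))
        return "result from replica %d" % replica

    async def hedged_request(delay_s: float = 0.0096) -> str:
        # Send the primary request; only if it has not answered within
        # delay_s do we send the backup, then take whichever copy wins.
        primary = asyncio.create_task(fetch(0))
        done, _ = await asyncio.wait({primary}, timeout=delay_s)
        if done:
            return primary.result()
        backup = asyncio.create_task(fetch(1))
        done, pending = await asyncio.wait(
            {primary, backup}, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()   # cancel (or simply ignore) the slower copy
        return done.pop().result()

    print(asyncio.run(hedged_request()))

The latency model here is arbitrary; with the real distribution above, the
9.6 ms delay means only the slowest ~5% of primaries ever trigger a backup.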

Anyway, the paper deals with the "standard" web; what Drill has to achieve
is something different.
The main difference comes from the nature of web requests: they are all the
same. Even in a real-world scenario the workers are just clones of each
other; they run exactly the same logic and produce exactly the same result.
In the real-time query scenario the workers are not the same: they are
different executables, they query different parts of the dataset, and they
use different logic. The whole point of Drill is to give this capability to
the user, to make it possible to query any data with any executable.
Here it is not always feasible to restart a part of the query; maybe it is
better to restart it as a whole. Here the logic can change with each
request, so you cannot run the same logic each time and thus get
deterministic results. Some effort must be made to make sure that any logic
can be "bent" into a deterministic one, and then rolled back or applied in
a deterministic manner; otherwise you will never get out of this mess in
time.
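
As a toy illustration of that "bending", here is one way it could look;
the snapshot/seed scheme and all the names are my own assumptions, not
anything Drill specifies:

    import random

    def run_fragment(logic, snapshot, seed):
        # Pin the inputs and the randomness so a restarted fragment
        # (or its backup copy) produces the identical result and can
        # safely be rolled back or applied twice.
        rng = random.Random(seed)
        return logic(snapshot, rng)

    # A fragment that averages 3 sampled rows; deterministic given the seed,
    # so the coordinator can freely restart it or ignore duplicate results.
    sample_avg = lambda rows, rng: sum(rng.sample(rows, 3)) / 3.0
    rows = list(range(100))
    assert run_fragment(sample_avg, rows, 42) == run_fragment(sample_avg, rows, 42)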

On Wed, Sep 12, 2012 at 4:54 AM, Jason Frantz <jfrantz@maprtech.com> wrote:

> Definitely agree with many of the points in the link.
>
> The PowerDrill paper also mentions a variant of this where each query
> fragment is sent to two machines, and the results for that fragment are
> used from whatever machine responds first. So in that case it's not so much
> a "cancel" as an "ignore".
>
> On Tue, Sep 11, 2012 at 11:37 AM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > Headed into Thursday's meetup, this paper by Jeff Dean provides a very
> good
> > description of strategies for getting fast response times with variable
> > quality infrastructure.
> >
> > http://research.google.com/people/jeff/latency.html
> >
> > The key point here is that it is very important to have asynchronous
> > queries with a cancel.  Above that level, there needs to be a simple
> > strategy for pushing second versions of queries out to the workers and
> > canceling defunct or redundant queries.
> >
>
