drill-dev mailing list archives

From Steven Phillips <ste...@dremio.com>
Subject Re: Failure Behavior
Date Tue, 29 Mar 2016 03:08:40 GMT
If a fragment has already begun execution and sent some data to downstream
fragments, there is no way to simply restart the failed fragment, because
we would also have to restart any downstream fragments that consumed that
output, and so on up the tree, as well as restart any leaf fragments that
fed into any of those fragments. This is because we don't store
intermediate results to disk.
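
The cascade described above can be sketched as a graph traversal. This is a hypothetical illustration, not Drill's actual planner code: the names `restart_set`, `downstream`, and `upstream_leaves` are invented for the example. Given which fragments consume each fragment's output, restarting a failed fragment forces restarting every transitive consumer, plus every leaf fragment that fed any of those consumers.

```python
# Hypothetical sketch (not Drill's planner API) of why one restart cascades.
# downstream[f] = set of fragments that consumed f's output;
# upstream_leaves[leaf] = set of fragments that leaf fed into.
from collections import deque

def restart_set(failed, downstream, upstream_leaves):
    """Return every fragment that must be re-run if `failed` is restarted."""
    must_restart = {failed}
    queue = deque([failed])
    while queue:
        f = queue.popleft()
        # any consumer of f's output already saw (now-invalid) data
        for consumer in downstream.get(f, ()):
            if consumer not in must_restart:
                must_restart.add(consumer)
                queue.append(consumer)
    # any leaf that fed a restarted fragment must replay its scan too,
    # because intermediate results are not stored to disk
    for leaf, fed in upstream_leaves.items():
        if fed & must_restart:
            must_restart.add(leaf)
    return must_restart
```

With a chain `f1 -> f2 -> root` fed by two scans, a failure in `f1` pulls the entire tree into the restart set, which is why the query is failed instead.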

The only case where I think it would even be possible is if a node died
before sending any data downstream. But I think the only way to be sure of
this would be to poll all of the downstream fragments and verify that no
data from the failed fragment was ever received. I think this would add a
lot of complication and overhead to Drill.
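
The heartbeat-and-retry scheme John proposes below could be sketched as follows. This is purely illustrative and not something Drill implements; the class name, the 10-second timeout, and the retry limit are assumptions taken from his example.

```python
# Hypothetical sketch of a foreman-side heartbeat tracker (not Drill code):
# each running fragment's bit reports a heartbeat; once a fragment is silent
# past the timeout, it is rescheduled on another node up to a retry limit.
import time

TIMEOUT_SECS = 10   # assumed timeout from John's example below
MAX_RETRIES = 3     # "up to X times configured by a setting"

class ForemanRetryTracker:
    def __init__(self, fragments, nodes):
        now = time.monotonic()
        self.last_beat = {f: now for f in fragments}
        self.retries = {f: 0 for f in fragments}
        self.nodes = nodes

    def heartbeat(self, fragment):
        """Called when a bit reports progress for a fragment."""
        self.last_beat[fragment] = time.monotonic()

    def check(self, now=None):
        """Return (fragment, new_node) pairs to reschedule; raise on exhaustion."""
        now = time.monotonic() if now is None else now
        to_reschedule = []
        for f, beat in self.last_beat.items():
            if now - beat > TIMEOUT_SECS:
                if self.retries[f] >= MAX_RETRIES:
                    raise RuntimeError(f"fragment {f} failed after {MAX_RETRIES} retries")
                self.retries[f] += 1
                self.last_beat[f] = now
                # naive placement: rotate to a different node
                to_reschedule.append((f, self.nodes[self.retries[f] % len(self.nodes)]))
        return to_reschedule
```

The hard part, per the paragraphs above, is not this bookkeeping but the fact that a rescheduled fragment's downstream consumers may already hold its partial output, so a safe retry would first need the expensive restart cascade.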

On Sat, Mar 26, 2016 at 10:03 AM, John Omernik <john@omernik.com> wrote:

> Thanks for the responses. So even if the drillbit that died wasn't the
> foreman, the query would fail? Interesting... Is there any mechanism for
> reassigning fragments, to *try harder* so to speak? Does this also apply
> if something on a node caused a fragment to fail: could it be retried
> somewhere else? I am not trying to recreate MapReduce in Drill (although
> I am sort of asking about similar features), but in a distributed
> environment, what is the cost of letting the foreman time out a fragment
> and try it again elsewhere? Say the bits running a fragment sent back a
> heartbeat, and if the heartbeat stopped and no results arrived for 10
> seconds, the foreman would try again somewhere else (up to X times,
> configured by a setting). I am just curious, for my own knowledge, what
> makes that hard in a system like Drill.
>
> On Sat, Mar 26, 2016 at 10:47 AM, Abdel Hakim Deneche <
> adeneche@maprtech.com> wrote:
>
> > The only case where the query could succeed is if all fragments that
> > were running on the now-dead node had already finished. Other than
> > that, the query fails.
> >
> > On Sat, Mar 26, 2016 at 4:45 PM, Neeraja Rentachintala <
> > nrentachintala@maprtech.com> wrote:
> >
> > > As far as I know, there is no failure handling in Drill. The query
> > > dies.
> > >
> > > On Sat, Mar 26, 2016 at 7:52 AM, John Omernik <john@omernik.com>
> > > wrote:
> > >
> > > > With distributed Drill, what is the expected/desired behavior when
> > > > a bit fails? I.e., if a query is running and certain fragments end
> > > > up on a node whose bit is in a flaky state (or suddenly dies), what
> > > > is the desired and actual behavior of the query? I am guessing that
> > > > if the bit was the foreman, the query dies; I guess that's
> > > > unavoidable. But if it's just a worker, does the foreman detect
> > > > this and reschedule the fragment, or does the query die anyway?
> > > >
> > > > John
> > > >
> > >
> >
> >
> >
> > --
> >
> > Abdelhakim Deneche
> >
> > Software Engineer
> >
> >   <http://www.mapr.com/>
> >
> >
>
