hbase-dev mailing list archives

From Andrew Purtell <andrew.purt...@gmail.com>
Subject Re: [DISCUSS] Re: Replication resiliency
Date Fri, 27 Jan 2017 21:13:33 GMT
FWIW, there is an old JIRA somewhere about using Error Prone (https://github.com/google/error-prone)
as a framework for implementing static code analysis checks like that.

> On Jan 27, 2017, at 1:03 PM, Sean Busbey <busbey@apache.org> wrote:
> Josh, probably worth checking if a grep or something else we can do in
> precommit could catch this.
>> On Fri, Jan 27, 2017 at 1:26 PM, Josh Elser <elserj@apache.org> wrote:
>> Cool.
>> Let me open an issue to scan the codebase to see if we can find any
>> instances where we are starting threads which aren't using the UEH.
>> Andrew Purtell wrote:
>>> Agreed, let's abort with an abundance of caution. That is the _least_ that
>>> should be done when a thread dies unexpectedly. Maybe we can improve
>>> resiliency for specific cases later.
>>> On Jan 26, 2017, at 5:53 PM, Enis Söztutar<enis.soz@gmail.com>  wrote:
>>>>> Do we have worker threads that we can't safely continue without
>>>>> indefinitely? Can we solve the general problem of "unhandled exception
>>>>> in threads cause a RS Abort"?
>>>> We have this already in the code base. We are injecting an
>>>> UncaughtExceptionHandler (which calls Abortable.abort) into almost all
>>>> of the HRegionServer service threads (see HRS.startServiceThreads). But
>>>> I've also seen cases where some important thread got unmarked. I think it
>>>> is a good idea to revisit that and make sure that all the threads are
>>>> injected with the UEH.
>>>> The replication source threads are started on demand; that is why the UEH
>>>> is not injected, I think. But agreed that we should take the safe route
>>>> here and abort the regionserver.
>>>> Enis
>>>>> On Thu, Jan 26, 2017 at 2:19 PM, Josh Elser<elserj@apache.org> wrote:
>>>>> +1 If any "worker" thread can't safely/reasonably retry some unexpected
>>>>> exception without a reasonable expectation of self-healing, tank the RS.
>>>>> Having those threads die but not the RS could go uncaught for an
>>>>> indefinite period of time.
>>>>> Sean Busbey wrote:
>>>>>> I've noticed a few other places where we can lose a worker thread and
>>>>>> the RegionServer happily continues. One notable example is the worker
>>>>>> threads that handle syncs for the WAL. I'm generally a
>>>>>> fail-fast-and-loud advocate, so I like aborting when things look
>>>>>> wonky. I've also had to deal with a lot of pain around silent and
>>>>>> hard to see replication failures, so strong signals that the
>>>>>> replication system is in a bad way sound good to me atm.
>>>>>> Do we have worker threads that we can't safely continue without
>>>>>> indefinitely? Can we solve the general problem of "unhandled exception
>>>>>> in threads cause a RS Abort"?
>>>>>> As mentioned on the jira, I do worry a bit about cluster stability and
>>>>>> cascading failures, given the ability to have user-provided endpoints
>>>>>> in the replication process. Ultimately, I don't see that as different
>>>>>> than all the other places coprocessors can put the cluster at risk.
>>>>>>> On Thu, Jan 26, 2017 at 2:48 PM, Sean Busbey<busbey@apache.org>
>>>>>>> wrote:
>>>>>>> (edited subject to ensure folks filtering for DISCUSS see this)
>>>>>>> On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling<ghelmling@gmail.com>
>>>>>>> wrote:
>>>>>>>> Over in HBASE-17381 there has been some discussion around whether an
>>>>>>>> unhandled exception in a ReplicationSourceWorkerThread should trigger
>>>>>>>> a regionserver abort.
>>>>>>>> The current behavior in the case of an unexpected exception in
>>>>>>>> ReplicationSourceWorkerThread.run() is to log a message and let the
>>>>>>>> thread die, allowing replication for this source to back up.
>>>>>>>> I've seen this happen in an OOME scenario, which seems like a clear
>>>>>>>> case where we would be better off aborting the regionserver.
>>>>>>>> However, in the case of any other unexpected exceptions out of the
>>>>>>>> run() method, how do we want to handle this?
>>>>>>>> I'm of the general opinion that we would be better off aborting on
>>>>>>>> all unexpected exceptions, as it means that:
>>>>>>>> a) we have some missing error handling
>>>>>>>> b) failing fast raises visibility and makes it easier to add any
>>>>>>>> error handling that should be there
>>>>>>>> c) silently stopping up replication creates problems that are
>>>>>>>> difficult for our users to identify operationally and hard to
>>>>>>>> troubleshoot.
>>>>>>>> However, the current behavior has been there for quite a while, and
>>>>>>>> maybe there are other situations or concerns I'm not seeing which
>>>>>>>> would justify having regionserver stability over replication
>>>>>>>> stability.
>>>>>>>> What are folks' thoughts on this? Should the regionserver abort on
>>>>>>>> all unexpected exceptions in the run method, or should we more
>>>>>>>> narrowly focus this on OOME's?
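
For readers following the thread, here is a minimal sketch of the pattern Enis describes: attach an UncaughtExceptionHandler that escalates to an abort before the thread is started, so that threads created on demand are covered too. The class UehSketch, the helper startWorker, and the nested Abortable interface are hypothetical stand-ins written for illustration; they are not the HBase classes, and the real HRegionServer wiring may differ.

```java
import java.util.concurrent.atomic.AtomicReference;

public class UehSketch {
  /** Simplified, hypothetical stand-in for an abortable server; not the HBase interface. */
  interface Abortable {
    void abort(String why, Throwable cause);
  }

  /** Builds a handler that escalates any uncaught exception to an abort. */
  static Thread.UncaughtExceptionHandler abortingHandler(Abortable server) {
    return (thread, error) ->
        server.abort("Unexpected exception in " + thread.getName(), error);
  }

  /** Creates a worker, attaches the handler, and only then starts the thread. */
  static Thread startWorker(String name, Runnable body, Abortable server) {
    Thread t = new Thread(body, name);
    t.setUncaughtExceptionHandler(abortingHandler(server));
    t.setDaemon(true);
    t.start();
    return t;
  }

  public static void main(String[] args) throws InterruptedException {
    // A fake server that records the abort reason instead of shutting anything down.
    AtomicReference<String> abortReason = new AtomicReference<>();
    Abortable fakeServer = (why, cause) -> abortReason.set(why + ": " + cause);

    // The worker dies with an exception nobody catches; the handler reports it.
    Thread worker = startWorker("replication-worker-demo",
        () -> { throw new IllegalStateException("simulated failure"); }, fakeServer);
    worker.join();
    System.out.println(abortReason.get());
  }
}
```

Routing every thread start through one helper like this is also what would make a grep or static-analysis check practical: any direct call to new Thread(...).start() outside the helper becomes a candidate for review.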
