hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sean Busbey <bus...@apache.org>
Subject Re: [DISCUSS] Re: Replication resiliency
Date Fri, 27 Jan 2017 21:03:33 GMT
Josh, probably worth checking if a grep or something else we can do in
precommit could catch this.

On Fri, Jan 27, 2017 at 1:26 PM, Josh Elser <elserj@apache.org> wrote:
> Cool.
> Let me open an issue to scan the codebase to see if we can find any
> instances where we are starting threads which aren't using the UEH.
> Andrew Purtell wrote:
>> Agreed, let's abort with an abundance of caution. That is the _least_ that
>> should be done when a thread dies unexpectedly. Maybe we can improve
>> resiliency for specific cases later.
>> On Jan 26, 2017, at 5:53 PM, Enis Söztutar<enis.soz@gmail.com>  wrote:
>>>> Do we have worker threads that we can't safely continue without
>>> indefinitely? Can we solve the general problem of "unhandled exception
>>> in threads cause a RS Abort"?
>>> We have this already in the code base. We are injecting an
>>> UncaughtExceptionhandler (which is calling Abortable.abort) to almost all
>>> of the HRegionServer service threads (see HRS.startServiceThreads). But
>>> I've also seen cases where some important thread got unmarked. I think it
>>> is good idea to revisit that and make sure that all the threads are
>>> injected with the UEH.
>>> The replication source threads are started on demand, that is why the UEH
>>> is not injected I think. But agreed that we should do the safe route
>>> here,
>>> and abort the regionserver.
>>> Enis
>>>> On Thu, Jan 26, 2017 at 2:19 PM, Josh Elser<elserj@apache.org>  wrote:
>>>> +1 If any "worker" thread can't safely/reasonably retry some unexpected
>>>> exception without a reasonable expectation of self-healing, tank the RS.
>>>> Having those threads die but not the RS could go uncaught for indefinite
>>>> period of time.
>>>> Sean Busbey wrote:
>>>>> I've noticed a few other places where we can lose a worker thread and
>>>>> the RegionServer happily continues. One notable example is the worker
>>>>> threads that handle syncs for the WAL. I'm generally a
>>>>> fail-fast-and-loud advocate, so I like aborting when things look
>>>>> wonky. I've also had to deal with a lot of pain around silent and thus
>>>>> hard to see replication failures, so strong signals that the
>>>>> replication system is in a bad way sound good to me atm.
>>>>> Do we have worker threads that we can't safely continue without
>>>>> indefinitely? Can we solve the general problem of "unhandled exception
>>>>> in threads cause a RS Abort"?
>>>>> As mentioned on the jira, I do worry a bit about cluster stability and
>>>>> cascading failures, given the ability to have user-provided endpoints
>>>>> in the replication process. Ultimately, I don't see that as different
>>>>> than all the other places coprocessors can put the cluster at risk.
>>>>>> On Thu, Jan 26, 2017 at 2:48 PM, Sean Busbey<busbey@apache.org>
>>>>>> wrote:
>>>>>> (edited subject to ensure folks filtering for DISCUSS see this)
>>>>>> On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling<ghelmling@gmail.com>
>>>>>> wrote:
>>>>>>> Over in HBASE-17381 there has been some discussion around whether
>>>>>>> unhandled exception in a ReplicationSourceWorkerThread should
>>>>>>> a
>>>>>>> regionserver abort.
>>>>>>> The current behavior in the case of an unexpected exception in
>>>>>>> ReplicationSourceWorkerThread.run() is to log a message and simply
>>>>>>> let
>>>>>>> the
>>>>>>> thread die, allowing replication for this source to back up.
>>>>>>> I've seen this happen in an OOME scenario, which seems like a
>>>>>>> case
>>>>>>> where we would be better off aborting the regionserver.
>>>>>>> However, in the case of any other unexpected exceptions out of
>>>>>>> run()
>>>>>>> method, how do we want to handle this?
>>>>>>> I'm of the general opinion that where we would be better off
>>>>>>> on
>>>>>>> all unexpected exceptions, as it means that:
>>>>>>> a) we have some missing error handling
>>>>>>> b) failing fast raises visibility and makes it easier to add
>>>>>>> error
>>>>>>> handling that should be there
>>>>>>> c) silently stopping up replication creates problems that are
>>>>>>> difficult
>>>>>>> for
>>>>>>> our users to identify operationally and hard to troubleshoot.
>>>>>>> However, the current behavior has been there for quite a while,
>>>>>>> maybe
>>>>>>> there are other situations or concerns I'm not seeing which would
>>>>>>> justify
>>>>>>> having regionserver stability over replication stability.
>>>>>>> What are folks thoughts on this?  Should the regionserver abort
>>>>>>> all
>>>>>>> unexpected exceptions in the run method or should we more narrowly
>>>>>>> focus
>>>>>>> this on OOME's?

View raw message