hbase-dev mailing list archives

From Josh Elser <els...@apache.org>
Subject Re: [DISCUSS] Re: Replication resiliency
Date Thu, 26 Jan 2017 22:19:32 GMT
+1 If any "worker" thread can't safely/reasonably retry some unexpected 
exception without a reasonable expectation of self-healing, tank the RS.

Having those threads die while the RS keeps running could go unnoticed 
for an indefinite period of time.
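
As a rough illustration of the "tank the RS" idea, here is a minimal 
sketch using the JDK's Thread.UncaughtExceptionHandler. The Aborter 
interface and the startWorker helper are hypothetical names made up for 
this example; they are not the actual HBase replication code, and a real 
patch would hook into whatever abort mechanism the RegionServer already 
exposes.

```java
import java.util.concurrent.atomic.AtomicReference;

public class AbortOnWorkerDeath {

    /** Hypothetical stand-in for the server's abort hook (an assumption, not HBase's API). */
    public interface Aborter {
        void abort(String why, Throwable cause);
    }

    /**
     * Starts a worker thread whose uncaught exceptions are escalated to the
     * aborter instead of only being logged while the thread dies quietly.
     */
    public static Thread startWorker(String name, Runnable body, Aborter aborter) {
        Thread t = new Thread(body, name);
        // The handler runs on the dying thread, before it terminates.
        t.setUncaughtExceptionHandler((thread, e) ->
            aborter.abort("Unexpected exception in " + thread.getName(), e));
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        // For the demo, record the abort reason rather than stopping the JVM.
        AtomicReference<String> reason = new AtomicReference<>();
        Aborter recording = (why, cause) -> reason.set(why + ": " + cause);

        Thread worker = startWorker("worker-1",
            () -> { throw new IllegalStateException("boom"); }, recording);
        worker.join();

        // Prints: Unexpected exception in worker-1: java.lang.IllegalStateException: boom
        System.out.println(reason.get());
    }
}
```

The handler only sees exceptions that escape run(), so anything the 
worker can reasonably retry would still be caught and handled inside 
the loop; only the truly unexpected cases reach the abort path.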

Sean Busbey wrote:
> I've noticed a few other places where we can lose a worker thread and
> the RegionServer happily continues. One notable example is the worker
> threads that handle syncs for the WAL. I'm generally a
> fail-fast-and-loud advocate, so I like aborting when things look
> wonky. I've also had to deal with a lot of pain around silent and thus
> hard to see replication failures, so strong signals that the
> replication system is in a bad way sound good to me atm.
> Do we have worker threads that we can't safely continue without
> indefinitely? Can we solve the general problem of "unhandled exception
> in threads cause a RS Abort"?
> As mentioned on the jira, I do worry a bit about cluster stability and
> cascading failures, given the ability to have user-provided endpoints
> in the replication process. Ultimately, I don't see that as different
> than all the other places coprocessors can put the cluster at risk.
> On Thu, Jan 26, 2017 at 2:48 PM, Sean Busbey<busbey@apache.org>  wrote:
>> (edited subject to ensure folks filtering for DISCUSS see this)
>> On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling<ghelmling@gmail.com>  wrote:
>>> Over in HBASE-17381 there has been some discussion around whether an
>>> unhandled exception in a ReplicationSourceWorkerThread should trigger a
>>> regionserver abort.
>>> The current behavior in the case of an unexpected exception in
>>> ReplicationSourceWorkerThread.run() is to log a message and simply let the
>>> thread die, allowing replication for this source to back up.
>>> I've seen this happen in an OOME scenario, which seems like a clear case
>>> where we would be better off aborting the regionserver.
>>> However, in the case of any other unexpected exceptions out of the run()
>>> method, how do we want to handle this?
>>> I'm of the general opinion that we would be better off aborting on
>>> all unexpected exceptions, as it means that:
>>> a) we have some missing error handling
>>> b) failing fast raises visibility and makes it easier to add any error
>>> handling that should be there
>>> c) silently stopping up replication creates problems that are difficult for
>>> our users to identify operationally and hard to troubleshoot.
>>> However, the current behavior has been there for quite a while, and maybe
>>> there are other situations or concerns I'm not seeing which would justify
>>> having regionserver stability over replication stability.
>>> What are folks' thoughts on this?  Should the regionserver abort on all
>>> unexpected exceptions in the run method or should we more narrowly focus
>>> this on OOME's?
