hbase-dev mailing list archives

From "Andrew Kyle Purtell (Jira)" <j...@apache.org>
Subject [jira] [Created] (HBASE-25212) Optionally abort requests in progress after deciding a region should close
Date Thu, 22 Oct 2020 00:45:00 GMT
Andrew Kyle Purtell created HBASE-25212:

             Summary: Optionally abort requests in progress after deciding a region should close
                 Key: HBASE-25212
                 URL: https://issues.apache.org/jira/browse/HBASE-25212
             Project: HBase
          Issue Type: Improvement
          Components: regionserver
            Reporter: Andrew Kyle Purtell
            Assignee: Andrew Kyle Purtell
             Fix For: 3.0.0-alpha-1, 1.7.0, 2.4.0

After deciding a region should be closed, the regionserver will set the internal region state
to closing and wait for all pending requests to complete, via a rendezvous on the region lock.
In the closing state the region will not accept any new requests, but requests in progress will
be allowed to complete before the close action takes place. In our production environment we see
outlier wait times on this lock in excess of several minutes.
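
To make the rendezvous concrete, here is a minimal sketch of the pattern described above, using a
plain java.util.concurrent read/write lock. The class and method names (RegionCloseSketch,
handleRequest, NotServingException) are hypothetical stand-ins for illustration and are not the
actual HRegion code. Requests hold the shared lock while they run, and close() waits for the
exclusive lock, so a single slow request delays the close for as long as it runs.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a region; illustrates the close rendezvous only. */
class RegionCloseSketch {
    /** Stand-in for NotServingRegionException. */
    static class NotServingException extends RuntimeException {
        NotServingException(String msg) { super(msg); }
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private volatile boolean closing = false;
    private volatile boolean closed = false;

    /** Runs a request under the shared (read) lock; rejects it once closing is set. */
    void handleRequest(Runnable op) {
        if (closing) {
            throw new NotServingException("region is closing");
        }
        lock.readLock().lock();
        try {
            // Re-check: close() may have started while we waited for the lock.
            if (closing) {
                throw new NotServingException("region is closing");
            }
            op.run();
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Marks the region closing, then blocks until every in-flight request is done. */
    void close() {
        closing = true;
        lock.writeLock().lock(); // the rendezvous: waits for all read-lock holders
        try {
            closed = true; // the memstore flush and remaining close steps would go here
        } finally {
            lock.writeLock().unlock();
        }
    }

    boolean isClosed() { return closed; }
}
```

In this sketch the time close() spends blocked on writeLock().lock() is unbounded, which is the
outlier wait described above.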

During close, when there are requests in flight, the regionserver is subject to any conceivable
reason for delay, like full scans over large regions, expensive filtering hierarchies, bugs,
or store level performance problems like slow HDFS. On an opt-in basis, the regionserver should
interrupt requests in progress to shorten close times.

Optionally, via configuration parameter -- which would be a system wide default set in hbase-site.xml
in common practice but could be overridden in table schema for per table settings -- interrupt
requests in progress holding the region lock rather than wait for completion of all operations
in flight. Send back NotServingRegionException("region is closing") to the clients of the
interrupted operations, like we do after the write lock is acquired. The client will transparently
relocate the region data and resubmit the aborted requests per normal retry policy. This can
be less disruptive than waiting for very long times for a region to close in extreme outlier
cases (e.g. 50 minutes).
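
A rough sketch of how the opt-in behavior could look, under stated assumptions: the class
InterruptibleRegionSketch, the Operation interface, the configuration key
"example.region.close.abort.enabled", and the 50 ms retry interval are all hypothetical and chosen
only for illustration. The idea is to track the threads that hold the shared lock, interrupt them
when close begins, and turn the resulting InterruptedException into a "region is closing" error
that the client can retry. This only helps for operations that respond to interruption.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch of interrupting in-flight requests at close; not the HBase code. */
class InterruptibleRegionSketch {
    /** Stand-in for NotServingRegionException. */
    static class NotServingException extends RuntimeException {
        NotServingException(String msg) { super(msg); }
    }

    /** A request body that may be interrupted while it runs. */
    interface Operation { void run() throws InterruptedException; }

    /** Hypothetical key name, used for illustration only. */
    static final String ABORT_KEY = "example.region.close.abort.enabled";

    /** A table-level setting, if present, overrides the site-wide default. */
    static boolean resolveAbortOnClose(Map<String, String> site, Map<String, String> table) {
        String v = table.containsKey(ABORT_KEY) ? table.get(ABORT_KEY) : site.get(ABORT_KEY);
        return Boolean.parseBoolean(v); // null (unset) parses as false, so the feature is opt-in
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Set<Thread> inFlight = ConcurrentHashMap.newKeySet();
    private final boolean abortOnClose;
    private volatile boolean closing = false;
    private volatile boolean closed = false;

    InterruptibleRegionSketch(boolean abortOnClose) { this.abortOnClose = abortOnClose; }

    void handleRequest(Operation op) {
        if (closing) throw new NotServingException("region is closing");
        lock.readLock().lock();
        inFlight.add(Thread.currentThread()); // register before re-checking the flag
        try {
            if (closing) throw new NotServingException("region is closing");
            op.run();
        } catch (InterruptedException e) {
            if (closing) throw new NotServingException("region is closing");
            Thread.currentThread().interrupt(); // not ours: restore the flag
            throw new IllegalStateException("interrupted", e);
        } finally {
            inFlight.remove(Thread.currentThread());
            lock.readLock().unlock();
        }
    }

    void close() throws InterruptedException {
        closing = true;
        do {
            // Re-interrupt on every pass in case a request was registered late.
            if (abortOnClose) inFlight.forEach(Thread::interrupt);
        } while (!lock.writeLock().tryLock(50, TimeUnit.MILLISECONDS));
        try {
            closed = true; // the flush and the rest of the close would follow here
        } finally {
            lock.writeLock().unlock();
        }
    }

    boolean isClosed() { return closed; }
}
```

With the flag off, close() in this sketch simply keeps retrying the exclusive lock and behaves like
the existing wait-for-completion path.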

After waiting for all requests to complete, we flush the region's memstore and finish
the close. The flush portion of the close process is out of scope of this proposal. Under
normal conditions the flush portion of the close completes quickly. It is specifically the waits
on the close lock that have been an occasional issue in our production environment, making it
difficult to achieve 99.99% availability.

This message was sent by Atlassian Jira
