hbase-dev mailing list archives

From "ryan rawson (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HBASE-1457) Taking down ROOT/META regionserver can result in cluster becoming in-operational
Date Sun, 31 May 2009 08:42:07 GMT

     [ https://issues.apache.org/jira/browse/HBASE-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ryan rawson updated HBASE-1457:
-------------------------------

    Attachment: HBASE-1457-v4.patch

The latest fix, including:
- move region historian writes into the todo queue
- make the todo queue a priority queue, putting higher priority items at the top
- ensure double assignment of ROOT/META can't happen
- prevent assignment bugs when the cluster is mis-loaded, and ensure ROOT/META get assigned
  as fast as possible to the first server (rather than to the best server, as was done previously)
-- assignment could get stuck when the 'best' server was unable to contact the master because
   ROOT/META was offline. Very ugly bug.
- reduce how much we retry in pending operations. Retries can delay recovery: if META/ROOT
  goes down while processing a TODO, the recovery of META/ROOT has to wait until the currently
  running pending operation times out. This could take over 5 minutes previously (!!). 1 second
  timeouts * 10 * 2-3 per commit() * 2 attempts takes a long time.
- fix a bug where, if ROOT was unavailable, some pending operations might fail and not get
  requeued.
- handle bugs where a server would go offline and 'forget' to mention that ROOT or META went
  offline, thus delaying reassignment. Now we force META/ROOT offline ASAP and get them reassigned
  as fast as possible on clean shutdown.
- improve unclean shutdown handling of META: instead of waiting for the ROOT scanner to
  detect a bad assignment and fix it, be more proactive and queue META for assignment once
  log split is finished. This can improve META recovery time by 5-10 seconds.
- fix a rare but deadly NPE in ProcessRegionOpen, and improve the handling of failed todo
  operations: instead of putting them back into the todo queue, put them into the delayed queue
  (since the NPE is a side effect of not having ROOT online yet).

Yes, all these bug fixes are incorporated in this relatively small patch (933 lines of diff).
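
To make the queueing changes above easier to picture, here is a minimal sketch of a priority todo queue with a separate delayed queue for operations that cannot run while ROOT is offline. It is not taken from the patch: the class TodoQueueSketch, the Priority levels, the needsRoot flag and every method name are hypothetical, and the real master code may be structured quite differently.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Queue;

// Illustrative sketch only; none of these names come from the HBase master code.
public class TodoQueueSketch {
    // Hypothetical priority levels; an earlier constant is served first.
    public enum Priority { ROOT, META, USER_REGION, HISTORIAN }

    // One pending operation. 'seq' preserves FIFO order within a priority level.
    public static final class Op {
        final Priority priority;
        final String name;
        final boolean needsRoot; // assumption: such an op fails while ROOT is offline
        final long seq;

        Op(Priority priority, String name, boolean needsRoot, long seq) {
            this.priority = priority;
            this.name = name;
            this.needsRoot = needsRoot;
            this.seq = seq;
        }
    }

    // Higher priority items sort to the head of the todo queue.
    private final PriorityQueue<Op> todo = new PriorityQueue<>(
            Comparator.comparing((Op op) -> op.priority).thenComparingLong(op -> op.seq));
    // Failed operations wait here instead of going straight back into 'todo'.
    private final Queue<Op> delayed = new ArrayDeque<>();
    private long nextSeq = 0;
    private boolean rootOnline = false;

    public void submit(Priority priority, String name, boolean needsRoot) {
        todo.add(new Op(priority, name, needsRoot, nextSeq++));
    }

    // Processes every queued operation once and returns the names that completed.
    public List<String> drain() {
        List<String> done = new ArrayList<>();
        Op op;
        while ((op = todo.poll()) != null) {
            if (op.needsRoot && !rootOnline) {
                delayed.add(op); // park it rather than spinning on retries
                continue;
            }
            if (op.priority == Priority.ROOT) {
                rootOnline = true; // pretend the ROOT assignment succeeded
            }
            done.add(op.name);
        }
        return done;
    }

    // Moves parked operations back into the todo queue, keeping their priority.
    public void requeueDelayed() {
        Op op;
        while ((op = delayed.poll()) != null) {
            todo.add(op);
        }
    }

    public int delayedCount() {
        return delayed.size();
    }

    public static void main(String[] args) {
        TodoQueueSketch q = new TodoQueueSketch();
        q.submit(Priority.USER_REGION, "open-region", true);
        System.out.println(q.drain() + " delayed=" + q.delayedCount());
        q.submit(Priority.HISTORIAN, "historian-write", false);
        q.submit(Priority.ROOT, "assign-root", false);
        q.requeueDelayed();
        System.out.println(q.drain() + " delayed=" + q.delayedCount());
    }
}
```

In this sketch a region open that needs ROOT is parked in the delayed queue on the first pass, and on the second pass the ROOT assignment runs ahead of both the parked operation and the historian write, even though it was submitted last.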



> Taking down ROOT/META regionserver can result in cluster becoming in-operational
> --------------------------------------------------------------------------------
>
>                 Key: HBASE-1457
>                 URL: https://issues.apache.org/jira/browse/HBASE-1457
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.0
>            Reporter: ryan rawson
>            Assignee: ryan rawson
>             Fix For: 0.20.0
>
>         Attachments: HBASE-1457-v2.patch, HBASE-1457-v3.patch, HBASE-1457-v4.patch, HBASE-1457.patch
>
>
> When a regionserver is taken down via controlled or uncontrolled shutdown, the master doesn't
> properly reassign the root/meta regions.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

