hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hbase/MasterRewrite" by Misty
Date Thu, 22 Oct 2015 05:21:07 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hbase/MasterRewrite" page has been changed by Misty:
https://wiki.apache.org/hadoop/Hbase/MasterRewrite?action=diff&rev1=17&rev2=18

  
  '''MOST OF THE BELOW HAS BEEN DONE BY [[https://issues.apache.org/jira/browse/HBASE-2692|HBASE-2692]].
 OUTSTANDING IS MOVING SCHEMA UP TO ZK. -- St.Ack 20100901'''
  
- Initial Master Rewrite design came of conversations had at the hbase hackathon held at StumbleUpon,
August 5-7, 2009 ([[https://issues.apache.org/jira/secure/attachment/12418561/HBase+Hackathon+Notes+-+Sunday.pdf|Jon
Gray kept notes]]).  The umbrella issue for the master rewrite is [[https://issues.apache.org/jira/browse/HBASE-1816|HBASE-1816]].
 Timeline is hbase 0.21.0.
+ See also http://hbase.apache.org/book.html#_master.
  
- == Table of Contents ==
-  * [[#now|What does the Master do now?]]
-  * [[#problems|Problems with current Master]]
-  * [[#scope|Design Scope]]
-  * [[#design|Design]]
-   * [[#moveall|Move all state, state transitions, and schema to go via zookeeper]]
-    * [[#tablestate|Table State]]
-    * [[#regionstate|Region State]]
-    * [[#zklayout|Zookeeper layout]]
-   * [[#clean|Region State changes are clean, minimal, and comprehensive]]
-   * [[#balancer|Load Assignment/Balancer]]
-   * [[#root|Remove -ROOT-]]
-   * [[#root|Remove Heartbeat]]
-   * [[#root|Remove Safe Mode]]
-   * [[#intermediary|Further remove Master as necessary intermediary]]
-  * [[#misc|Miscellaneous]]
- 
- <<Anchor(now)>>
- 
- == What does the Master do now? ==
- Here's a bit of a refresher on what Master currently does:
- 
-  * Region Assignment
-   * On startup and balancing as regions are created and deleted as regions grow and split.
-  * Scan root/meta
-   * Make sure Regions are online
-   * Delete parents if no reference
-  * Manage schema alter/online/offline
-  * Admin
-   * Distributes out administered close, flush, compact messages
-  * Watches ZK for its own lease and for regionservers so knows when to run recovery
- 
- After implementation of this design, master will do all of above except manage schema and
distribute out messages to close, flush, etc.  Any client can do the later by manipulating
zk (we can add acl checks later).  Remaining master tasks will be less prone to error and
run snappier because no longer based on messaging carried atop periodic heartbeats from regionservers.
- 
- <<Anchor(problems)>>
- 
- == Problems with current Master ==
- There is a list in the [[https://issues.apache.org/jira/secure/ManageLinks.jspa?id=12434794|Issue
Links]] section of HBASE-1816. <<Anchor(scope)>>
- 
- == Design Scope ==
-  1. Rewrite of Master is for HBase 0.21
-  1. Design for:
-   1. A cluster of 1k regionservers.
-   1. Each regionserver carries 100 regions of 1G each (100k regions =~ 1-200TB)
- 
- <<Anchor(design)>>
- == Design ==
- 
- <<Anchor(moveall)>>
- === Move all state, state transitions, and schema to go via zookeeper ===
- 
- '''STATE TRANSITIONS VIA ZK IS DONE: See [[https://issues.apache.org/jira/browse/HBASE-2692|HBASE-2692]]'''
- Currently state transitions are done inside master shuffling between Maps triggered by messages
carried on the back of regionserver heartbeats.  Move all to zookeeper.
- 
- (Patrick Hunt and Mahadev have been helping with the below via [[http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases|HBase
Zookeeper Use Cases]])
- 
- <<Anchor(tablestate)>>
- ==== Table State ====
- Tables are offlined, onlined, made read-only, and dropped (Add freeze of flushes and compactions
state to facilitate snapshotting).  Currently HBase Master does this by messaging regionservers.
 Instead move state to zookeeper.  Let regionservers watch for changes and react.  Allow that
a cluster may have up to 100 tables.  Tables are made of regions.  There may be thousands
of regions per table.  A regionserver could be carrying a region from each of the 100 tables.
- 
- Tables have schema.  Tables are made of column families.  Column families have schema/attributes.
 Column families can be added and removed.  Currently the schema is written into a column
in the .META. catalog family.  Move all schema to zookeeper.  Regionservers could have schema
watchers and react to schema changes.
- 
- ===== Design =====
- 
- In a tables directory up in zk, have a znode per table as per [[http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases#case1|phunt's
suggestion]].  The znode will be named for the table.  In each table's znode keep state attributes
-- read-only, no-flush -- and the tables' schema (all in JSON).  Only carry the differences
from the default default up in zk to save on amount of data that needs to be passed.  Let
all regionservers watch all table znodes reacting if changes spinning through their list of
regions making reconciliation with current state of a tables' znode content.
- 
- 
- <<Anchor(zklayout)>>
- 
- ====== zk layout ======
- {{{
- /hbase/tables/table1 {JSON object would have state and schema objects, etc.  State is read-only,
offline, etc.  Schema has differences from default only}
- /hbase/tables/table2
- }}}
- 
- 
- ====== Other considerations? ======
- 
- A state per region?
- 
- Should we rather have just one file with all table schemas and states in it?  Easier to
deal with?  Patrick warns that it could bump the 1MB zk znode content limit and that it could
slow zk having to shuttle near-1MB of table schema+state on every little edit.
- 
- Patrick suggests that we have a alltables znode adjacent to the tables directory of znodes
and in here we'd have state for all tables.  This is state in two places so will leave aside
unless really needed.
- 
- <<Anchor(regionstate)>>
- ==== Region State ====
- 
- Run region state transitions -- i.e. ''opening'', ''closing'' -- by changing state in zookeeper
rather than in Master Maps as is currently done.
- 
- Keep up a region transition trail; regions move through states from ''unassigned'' to ''opening''
to ''open'', etc.  A region can't jump states as in going from ''unassigned'' to ''open''.
- 
- Part of transition involves moving a region under a regionserver.
- 
- Master (or client) moves regions between states.  Watchers on RegionServers notice changes
and act.  Master (or client) can do transitions in bulk; e.g. assign a regionserver 50 regions
to open on startup.  Effect is that Master "pushes" work out to regionservers rather than
wait on them to heartbeat, the way we currently assign regions.
- 
- A problem we have in current system (<= hbase 0.20.x) is that states do not make a circle.
 Once a region is open, master stops keeping account of a regions' state; region state is
now kept out in the .META. catalog table with its condition checked periodically by .META.
table scan.  State spanning two systems currently makes for confusion and evil such as region
double assignment because there are race condition potholes as we move from one system --
internal state maps in master -- to the other during update to state in .META.
- 
- Current thinking is to keep region lifecycle all up in zookeeper but that won't scale. 
Postulate 100k regions -- 100TB at 1G regions -- each with two or three possible states each
with watchers for state change.  My guess is that this is too much to put in zk (Mahadev+Patrick
say no if data is small).  TODO: how to manage transition from zk to .META.?  Also, can't
do getClosest up in zk, only in .META.
- 
- ===== Design =====
- Here is [[http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases#case2|Patrick's suggestion]].
 We already keep a znode per regionserver though its named for the regionservers startcode
-- see the 'rs' directory in 0.20.x zookeepers.  On evaporation of the regionserver ephemeral
node, master would run a reconciliation (or on assumption of master roll, new master would
check state in zk making sure a regionserver per region, etc.).
- 
- All regions would be listed in .META. table always.  Whether they are online, splitting
or closing, etc., would be up in zk.  So, figuring if something is unassigned would be case
of a .META. table scan.  Anything not managed by zk, needs to be added in there (assigned).
- 
- ====== zk layout ======
- Here is some cleanup of [[http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases#case2|Patrick's
suggestion]]
- 
- {{{
- # First, redo the current 'rs' directory slightly:
- /hbase/regionservers # master watches /regionservers for any child changes
- /hbase/regionserver/<host:port:startcode> = <status> # As each region server
becomes available to do work (or track state if up but not avail) it creates an ephemeral
node; writes state (up/down).
- # Master watches all /regionserver/<host:port:startcode> and cleans up if RS goes
away or changes status
- 
- # Now, for regions
- /hbase/regions/<regionserver by host:port:startcode> # Gets created when master notices
new region server
- # RS host:port watches this node for any child changes 
- 
- /hbase/regions/<regionserver by host:port:startcode>/<regionXYZ> # znode for
each region assigned to RS host:port.
- # RS host:port watches this node in case reassigned by master, or region changes state 
- 
- /tables/<regionserver by host:port:startcode>/<regionXYZ>/<state>-<seq#>
# znode created by master
- # seq ensures order seen by RS
- # RS deletes old state znodes as it transitions out, oldest entry is the current state,
always 1 or more znode here -- the current state 
- }}}
- 
- ====== Outstanding ======
- 
- In current o.a.h.h.master.RegionManager.RegionState inner class, here are possible states:
- {{{
- private volatile boolean unassigned = false;
- private volatile boolean pendingOpen = false;
- private volatile boolean open = false;
- private volatile boolean closing = false;
- private volatile boolean pendingClose = false;
- private volatile boolean closed = false;
- private volatile boolean offlined = false;
- }}}
- 
- Are all needed?  What would we add?
- 
- <<Anchor(clean)>>
- 
- === Region State changes are clean, minimal, and comprehensive ===
- Currently, moving a region from ''opening'' to ''open'' may involve a region compaction
-- i.e. a change to content in filesystem.  Better if modification of filesystem content was
done when no question of ownership involved.
- 
- <<Anchor(balancer)>>
- 
- === Region Assignment/Balancer ===
- Make it so don't need to put up a cluster to test balancing algorithm.
- 
- Assignment / balancing
- 
-  * RS publish load into ZK
-   * /hbase/rsload/STARTCODE/load({‘json’:’load’})
-   * Configure period it is refreshed
-  * Assignment inputs
-   * Load
-   * Requests / sec
-   * regions online
-  * Distribute tables randomly across cluster
-   * Never give table back to who unassigned it
-   * During split, bottom half to the same server, top half reassigned
-  * Assignment Queue
-   * Candidate Queue
-    * Master watches ZK candidate queue /hbase/rsassign/region(last_owned_by_rs)
-    * When new nodes come in, it assigns the out Regionservers put regions into the candidate
queue when they unassign/close
-  * To Open Queue
-   * Regionservers watch their own to open queues /hbase/rsopen/region(extra_info, which
hlogs to replay or it’s a split, etc)
- 
- Safe-mode assignment
- 
-  * Collect all regions to assign
-  * Randomize and assign out in bulk, one msg per RS
-  * Region assignment is always
-   * Look at all regions to be assigned
-   * Make a single decision for the assignment of all of these
- 
- <<Anchor(root)>>
- 
- === Remove -ROOT- ===
- Remove -ROOT- from filesystem; have it only live up in zk (Possible now Region Historian
feature has been removed).
- 
- <<Anchor(heartbeat)>>
- 
- === Remove Heartbeat ===
- We don't need RegionServers pinging master every 3 seconds if zookeeper is intermediary.
- 
- <<Anchor(safemode)>>
- 
- === Remove Safe Mode ===
- Safe mode is broke.  It doesn't do anything.  Remove it.
- 
- <<Anchor(intermediary)>>
- 
- === Further remove Master as necessary intermediary ===
- Clients do not need to go via master administrating tables, changing schema or sending flush/compaction
commands, etc.  Clients should be able to write direct to regionserver or to zk.
- 
- <<Anchor(misc)>>
- 
- == Miscellaneous ==
- 

Mime
View raw message