hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "ZooKeeper/HBaseUseCases" by PatrickHunt
Date Mon, 16 Nov 2009 22:06:15 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "ZooKeeper/HBaseUseCases" page has been changed by PatrickHunt.
http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases?action=diff&rev1=11&rev2=12

--------------------------------------------------

  === Case 1 ===
  Summary: HBase Table State and Schema Changes
  
+ A table has a schema and state (online, read-only, etc.).  When we say thousands of RegionServers,
we're trying to give a sense of how many watchers we'll have on the znode that holds table
schemas and state.  When we say hundreds of tables, we're trying to give some sense of how
big the znode content will be... say 256 bytes of schema -- we'll only record differences from
the default to minimize what's up in zk -- and then state I see as being something like zk's
four-letter words, except that in this case they can be compounded.  So, 100s of tables X 1024
bytes of schema X (2 four-letter words each on average) at the outside makes for about a MB of
data that thousands of regionservers are watching.  That OK?
+ 
  Expected scale: Thousands of RegionServers watching, ready to react to changes, with about
100 tables, each of which can have 1 or 2 states and an involved schema
  
+ [MS] I was thinking one znode of state and schema.  RegionServers would all have a watch
on it.  100s of tables means that a schema change on any table would trigger watches on 1000s
of RegionServers.  That might be OK though because any RegionServer could be carrying a Region
from the edited table.
- [PDH] the link is very useful, let's do the math here so that it's clearer; correct me where
I get it wrong. So if I get this right:
-  * 100 tables
-  * each table may have 1000+ regions (let's say 1000 for this calculation)
-  * 1 region server will carry a region from each table (per the link)
-  * if I understand correctly, region servers don't "own" the region znode in the table,
they just watch the tables for the regions they carry
  
- [MS] Number of regions is not pertinent here.  What's relevant is that a table has a schema
and state (online, read-only, etc.).  When I say thousands of RegionServers, I'm trying to
give a sense of how many watchers we'll have on the znode that holds table schemas and state.
 When I say hundreds of tables, I'm trying to give some sense of how big the znode content
will be... say 256 bytes of schema -- we'll only record differences from the default to minimize
what's up in zk -- and then state I see as being something like zk's four-letter words, except
that in this case they can be compounded.  So, 100s of tables X 1024 bytes of schema X (2
four-letter words each on average) at the outside makes for about a MB of data that thousands
of regionservers are watching.  That OK?
+ [PDH] My original assumption was that each table has its own znode (and that would still be
my advice). In general you don't want to store very much data per znode - the reason being
that writes will slow down (think of this -- the client copies data to a ZK server, which copies
the data to the ZK leader, which broadcasts it to all servers in the cluster, which then commit,
allowing the original server to respond to the client). If the znode is changing infrequently,
then it's no big deal, but in general you don't want to do this. Also, the ZK server has a 1 MB
max data size by default, so if you increase the number of tables (etc...) you will bump into
this limit at some point.
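
For reference, that 1 MB default comes from ZooKeeper's jute.maxbuffer setting, which has to be
raised consistently on servers and clients if you ever change it. A minimal sketch of keeping
payloads in check on the write path, assuming the standard ZooKeeper Java client; the class name
and the 512 KB soft cap are illustrative, not ZK constants:

{{{
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Illustrative guard: keep znode payloads well under the default ~1 MB limit
// (controlled by the jute.maxbuffer system property on servers and clients).
public final class ZnodeWriteGuard {
    private static final int SOFT_LIMIT = 512 * 1024; // our own conservative cap, not a ZK constant

    public static void writeSchemaAndState(ZooKeeper zk, String path, byte[] payload)
            throws KeeperException, InterruptedException {
        if (payload.length > SOFT_LIMIT) {
            throw new IllegalArgumentException("payload too large for a single znode: " + payload.length);
        }
        zk.setData(path, payload, -1); // -1 = don't check the znode version
    }
}
}}}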
  
+ [PDH] Hence my original assumption, and suggestion. Consider having a znode per table, rather
than a single znode. It's more scalable and should be better in general. That's up to you
though - 1 znode will work too.
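
A minimal sketch of that znode-per-table layout from the master's side, assuming the standard
ZooKeeper Java client; the /hbase/tables path, class name, and payload encoding are illustrative
only:

{{{
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Master side: one znode per table under a common parent, holding that table's
// schema delta and state.
public final class TableZnodes {
    public static void publishTable(ZooKeeper zk, String table, byte[] schemaAndState)
            throws KeeperException, InterruptedException {
        String path = "/hbase/tables/" + table;
        try {
            zk.create(path, schemaAndState, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException e) {
            zk.setData(path, schemaAndState, -1); // already published; overwrite (-1 = any version)
        }
    }
}
}}}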
- [PDH] So this means a region server watches each of 100 tables.
-  * 100 * 1000 = 100k watches, with each of 1000 region servers watching 100 table znodes
-  * watches typically fire as a group per table (i.e. on/off/ro/drop of each table)
-   * 1000 watches would fire, notifying 1000 region servers, each time a table changes
  
- [MS] I was thinking one znode of state and schema.  RegionServers would all have a watch
on it.  100s of tables means that a schema change on any table would trigger watches on 1000s
of RegionServers.  That might be OK though because any RegionServer could be carrying a Region
from the edited table.
+ [PDH] A single table can change, right? Not all the tables necessarily change state at the
same time? Then splitting into multiple znodes makes even more sense - you would only be changing
(writing) the things that change, and even better, the watchers will know the exact table
that changed rather than having to determine it by reading the data and diffing... I'm no expert
on hbase, but from a typical ZK use case perspective this is better. Obviously this is a bit more
complex than a single znode, and there are more (separate) notifications that will fire instead
of a single one... so you'd have to think through your use case. (You could have a toplevel
"state" znode that brings down all the tables in the case where all the tables need to go down...
then you wouldn't have to change each table individually for that case (all tables down for
whatever reason).)
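
A sketch of the RegionServer side under that split layout, again assuming the standard ZooKeeper
Java client; the /hbase/tables path and class name are illustrative:

{{{
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// RegionServer side: a data watch per table it serves. The event's path tells it
// exactly which table changed, so there is no need to re-read and diff everything.
public final class TableWatcher implements Watcher {
    private final ZooKeeper zk;

    public TableWatcher(ZooKeeper zk) { this.zk = zk; }

    public byte[] watchTable(String table) throws KeeperException, InterruptedException {
        // Reading with this object as the watcher (re-)registers a one-shot watch.
        return zk.getData("/hbase/tables/" + table, this, new Stat());
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            String table = event.getPath().substring("/hbase/tables/".length());
            try {
                byte[] latest = watchTable(table); // re-read and re-arm the watch
                // ... apply the new schema/state for this one table ...
            } catch (KeeperException | InterruptedException e) {
                // a real implementation handles retry / session loss here
            }
        }
    }
}
}}}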
  
  General recipe implemented: A better description of the problem and a sketch of the solution
can be found at [[http://wiki.apache.org/hadoop/Hbase/MasterRewrite#tablestate|Master Rewrite:
Table State]]
  
  [PDH] this is essentially the "dynamic configuration" use case - we are telling each region
server the state of each table containing a region it manages; when the master changes the state,
the watchers are notified
  
  [MS] Is "dynamic configuration' usecase a zk usecase type described somewhere?
+ 
+ [PDH] What we have is described [[http://hadoop.apache.org/zookeeper/docs/current/recipes.html#sc_outOfTheBox|here]].
Not much to it really - for both name service and dynamic config you are creating znodes that
store relevant data, and ZK clients can read/write/watch those nodes.
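
A minimal sketch of the name-service half of that recipe, assuming the standard ZooKeeper Java
client; the /hbase/tables parent and class name are illustrative:

{{{
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Name-service flavor of the same recipe: the children of a well-known parent znode
// are the "names" (here, table names); a child watch fires when the set changes.
public final class TableDirectory {
    public static List<String> listTables(ZooKeeper zk)
            throws KeeperException, InterruptedException {
        // true = register the session's default watcher for NodeChildrenChanged events
        return zk.getChildren("/hbase/tables", true);
    }
}
}}}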
  
  === Case 2 ===
  Summary: HBase Region Transitions from unassigned to open and from open to unassigned with
some intermediate states
@@ -87, +84 @@

  
  [MS]  ZK will do the increment for us?  This looks good too.
  
+ [PDH] Right, the "increment" is using the SEQUENTIAL flag on create
+ 
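
A minimal sketch of a SEQUENTIAL create, assuming the standard ZooKeeper Java client; the
/hbase/regions path, prefix, and class name are illustrative:

{{{
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// With the SEQUENTIAL flag, ZK appends a monotonically increasing, zero-padded
// counter to the name, e.g. /hbase/regions/region-0000000042.
public final class SequentialCreate {
    public static String addRegion(ZooKeeper zk, byte[] regionInfo)
            throws KeeperException, InterruptedException {
        return zk.create("/hbase/regions/region-", regionInfo,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
    }
}
}}}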
  
  Is any metadata stored for a region znode (i.e. to identify it)? As long as the size is small,
no problem. (If it's a bit larger, consider /regions/<regionX> znodes, which give a list of all
regions and their identity; otherwise read-only data is fine too.)
  
@@ -105, +104 @@

  
  [MS] Really?  This sounds great Patrick.  Let me take a closer look.....  Excellent.
  
+ [PDH] Obviously you need the hardware (JVM heap especially, and I/O bandwidth) to support it,
and the GC needs to be tuned properly to reduce pausing (which causes timeouts/expirations), but
100k is not that much.
+ 
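For context on why GC pauses matter here: the session timeout is set when the ZooKeeper handle is
created, and a pause longer than that timeout shows up as a session expiration. A minimal sketch,
assuming the standard ZooKeeper Java client; the connect string and 30s timeout are illustrative:

{{{
import java.io.IOException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// The session timeout (here 30s) is the window a client has to reconnect after a
// pause or network hiccup; a GC pause longer than this expires the session.
public final class SessionSetup {
    public static ZooKeeper connect() throws IOException {
        return new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.Expired) {
                    // Session is gone: ephemeral nodes and watches are lost; a real
                    // RegionServer would have to re-register itself here.
                }
            }
        });
    }
}
}}}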
  
  See [[http://bit.ly/4ekN8G|this perf doc]] for some ideas: 20 clients doing 50k watches
each - 1 million watches on a single-core standalone server and still << 5ms average response
time (async ops, keep that in mind re implementation time). YMMV of course, but your numbers
are well below this.
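
Since that test used async ops, here is a minimal sketch of the asynchronous read path, assuming
the standard ZooKeeper Java client; the path and class name are illustrative:

{{{
import org.apache.zookeeper.AsyncCallback;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Asynchronous getData: the call returns immediately and the callback runs on the
// client's event thread, which is how one client can keep very many watches cheaply.
public final class AsyncRead {
    public static void readTable(ZooKeeper zk, String table) {
        zk.getData("/hbase/tables/" + table, true, new AsyncCallback.DataCallback() {
            @Override
            public void processResult(int rc, String path, Object ctx, byte[] data, Stat stat) {
                // rc maps to KeeperException.Code; 0 (OK) means data/stat are valid.
            }
        }, null);
    }
}
}}}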
  
@@ -121, +122 @@

  
  [MS] Excellent.
  
+ [PDH] Think about other potential worst-case scenarios; this is key to proper operation
of the system, especially around "herd" effects and trying to minimize those.
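
One common way to keep herd effects down, offered only as a sketch of the general technique: when
many clients wait on the same condition via sequential znodes, each one watches only the znode
immediately ahead of it rather than the shared parent, so a change wakes a single client instead
of all of them. Assuming the standard ZooKeeper Java client; paths and class name are illustrative:

{{{
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Anti-herd pattern: watch only the predecessor of your own sequential znode,
// so each notification wakes a single waiter instead of every one of them.
public final class PredecessorWatch {
    public static void watchPredecessor(ZooKeeper zk, String parent, String myNode)
            throws KeeperException, InterruptedException {
        List<String> children = zk.getChildren(parent, false);
        Collections.sort(children); // sequential suffixes sort into creation order
        int me = children.indexOf(myNode);
        if (me > 0) {
            // exists() with watch=true fires when the predecessor goes away;
            // a real implementation re-checks if it returns null (already gone).
            zk.exists(parent + "/" + children.get(me - 1), true);
        }
    }
}
}}}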
+ 
  [PDH end]
  
  General recipe implemented: None yet.  Need help.  Was thinking of keeping queues up in
zk -- queues per regionserver of regions for it to open/close, etc.  But the list of all regions
is kept elsewhere currently, and probably for the foreseeable future, out in our .META. catalog
table.  Some further description can be found here: [[http://wiki.apache.org/hadoop/Hbase/MasterRewrite#regionstate|Master
Rewrite: Region State]]
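
A minimal sketch of the per-regionserver queue idea, assuming the standard ZooKeeper Java client;
the /hbase/rs/<server>/queue layout, class name, and command encoding are illustrative only:

{{{
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Simple ZK queue: the master enqueues open/close commands as sequential children of a
// per-regionserver queue znode; the regionserver drains them in order and deletes each.
public final class RegionServerQueue {
    public static void enqueue(ZooKeeper zk, String server, byte[] command)
            throws KeeperException, InterruptedException {
        zk.create("/hbase/rs/" + server + "/queue/cmd-", command,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
    }

    public static void drain(ZooKeeper zk, String server)
            throws KeeperException, InterruptedException {
        String queue = "/hbase/rs/" + server + "/queue";
        List<String> cmds = zk.getChildren(queue, true); // watch for new commands
        Collections.sort(cmds);                          // process in enqueue order
        for (String cmd : cmds) {
            byte[] command = zk.getData(queue + "/" + cmd, false, null);
            // ... open/close the region described by command ...
            zk.delete(queue + "/" + cmd, -1); // done; -1 = any version
        }
    }
}
}}}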
