hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "ZooKeeper/HBaseMeetingNotes" by PatrickHunt
Date Fri, 23 Oct 2009 18:59:07 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "ZooKeeper/HBaseMeetingNotes" page has been changed by PatrickHunt.
http://wiki.apache.org/hadoop/ZooKeeper/HBaseMeetingNotes

--------------------------------------------------

New page:
== Meeting notes - ZooKeeper use in HBase ==

=== 10/23/2009 phunt, mahadev, jdcryans, stack ===
Reviewed issues identified by the HBase team. Also discussed how they might better troubleshoot
when problems occur. Mentioned that they (hbase) should feel free to ping the zk user list
if they have questions; filing JIRAs is also welcome for specific ideas or concerns.

The HBase team is going to create JIRAs for the issues they identified (especially improved logging).

Phunt also asked the HBase team to write up some use cases describing how they are currently
using zk. This will allow us (zk) to better understand their usage, provide better advice,
vet their choices, and in some cases even test to verify under realistic conditions.

Some questions identified:

 * Stack - my impression is that zk is too flaky and sensitive at the moment in hbase deploys
(I'm talking particularly about the new-user experience). I'd think a single server, with
its Nk reads/second and fairly rare updates, should be able to host a big hbase cluster with
a session timeout of 10-20 seconds, but we find ourselves recommending quorums of
3-5 on dedicated machines with sessions of 60 seconds, because masters and regionservers
too often time out against zk. GC on the RS can be a killer; pauses of tens of seconds. CMS helps some.
  What is the experience of other users of zk? Do you have recommendations for us regarding
deployment? When people are having trouble in the field with session timeouts, how do you
debug/approach the issue?

zk: the biggest issues we typically see are: gc causing the vm to stall (client or server; we have
no control over the jvm, unfortunately -- in hadoop I've heard they can see gc pauses of over 60 seconds),
server disk io causing stalls (sequential consistency means that reads have to wait for writes
to complete), and network connectivity problems. JVM tools such as gc logging
and jvisualvm may help to track these down. Overloading a host (running high-cpu/disk
processes on the same box(es) as zookeeper) will obviously slow down processing.

also see: https://issues.apache.org/jira/browse/ZOOKEEPER-545
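
A first step in diagnosing gc-related session timeouts is to turn on GC logging on both the zk server and the client JVMs. A minimal sketch of HotSpot flags, assuming the server is started via the standard scripts that honor JVMFLAGS (the log path and the choice of CMS are illustrative):

```shell
# Illustrative GC-diagnostic flags; adjust the log path for your install.
# Long pauses recorded here line up with session timeouts in the zk logs.
export JVMFLAGS="-verbose:gc \
  -XX:+PrintGCDetails \
  -XX:+PrintGCTimeStamps \
  -Xloggc:/var/log/zookeeper/gc.log \
  -XX:+UseConcMarkSweepGC"
```

Correlating the timestamps of long GC pauses in this log with the session-expiration timestamps on the server is usually enough to confirm or rule out the vm-stall theory.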

 * hbase - We'd like to know what kind of limits a 3-node quorum has. What is a reasonable
number of znodes, reads, and writes per second it can support? How many watches?

zk: this is the easiest to see: http://hadoop.apache.org/zookeeper/docs/current/zookeeperOver.html#Performance
-- however, those numbers are for optimum conditions (a dedicated server-class host with sufficient
memory and a dedicated spindle for the log) and a large number of clients. With a smaller number
of clients using synchronous operations, you are probably limited more by network latency than anything else.
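
The latency-bound point can be made concrete with back-of-envelope arithmetic: a single client issuing synchronous (blocking) operations can have at most one request in flight, so its throughput is capped at one op per round trip, regardless of how much the quorum could serve. A sketch (all figures illustrative):

```java
// A single synchronous client issues one request per network round trip,
// so its throughput ceiling is 1/RTT -- far below the quorum's capacity.
public class SyncThroughput {
    // Maximum ops/sec for one blocking client given a round-trip time in ms.
    static double maxOpsPerSec(double rttMillis) {
        return 1000.0 / rttMillis;
    }

    public static void main(String[] args) {
        // A 1 ms RTT caps a synchronous client at ~1000 ops/sec even though
        // the benchmark page shows the quorum serving tens of thousands.
        System.out.println(maxOpsPerSec(1.0)); // 1000.0
    }
}
```

This is why the published benchmark numbers assume many clients (or async operations): request pipelining is what lets the quorum reach its advertised throughput.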

 * hbase - How expensive is a WatchEvent? What if we had a watcher per region hosted by a
regionserver? Let's say up to 1k regions on a regionserver. That's 1k watchers. Let's say
a cluster of 1-200 regionservers. That's 200k watchers the zk cluster needs to manage.

zk: very inexpensive by design; the intent was to support large numbers of watchers. I've
tested a single client creating 100k ephemeral nodes on session a, then session b setting
sync watches on all nodes, then session c setting async watches on all nodes, then closing
session a. This was all done on a 1-core, 5-year-old laptop, and the 200k watches were delivered
in < 10 sec. Granted, that's all the zk cluster was doing, but it gives you some idea.

 * hbase - We'd like to have queues up in zk: queues of regions moving through state changes
from unassigned to opening to open. We'd also like to keep a total list of all regions and then
list them by the regionserver they are assigned to. What would you suggest?
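
One common recipe for queues in zk is to enqueue items as SEQUENTIAL znodes and have consumers process children in sequence order. The znode naming below (a "region-" prefix with zk's appended 10-digit counter) is illustrative, not an existing HBase convention; this sketch shows only the client-side ordering step, since getChildren() returns names in no particular order:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of the ordering step in a ZooKeeper queue recipe: regions are
// enqueued by creating znodes with the SEQUENTIAL flag, which appends a
// monotonically increasing counter (e.g. "region-0000000003").
public class ZkQueueOrder {
    // Parse the sequence counter ZooKeeper appended to the znode name.
    static int sequenceOf(String znode) {
        return Integer.parseInt(znode.substring(znode.lastIndexOf('-') + 1));
    }

    // getChildren() gives no ordering guarantee; sort by sequence number
    // to recover FIFO queue order.
    static List<String> fifoOrder(List<String> children) {
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingInt(ZkQueueOrder::sequenceOf));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList(
            "region-0000000007", "region-0000000002", "region-0000000005");
        System.out.println(fifoOrder(children));
        // [region-0000000002, region-0000000005, region-0000000007]
    }
}
```

A consumer would take the lowest-sequence child, act on it, and delete it; a watch on the parent znode signals when new items arrive.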


Background:

Below are some notes from our master rewrite design doc. It's from the section where we talk
about moving all state and schemas to zk:

 * Tables are offlined, onlined, made read-only, and dropped (add a freeze-flushes-and-compactions
state to facilitate snapshotting). Currently the HBase Master does this by messaging regionservers.
Instead, move this state to zookeeper and let regionservers watch for changes and react. Allow that
a cluster may have up to 100 tables. Tables are made of regions, and there may be thousands of
regions per table. A regionserver could be carrying a region from each of the 100 tables.
TODO: Should a regionserver have a table watcher or a watcher per region?

 * Tables have schema. Tables are made of column families. Column families have schema/attributes.
Column families can be added and removed. Currently the schema is written into a column in
the .META. catalog family. Move all schema to zookeeper. Regionservers would have watchers
on schema and would react to changes. TODO: A watcher per column family or a watcher per table
or a watcher on the parent directory for schema?

 * Run region state transitions -- i.e. opening, closing -- by changing state in zookeeper
rather than in Master maps as is currently done.

 * Keep up a region transition trail; regions move through states from unassigned to opening
to open, etc. A region can't jump states as in going from unassigned to open.
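
The "no jumping states" rule described above amounts to a small state machine that any actor (master, client, or regionserver) can validate before writing a transition to zk. A sketch, assuming an illustrative set of states and names (not actual HBase code):

```java
// Sketch of the region transition trail: a region must pass through each
// state in order; e.g. it can't jump from UNASSIGNED straight to OPEN.
public class RegionTransition {
    enum State { UNASSIGNED, OPENING, OPEN, CLOSING, CLOSED }

    // Only adjacent transitions are legal. CLOSED wraps back to UNASSIGNED,
    // making the lifecycle a circle (the property the current master lacks).
    static boolean isLegal(State from, State to) {
        switch (from) {
            case UNASSIGNED: return to == State.OPENING;
            case OPENING:    return to == State.OPEN;
            case OPEN:       return to == State.CLOSING;
            case CLOSING:    return to == State.CLOSED;
            case CLOSED:     return to == State.UNASSIGNED;
            default:         return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isLegal(State.UNASSIGNED, State.OPENING)); // true
        System.out.println(isLegal(State.UNASSIGNED, State.OPEN));    // false
    }
}
```

Rejecting illegal writes at this check is what keeps the trail in zk an honest audit log of each region's history.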

 * Master (or client) moves regions between states. Watchers on RegionServers notice changes
and act on them. Master (or client) can do transitions in bulk; e.g. assign a regionserver 50
regions to open on startup. The effect is that the Master "pushes" work out to regionservers
rather than waiting on them to heartbeat.

 * A problem we have in the current master is that states do not make a circle. Once a region
is open, the master stops keeping account of a region's state; region state is now kept out in
the .META. catalog table, with its condition checked periodically by a .META. table scan. State
spanning two systems makes for confusion and evil such as region double assignment,
because there are race-condition potholes as we move from one system -- internal state maps
in the master -- to the other during the update of state in .META. Current thinking is to keep the
whole region lifecycle up in zookeeper, but that won't scale. Postulate 100k regions -- 100TB at
1GB per region -- each with two or three possible states, each state with watchers for changes.
My guess is that this is too much to put in zk. TODO: how to manage the transition from zk to .META.?
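
The figures in that postulate work out as follows (a back-of-envelope sketch using only numbers from the text; the per-region znode count is the stated "two or three states"):

```java
// Rough arithmetic behind the scaling concern: 100TB of data at ~1GB per
// region, with up to three state znodes per region kept in zookeeper.
public class RegionScale {
    static long regions(long totalGB, long regionGB) {
        return totalGB / regionGB;
    }

    public static void main(String[] args) {
        long r = regions(100_000, 1);  // 100TB = 100,000 GB -> 100k regions
        long stateZnodes = r * 3;      // up to three state znodes per region
        System.out.println(r);           // 100000
        System.out.println(stateZnodes); // 300000
    }
}
```

Hundreds of thousands of znodes, each with one or more watchers, is the load the TODO is worried about -- hence the question of when to hand a region's state off from zk to .META.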

 * State and Schema are distinct in zk. No interactions.
