Subject: svn commit: r1389153 - in /hbase/trunk/src/docbkx: book.xml configuration.xml getting_started.xml performance.xml zookeeper.xml
Date: Sun, 23 Sep 2012 22:01:17 -0000
To: commits@hbase.apache.org
From: stack@apache.org
Reply-To: dev@hbase.apache.org
Message-Id: <20120923220117.ED20C23889D7@eris.apache.org>

Author: stack
Date: Sun Sep 23 22:01:16 2012
New Revision: 1389153

URL: http://svn.apache.org/viewvc?rev=1389153&view=rev
Log:
More edits: Moved ZK to its own chapter, put the bloom filter stuff together in one place, made the distributed setup more focused

Added:
    hbase/trunk/src/docbkx/zookeeper.xml
Modified:
    hbase/trunk/src/docbkx/book.xml
    hbase/trunk/src/docbkx/configuration.xml
    hbase/trunk/src/docbkx/getting_started.xml
    hbase/trunk/src/docbkx/performance.xml

Modified: hbase/trunk/src/docbkx/book.xml
URL: http://svn.apache.org/viewvc/hbase/trunk/src/docbkx/book.xml?rev=1389153&r1=1389152&r2=1389153&view=diff
==============================================================================
--- hbase/trunk/src/docbkx/book.xml (original)
+++ hbase/trunk/src/docbkx/book.xml Sun Sep 23 22:01:16 2012
@@ -2318,65 +2318,6 @@ myHtd.setValue(HTableDescriptor.SPLIT_PO
- -
- Bloom Filters - Bloom filters were developed over in HBase-1200 - Add bloomfilters. - For description of the development process -- why static blooms - rather than dynamic -- and for an overview of the unique properties - that pertain to blooms in HBase, as well as possible future - directions, see the Development Process section - of the document BloomFilters - in HBase attached to HBase-1200. - - The bloom filters described here are actually version two of - blooms in HBase. In versions up to 0.19.x, HBase had a dynamic bloom - option based on work done by the European Commission One-Lab - Project 034819. The core of the HBase bloom work was later - pulled up into Hadoop to implement org.apache.hadoop.io.BloomMapFile. - Version 1 of HBase blooms never worked that well. Version 2 is a - rewrite from scratch though again it starts with the one-lab - work. - - See also and . - - -
- Bloom StoreFile footprint - - Bloom filters add an entry to the StoreFile - general FileInfo data structure and then two - extra entries to the StoreFile metadata - section. - -
- BloomFilter in the <classname>StoreFile</classname> - <classname>FileInfo</classname> data structure - - FileInfo has a - BLOOM_FILTER_TYPE entry which is set to - NONE, ROW or - ROWCOL. -
- -
- BloomFilter entries in <classname>StoreFile</classname> - metadata - - BLOOM_FILTER_META holds Bloom Size, Hash - Function used, etc. Its small in size and is cached on - StoreFile.Reader load - BLOOM_FILTER_DATA is the actual bloomfilter - data. Obtained on-demand. Stored in the LRU cache, if it is enabled - (Its enabled by default). -
-
-
@@ -2519,6 +2460,7 @@ myHtd.setValue(HTableDescriptor.SPLIT_PO + Modified: hbase/trunk/src/docbkx/configuration.xml URL: http://svn.apache.org/viewvc/hbase/trunk/src/docbkx/configuration.xml?rev=1389153&r1=1389152&r2=1389153&view=diff ============================================================================== --- hbase/trunk/src/docbkx/configuration.xml (original) +++ hbase/trunk/src/docbkx/configuration.xml Sun Sep 23 22:01:16 2012 @@ -27,8 +27,10 @@ */ --> Configuration - This chapter is the Not-So-Quick start guide to HBase configuration. - Please read this chapter carefully and ensure that all requirements have + This chapter is the Not-So-Quick start guide to HBase configuration. It goes + over system requirements, Hadoop setup, the different HBase run modes, and the + various configurations in HBase. Please read this chapter carefully and ensure + that all requirements have been satisfied. Failure to do so will cause you (and us) grief debugging strange errors and/or data loss. @@ -56,6 +58,10 @@ to ensure well-formedness of your docume all nodes of the cluster. HBase will not do this for you. Use rsync. +
+ Basic Requirements + This section lists required services and some required system configuration. +
Java @@ -237,7 +243,6 @@ to ensure well-formedness of your docume Currently only Hadoop versions 0.20.205.x or any release in excess of this version -- this includes hadoop 1.0.0 -- have a working, durable sync - On Hadoop Versions The Cloudera blog post An update on Apache Hadoop 1.0 by Charles Zedlewski has a nice exposition on how all the Hadoop versions relate. It's worth checking out if you are having trouble making sense of the @@ -352,6 +357,7 @@ to ensure well-formedness of your docume
+
HBase run modes: Standalone and Distributed @@ -686,565 +692,6 @@ stopping hbase...............
-
- ZooKeeper<indexterm> - <primary>ZooKeeper</primary> - </indexterm> - - A distributed HBase depends on a running ZooKeeper cluster. - All participating nodes and clients need to be able to access the - running ZooKeeper ensemble. HBase by default manages a ZooKeeper - "cluster" for you. It will start and stop the ZooKeeper ensemble - as part of the HBase start/stop process. You can also manage the - ZooKeeper ensemble independent of HBase and just point HBase at - the cluster it should use. To toggle HBase management of - ZooKeeper, use the HBASE_MANAGES_ZK variable in - conf/hbase-env.sh. This variable, which - defaults to true, tells HBase whether to - start/stop the ZooKeeper ensemble servers as part of HBase - start/stop. - - When HBase manages the ZooKeeper ensemble, you can specify - ZooKeeper configuration using its native - zoo.cfg file, or, the easier option is to - just specify ZooKeeper options directly in - conf/hbase-site.xml. A ZooKeeper - configuration option can be set as a property in the HBase - hbase-site.xml XML configuration file by - prefacing the ZooKeeper option name with - hbase.zookeeper.property. For example, the - clientPort setting in ZooKeeper can be changed - by setting the - hbase.zookeeper.property.clientPort property. - For all default values used by HBase, including ZooKeeper - configuration, see . Look for the - hbase.zookeeper.property prefix - For the full list of ZooKeeper configurations, see - ZooKeeper's zoo.cfg. HBase does not ship - with a zoo.cfg so you will need to browse - the conf directory in an appropriate - ZooKeeper download. - - - You must at least list the ensemble servers in - hbase-site.xml using the - hbase.zookeeper.quorum property. This property - defaults to a single ensemble member at - localhost which is not suitable for a fully - distributed HBase. (It binds to the local machine only and remote - clients will not be able to connect). - How many ZooKeepers should I run? - - You can run a ZooKeeper ensemble that comprises 1 node - only but in production it is recommended that you run a - ZooKeeper ensemble of 3, 5 or 7 machines; the more members an - ensemble has, the more tolerant the ensemble is of host - failures. Also, run an odd number of machines. In ZooKeeper, - an even number of peers is supported, but it is normally not used - because an even sized ensemble requires, proportionally, more peers - to form a quorum than an odd sized ensemble requires. For example, an - ensemble with 4 peers requires 3 to form a quorum, while an ensemble with - 5 also requires 3 to form a quorum. Thus, an ensemble of 5 allows 2 peers to - fail, and thus is more fault tolerant than the ensemble of 4, which allows - only 1 down peer. - - Give each ZooKeeper server around 1GB of RAM, and if possible, its own - dedicated disk (A dedicated disk is the best thing you can do - to ensure a performant ZooKeeper ensemble). For very heavily - loaded clusters, run ZooKeeper servers on separate machines - from RegionServers (DataNodes and TaskTrackers). - - - For example, to have HBase manage a ZooKeeper quorum on - nodes rs{1,2,3,4,5}.example.com, bound to - port 2222 (the default is 2181) ensure - HBASE_MANAGE_ZK is commented out or set to - true in conf/hbase-env.sh - and then edit conf/hbase-site.xml and set - hbase.zookeeper.property.clientPort and - hbase.zookeeper.quorum. 
You should also set - hbase.zookeeper.property.dataDir to other than - the default as the default has ZooKeeper persist data under - /tmp which is often cleared on system - restart. In the example below we have ZooKeeper persist to - /user/local/zookeeper. - <configuration> - ... - <property> - <name>hbase.zookeeper.property.clientPort</name> - <value>2222</value> - <description>Property from ZooKeeper's config zoo.cfg. - The port at which the clients will connect. - </description> - </property> - <property> - <name>hbase.zookeeper.quorum</name> - <value>rs1.example.com,rs2.example.com,rs3.example.com,rs4.example.com,rs5.example.com</value> - <description>Comma separated list of servers in the ZooKeeper Quorum. - For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com". - By default this is set to localhost for local and pseudo-distributed modes - of operation. For a fully-distributed setup, this should be set to a full - list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh - this is the list of servers which we will start/stop ZooKeeper on. - </description> - </property> - <property> - <name>hbase.zookeeper.property.dataDir</name> - <value>/usr/local/zookeeper</value> - <description>Property from ZooKeeper's config zoo.cfg. - The directory where the snapshot is stored. - </description> - </property> - ... - </configuration> - -
- Using existing ZooKeeper ensemble - - To point HBase at an existing ZooKeeper cluster, one that - is not managed by HBase, set HBASE_MANAGES_ZK - in conf/hbase-env.sh to false - - ... - # Tell HBase whether it should manage its own instance of Zookeeper or not. - export HBASE_MANAGES_ZK=false Next set ensemble locations - and client port, if non-standard, in - hbase-site.xml, or add a suitably - configured zoo.cfg to HBase's - CLASSPATH. HBase will prefer the - configuration found in zoo.cfg over any - settings in hbase-site.xml. - - When HBase manages ZooKeeper, it will start/stop the - ZooKeeper servers as a part of the regular start/stop scripts. - If you would like to run ZooKeeper yourself, independent of - HBase start/stop, you would do the following - - -${HBASE_HOME}/bin/hbase-daemons.sh {start,stop} zookeeper - - - Note that you can use HBase in this manner to spin up a - ZooKeeper cluster, unrelated to HBase. Just make sure to set - HBASE_MANAGES_ZK to false - if you want it to stay up across HBase restarts so that when - HBase shuts down, it doesn't take ZooKeeper down with it. - - For more information about running a distinct ZooKeeper - cluster, see the ZooKeeper Getting - Started Guide. Additionally, see the ZooKeeper Wiki or the - ZooKeeper documentation - for more information on ZooKeeper sizing. - -
- - -
- SASL Authentication with ZooKeeper - Newer releases of HBase (>= 0.92) will - support connecting to a ZooKeeper Quorum that supports - SASL authentication (which is available in Zookeeper - versions 3.4.0 or later). - - This describes how to set up HBase to mutually - authenticate with a ZooKeeper Quorum. ZooKeeper/HBase - mutual authentication (HBASE-2418) - is required as part of a complete secure HBase configuration - (HBASE-3025). - - For simplicity of explication, this section ignores - additional configuration required (Secure HDFS and Coprocessor - configuration). It's recommended to begin with an - HBase-managed Zookeeper configuration (as opposed to a - standalone Zookeeper quorum) for ease of learning. - - -
Operating System Prerequisites
- - - You need to have a working Kerberos KDC setup. For - each $HOST that will run a ZooKeeper - server, you should have a principle - zookeeper/$HOST. For each such host, - add a service key (using the kadmin or - kadmin.local tool's ktadd - command) for zookeeper/$HOST and copy - this file to $HOST, and make it - readable only to the user that will run zookeeper on - $HOST. Note the location of this file, - which we will use below as - $PATH_TO_ZOOKEEPER_KEYTAB. - - - - Similarly, for each $HOST that will run - an HBase server (master or regionserver), you should - have a principle: hbase/$HOST. For each - host, add a keytab file called - hbase.keytab containing a service - key for hbase/$HOST, copy this file to - $HOST, and make it readable only to the - user that will run an HBase service on - $HOST. Note the location of this file, - which we will use below as - $PATH_TO_HBASE_KEYTAB. - - - - Each user who will be an HBase client should also be - given a Kerberos principal. This principal should - usually have a password assigned to it (as opposed to, - as with the HBase servers, a keytab file) which only - this user knows. The client's principal's - maxrenewlife should be set so that it can - be renewed enough so that the user can complete their - HBase client processes. For example, if a user runs a - long-running HBase client process that takes at most 3 - days, we might create this user's principal within - kadmin with: addprinc -maxrenewlife - 3days. The Zookeeper client and server - libraries manage their own ticket refreshment by - running threads that wake up periodically to do the - refreshment. - - - On each host that will run an HBase client - (e.g. hbase shell), add the following - file to the HBase home directory's conf - directory: - - - Client { - com.sun.security.auth.module.Krb5LoginModule required - useKeyTab=false - useTicketCache=true; - }; - - - We'll refer to this JAAS configuration file as - $CLIENT_CONF below. - -
- HBase-managed Zookeeper Configuration - - On each node that will run a zookeeper, a - master, or a regionserver, create a JAAS - configuration file in the conf directory of the node's - HBASE_HOME directory that looks like the - following: - - - Server { - com.sun.security.auth.module.Krb5LoginModule required - useKeyTab=true - keyTab="$PATH_TO_ZOOKEEPER_KEYTAB" - storeKey=true - useTicketCache=false - principal="zookeeper/$HOST"; - }; - Client { - com.sun.security.auth.module.Krb5LoginModule required - useKeyTab=true - useTicketCache=false - keyTab="$PATH_TO_HBASE_KEYTAB" - principal="hbase/$HOST"; - }; - - - where the $PATH_TO_HBASE_KEYTAB and - $PATH_TO_ZOOKEEPER_KEYTAB files are what - you created above, and $HOST is the hostname for that - node. - - The Server section will be used by - the Zookeeper quorum server, while the - Client section will be used by the HBase - master and regionservers. The path to this file should - be substituted for the text $HBASE_SERVER_CONF - in the hbase-env.sh - listing below. - - - The path to this file should be substituted for the - text $CLIENT_CONF in the - hbase-env.sh listing below. - - - Modify your hbase-env.sh to include the - following: - - - export HBASE_OPTS="-Djava.security.auth.login.config=$CLIENT_CONF" - export HBASE_MANAGES_ZK=true - export HBASE_ZOOKEEPER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF" - export HBASE_MASTER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF" - export HBASE_REGIONSERVER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF" - - - where $HBASE_SERVER_CONF and - $CLIENT_CONF are the full paths to the - JAAS configuration files created above. - - Modify your hbase-site.xml on each node - that will run zookeeper, master or regionserver to contain: - - - - hbase.zookeeper.quorum - $ZK_NODES - - - hbase.cluster.distributed - true - - - hbase.zookeeper.property.authProvider.1 - org.apache.zookeeper.server.auth.SASLAuthenticationProvider - - - hbase.zookeeper.property.kerberos.removeHostFromPrincipal - true - - - hbase.zookeeper.property.kerberos.removeRealmFromPrincipal - true - - - ]]> - - where $ZK_NODES is the - comma-separated list of hostnames of the Zookeeper - Quorum hosts. - - Start your hbase cluster by running one or more - of the following set of commands on the appropriate - hosts: - - - - bin/hbase zookeeper start - bin/hbase master start - bin/hbase regionserver start - - -
- -
External Zookeeper Configuration - Add a JAAS configuration file that looks like: - - - Client { - com.sun.security.auth.module.Krb5LoginModule required - useKeyTab=true - useTicketCache=false - keyTab="$PATH_TO_HBASE_KEYTAB" - principal="hbase/$HOST"; - }; - - - where the $PATH_TO_HBASE_KEYTAB is the keytab - created above for HBase services to run on this host, and $HOST is the - hostname for that node. Put this in the HBase home's - configuration directory. We'll refer to this file's - full pathname as $HBASE_SERVER_CONF below. - - Modify your hbase-env.sh to include the following: - - - export HBASE_OPTS="-Djava.security.auth.login.config=$CLIENT_CONF" - export HBASE_MANAGES_ZK=false - export HBASE_MASTER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF" - export HBASE_REGIONSERVER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF" - - - - Modify your hbase-site.xml on each node - that will run a master or regionserver to contain: - - - - hbase.zookeeper.quorum - $ZK_NODES - - - hbase.cluster.distributed - true - - - ]]> - - - where $ZK_NODES is the - comma-separated list of hostnames of the Zookeeper - Quorum hosts. - - - Add a zoo.cfg for each Zookeeper Quorum host containing: - - authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider - kerberos.removeHostFromPrincipal=true - kerberos.removeRealmFromPrincipal=true - - - Also on each of these hosts, create a JAAS configuration file containing: - - - Server { - com.sun.security.auth.module.Krb5LoginModule required - useKeyTab=true - keyTab="$PATH_TO_ZOOKEEPER_KEYTAB" - storeKey=true - useTicketCache=false - principal="zookeeper/$HOST"; - }; - - - where $HOST is the hostname of each - Quorum host. We will refer to the full pathname of - this file as $ZK_SERVER_CONF below. - - - - - Start your Zookeepers on each Zookeeper Quorum host with: - - - SERVER_JVMFLAGS="-Djava.security.auth.login.config=$ZK_SERVER_CONF" bin/zkServer start - - - - - - Start your HBase cluster by running one or more of the following set of commands on the appropriate nodes: - - - - bin/hbase master start - bin/hbase regionserver start - - - -
- -
- Zookeeper Server Authentication Log Output - If the configuration above is successful, - you should see something similar to the following in - your Zookeeper server logs: - -11/12/05 22:43:39 INFO zookeeper.Login: successfully logged in. -11/12/05 22:43:39 INFO server.NIOServerCnxnFactory: binding to port 0.0.0.0/0.0.0.0:2181 -11/12/05 22:43:39 INFO zookeeper.Login: TGT refresh thread started. -11/12/05 22:43:39 INFO zookeeper.Login: TGT valid starting at: Mon Dec 05 22:43:39 UTC 2011 -11/12/05 22:43:39 INFO zookeeper.Login: TGT expires: Tue Dec 06 22:43:39 UTC 2011 -11/12/05 22:43:39 INFO zookeeper.Login: TGT refresh sleeping until: Tue Dec 06 18:36:42 UTC 2011 -.. -11/12/05 22:43:59 INFO auth.SaslServerCallbackHandler: - Successfully authenticated client: authenticationID=hbase/ip-10-166-175-249.us-west-1.compute.internal@HADOOP.LOCALDOMAIN; - authorizationID=hbase/ip-10-166-175-249.us-west-1.compute.internal@HADOOP.LOCALDOMAIN. -11/12/05 22:43:59 INFO auth.SaslServerCallbackHandler: Setting authorizedID: hbase -11/12/05 22:43:59 INFO server.ZooKeeperServer: adding SASL authorization for authorizationID: hbase - - - - -
- -
- Zookeeper Client Authentication Log Output - On the Zookeeper client side (HBase master or regionserver), - you should see something similar to the following: - - -11/12/05 22:43:59 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=ip-10-166-175-249.us-west-1.compute.internal:2181 sessionTimeout=180000 watcher=master:60000 -11/12/05 22:43:59 INFO zookeeper.ClientCnxn: Opening socket connection to server /10.166.175.249:2181 -11/12/05 22:43:59 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 14851@ip-10-166-175-249 -11/12/05 22:43:59 INFO zookeeper.Login: successfully logged in. -11/12/05 22:43:59 INFO client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism. -11/12/05 22:43:59 INFO zookeeper.Login: TGT refresh thread started. -11/12/05 22:43:59 INFO zookeeper.ClientCnxn: Socket connection established to ip-10-166-175-249.us-west-1.compute.internal/10.166.175.249:2181, initiating session -11/12/05 22:43:59 INFO zookeeper.Login: TGT valid starting at: Mon Dec 05 22:43:59 UTC 2011 -11/12/05 22:43:59 INFO zookeeper.Login: TGT expires: Tue Dec 06 22:43:59 UTC 2011 -11/12/05 22:43:59 INFO zookeeper.Login: TGT refresh sleeping until: Tue Dec 06 18:30:37 UTC 2011 -11/12/05 22:43:59 INFO zookeeper.ClientCnxn: Session establishment complete on server ip-10-166-175-249.us-west-1.compute.internal/10.166.175.249:2181, sessionid = 0x134106594320000, negotiated timeout = 180000 - - -
- -
- Configuration from Scratch - - This has been tested on the current standard Amazon - Linux AMI. First setup KDC and principals as - described above. Next checkout code and run a sanity - check. - - - git clone git://git.apache.org/hbase.git - cd hbase - mvn -PlocalTests clean test -Dtest=TestZooKeeperACL - - - Then configure HBase as described above. - Manually edit target/cached_classpath.txt (see below).. - - - bin/hbase zookeeper & - bin/hbase master & - bin/hbase regionserver & - -
- - -
- Future improvements - -
Fix target/cached_classpath.txt - - You must override the standard hadoop-core jar file from the - target/cached_classpath.txt - file with the version containing the HADOOP-7070 fix. You can use the following script to do this: - - - echo `find ~/.m2 -name "*hadoop-core*7070*SNAPSHOT.jar"` ':' `cat target/cached_classpath.txt` | sed 's/ //g' > target/tmp.txt - mv target/tmp.txt target/cached_classpath.txt - - - - -
- -
- Set JAAS configuration - programmatically - - - This would avoid the need for a separate Hadoop jar - that fixes HADOOP-7070. -
- -
- Elimination of - <code>kerberos.removeHostFromPrincipal</code> and - <code>kerberos.removeRealmFromPrincipal</code> -
- -
- - -
- - - - - -
@@ -1704,34 +1151,4 @@ of all regions.
-
- Bloom Filter Configuration -
- <varname>io.hfile.bloom.enabled</varname> global kill - switch - - io.hfile.bloom.enabled in - Configuration serves as the kill switch in case - something goes wrong. Default = true. -
- -
- <varname>io.hfile.bloom.error.rate</varname> - - io.hfile.bloom.error.rate = average false - positive rate. Default = 1%. Decrease rate by ½ (e.g. to .5%) == +1 - bit per bloom entry. -
- -
- <varname>io.hfile.bloom.max.fold</varname> - - io.hfile.bloom.max.fold = guaranteed minimum - fold rate. Most people should leave this alone. Default = 7, or can - collapse to at least 1/128th of original size. See the - Development Process section of the document BloomFilters - in HBase for more on what this option means. -
-
Modified: hbase/trunk/src/docbkx/getting_started.xml URL: http://svn.apache.org/viewvc/hbase/trunk/src/docbkx/getting_started.xml?rev=1389153&r1=1389152&r2=1389153&view=diff ============================================================================== --- hbase/trunk/src/docbkx/getting_started.xml (original) +++ hbase/trunk/src/docbkx/getting_started.xml Sun Sep 23 22:01:16 2012 @@ -33,8 +33,9 @@ will get you up and running on a single-node instance of HBase using the local filesystem. - describes setup - of HBase in distributed mode running on top of HDFS. + describes basic system + requirements and configuration for running HBase in distributed mode + on top of HDFS.
@@ -51,7 +52,7 @@ Choose a download site from this list of Apache Download - Mirrors. Click on suggested top link. This will take you to a + Mirrors. Click on the suggested top link. This will take you to a mirror of HBase Releases. Click on the folder named stable and then download the file that ends in .tar.gz to your local filesystem; e.g. @@ -65,24 +66,21 @@ $ cd hbase- At this point, you are ready to start HBase. But before starting - it, you might want to edit conf/hbase-site.xml and - set the directory you want HBase to write to, - hbase.rootdir. - -<?xml version="1.0"?> + it, you might want to edit conf/hbase-site.xml, the + file you write your site-specific configurations into, and + set hbase.rootdir, the directory HBase writes data to, +<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>hbase.rootdir</name> <value>file:///DIRECTORY/hbase</value> </property> -</configuration> - - Replace DIRECTORY in the above with a - path to a directory where you want HBase to store its data. By default, +</configuration> Replace DIRECTORY in the above with the + path to the directory where you want HBase to store its data. By default, hbase.rootdir is set to /tmp/hbase-${user.name} which means you'll lose all - your data whenever your server reboots (Most operating systems clear + your data whenever your server reboots unless you change it (Most operating systems clear /tmp on restart).
@@ -96,7 +94,7 @@ starting Master, logging to logs/hbase-u standalone mode, HBase runs all daemons in the the one JVM; i.e. both the HBase and ZooKeeper daemons. HBase logs can be found in the logs subdirectory. Check them out especially if - HBase had trouble starting. + it seems HBase had trouble starting. Is <application>java</application> installed? @@ -108,7 +106,7 @@ starting Master, logging to logs/hbase-u options the java program takes (HBase requires java 6). If this is not the case, HBase will not start. Install java, edit conf/hbase-env.sh, uncommenting the - JAVA_HOME line pointing it to your java install. Then, + JAVA_HOME line pointing it to your java install, then, retry the steps above. @@ -154,9 +152,7 @@ hbase(main):006:0> put 'test', 'row3' cf in this example -- followed by a colon and then a column qualifier suffix (a in this case). - Verify the data insert. - - Run a scan of the table by doing the following + Verify the data insert by running a scan of the table as follows hbase(main):007:0> scan 'test' ROW COLUMN+CELL @@ -165,7 +161,7 @@ row2 column=cf:b, timestamp=128838 row3 column=cf:c, timestamp=1288380747365, value=value3 3 row(s) in 0.0590 seconds - Get a single row as follows + Get a single row hbase(main):008:0> get 'test', 'row1' COLUMN CELL @@ -198,9 +194,9 @@ stopping hbase...............Where to go next The above described standalone setup is good for testing and - experiments only. Next move on to where we'll go into - depth on the different HBase run modes, requirements and critical - configurations needed setting up a distributed HBase deploy. + experiments only. In the next chapter, , + we'll go into depth on the different HBase run modes, system requirements + running HBase, and critical configurations setting up a distributed HBase deploy. Modified: hbase/trunk/src/docbkx/performance.xml URL: http://svn.apache.org/viewvc/hbase/trunk/src/docbkx/performance.xml?rev=1389153&r1=1389152&r2=1389153&view=diff ============================================================================== --- hbase/trunk/src/docbkx/performance.xml (original) +++ hbase/trunk/src/docbkx/performance.xml Sun Sep 23 22:01:16 2012 @@ -526,6 +526,96 @@ htable.close(); too few regions then the reads could likely be served from too few nodes. See , as well as +
+ Bloom Filters + Enabling Bloom Filters can save you having to go to disk and + can help improve read latencies. + Bloom filters were developed over in HBase-1200 + Add bloomfilters. + For a description of the development process -- why static blooms + rather than dynamic -- and for an overview of the unique properties + that pertain to blooms in HBase, as well as possible future + directions, see the Development Process section + of the document BloomFilters + in HBase attached to HBase-1200. + + The bloom filters described here are actually version two of + blooms in HBase. In versions up to 0.19.x, HBase had a dynamic bloom + option based on work done by the European Commission One-Lab + Project 034819. The core of the HBase bloom work was later + pulled up into Hadoop to implement org.apache.hadoop.io.BloomMapFile. + Version 1 of HBase blooms never worked that well. Version 2 is a + rewrite from scratch, though again it starts with the one-lab + work. + + See also . + +
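[Editor's note: as a concrete illustration, the sketch below enables a ROW bloom filter on a column family from the Java client. It is a minimal example assuming the 0.92/0.94-era client API (HColumnDescriptor.setBloomFilterType and the StoreFile.BloomType enum); the table and family names are placeholders. The same attribute can also be set from the shell, e.g. BLOOMFILTER => 'ROW' on create or alter.]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.regionserver.StoreFile;

    public class BloomFilterExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Placeholder table and family names, for illustration only.
        HTableDescriptor desc = new HTableDescriptor("mytable");
        HColumnDescriptor family = new HColumnDescriptor("cf");

        // ROW blooms key on the row only; ROWCOL keys on row+column and
        // only pays off when reads ask for specific columns.
        family.setBloomFilterType(StoreFile.BloomType.ROW);
        desc.addFamily(family);

        admin.createTable(desc);
        admin.close();
      }
    }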
+ Bloom StoreFile footprint + + Bloom filters add an entry to the StoreFile + general FileInfo data structure and then two + extra entries to the StoreFile metadata + section. + +
+ BloomFilter in the <classname>StoreFile</classname> + <classname>FileInfo</classname> data structure + + FileInfo has a + BLOOM_FILTER_TYPE entry which is set to + NONE, ROW or + ROWCOL. +
+ +
+ BloomFilter entries in <classname>StoreFile</classname> + metadata + + BLOOM_FILTER_META holds Bloom Size, Hash + Function used, etc. It's small in size and is cached on + StoreFile.Reader load. + BLOOM_FILTER_DATA is the actual bloomfilter + data. Obtained on-demand. Stored in the LRU cache, if it is enabled + (It's enabled by default). +
+
+
+ Bloom Filter Configuration +
+ <varname>io.hfile.bloom.enabled</varname> global kill + switch + + io.hfile.bloom.enabled in + Configuration serves as the kill switch in case + something goes wrong. Default = true. +
+ +
+ <varname>io.hfile.bloom.error.rate</varname> + + io.hfile.bloom.error.rate = average false + positive rate. Default = 1%. Decrease rate by ½ (e.g. to .5%) == +1 + bit per bloom entry. +
+ +
+ <varname>io.hfile.bloom.max.fold</varname> + + io.hfile.bloom.max.fold = guaranteed minimum + fold rate. Most people should leave this alone. Default = 7, or can + collapse to at least 1/128th of original size. See the + Development Process section of the document BloomFilters + in HBase for more on what this option means. +
+
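[Editor's note: these three properties are plain Hadoop Configuration keys and normally live in hbase-site.xml on the cluster. Purely as an illustrative sketch of the key names and value types (the values shown are examples taken from the defaults above, not recommendations), they could be set on a Configuration instance like this:]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class BloomConfigExample {
      public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();

        // Global kill switch: set to false to disable all bloom filters.
        conf.setBoolean("io.hfile.bloom.enabled", true);

        // Target average false-positive rate; halving it (1% -> 0.5%)
        // costs roughly one extra bit per bloom entry.
        conf.setFloat("io.hfile.bloom.error.rate", 0.01f);

        // Guaranteed minimum fold rate; the default of 7 allows a bloom
        // to collapse to 1/128th (2^-7) of its allocated size.
        conf.setInt("io.hfile.bloom.max.fold", 7);

        System.out.println("io.hfile.bloom.enabled = "
            + conf.getBoolean("io.hfile.bloom.enabled", true));
      }
    }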
+
Added: hbase/trunk/src/docbkx/zookeeper.xml URL: http://svn.apache.org/viewvc/hbase/trunk/src/docbkx/zookeeper.xml?rev=1389153&view=auto ============================================================================== --- hbase/trunk/src/docbkx/zookeeper.xml (added) +++ hbase/trunk/src/docbkx/zookeeper.xml Sun Sep 23 22:01:16 2012 @@ -0,0 +1,586 @@ + + + + + ZooKeeper<indexterm> + <primary>ZooKeeper</primary> + </indexterm> + + A distributed HBase depends on a running ZooKeeper cluster. + All participating nodes and clients need to be able to access the + running ZooKeeper ensemble. HBase by default manages a ZooKeeper + "cluster" for you. It will start and stop the ZooKeeper ensemble + as part of the HBase start/stop process. You can also manage the + ZooKeeper ensemble independent of HBase and just point HBase at + the cluster it should use. To toggle HBase management of + ZooKeeper, use the HBASE_MANAGES_ZK variable in + conf/hbase-env.sh. This variable, which + defaults to true, tells HBase whether to + start/stop the ZooKeeper ensemble servers as part of HBase + start/stop. + + When HBase manages the ZooKeeper ensemble, you can specify + ZooKeeper configuration using its native + zoo.cfg file, or, the easier option is to + just specify ZooKeeper options directly in + conf/hbase-site.xml. A ZooKeeper + configuration option can be set as a property in the HBase + hbase-site.xml XML configuration file by + prefacing the ZooKeeper option name with + hbase.zookeeper.property. For example, the + clientPort setting in ZooKeeper can be changed + by setting the + hbase.zookeeper.property.clientPort property. + For all default values used by HBase, including ZooKeeper + configuration, see . Look for the + hbase.zookeeper.property prefix + For the full list of ZooKeeper configurations, see + ZooKeeper's zoo.cfg. HBase does not ship + with a zoo.cfg so you will need to browse + the conf directory in an appropriate + ZooKeeper download. + + + You must at least list the ensemble servers in + hbase-site.xml using the + hbase.zookeeper.quorum property. This property + defaults to a single ensemble member at + localhost which is not suitable for a fully + distributed HBase. (It binds to the local machine only and remote + clients will not be able to connect). + How many ZooKeepers should I run? + + You can run a ZooKeeper ensemble that comprises 1 node + only but in production it is recommended that you run a + ZooKeeper ensemble of 3, 5 or 7 machines; the more members an + ensemble has, the more tolerant the ensemble is of host + failures. Also, run an odd number of machines. In ZooKeeper, + an even number of peers is supported, but it is normally not used + because an even sized ensemble requires, proportionally, more peers + to form a quorum than an odd sized ensemble requires. For example, an + ensemble with 4 peers requires 3 to form a quorum, while an ensemble with + 5 also requires 3 to form a quorum. Thus, an ensemble of 5 allows 2 peers to + fail, and thus is more fault tolerant than the ensemble of 4, which allows + only 1 down peer. + + Give each ZooKeeper server around 1GB of RAM, and if possible, its own + dedicated disk (A dedicated disk is the best thing you can do + to ensure a performant ZooKeeper ensemble). For very heavily + loaded clusters, run ZooKeeper servers on separate machines + from RegionServers (DataNodes and TaskTrackers). 
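[Editor's note: to make the ensemble-size arithmetic above concrete, here is a small illustrative calculation (not part of HBase or ZooKeeper): an ensemble of n peers needs a strict majority, n/2 + 1 with integer division, to form a quorum, so it can lose the remaining peers and stay available.]

    public class QuorumMath {
      public static void main(String[] args) {
        for (int peers = 3; peers <= 7; peers++) {
          int quorum = peers / 2 + 1;       // strict majority needed
          int tolerated = peers - quorum;   // peers that may fail
          System.out.printf("%d peers: quorum of %d, tolerates %d failure(s)%n",
              peers, quorum, tolerated);
        }
      }
    }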
+ + + For example, to have HBase manage a ZooKeeper quorum on + nodes rs{1,2,3,4,5}.example.com, bound to + port 2222 (the default is 2181), ensure + HBASE_MANAGES_ZK is commented out or set to + true in conf/hbase-env.sh + and then edit conf/hbase-site.xml and set + hbase.zookeeper.property.clientPort and + hbase.zookeeper.quorum. You should also set + hbase.zookeeper.property.dataDir to other than + the default as the default has ZooKeeper persist data under + /tmp which is often cleared on system + restart. In the example below we have ZooKeeper persist to + /usr/local/zookeeper. + <configuration> + ... + <property> + <name>hbase.zookeeper.property.clientPort</name> + <value>2222</value> + <description>Property from ZooKeeper's config zoo.cfg. + The port at which the clients will connect. + </description> + </property> + <property> + <name>hbase.zookeeper.quorum</name> + <value>rs1.example.com,rs2.example.com,rs3.example.com,rs4.example.com,rs5.example.com</value> + <description>Comma separated list of servers in the ZooKeeper Quorum. + For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com". + By default this is set to localhost for local and pseudo-distributed modes + of operation. For a fully-distributed setup, this should be set to a full + list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh + this is the list of servers which we will start/stop ZooKeeper on. + </description> + </property> + <property> + <name>hbase.zookeeper.property.dataDir</name> + <value>/usr/local/zookeeper</value> + <description>Property from ZooKeeper's config zoo.cfg. + The directory where the snapshot is stored. + </description> + </property> + ... + </configuration> + +
+ Using existing ZooKeeper ensemble + + To point HBase at an existing ZooKeeper cluster, one that + is not managed by HBase, set HBASE_MANAGES_ZK + in conf/hbase-env.sh to false + + ... + # Tell HBase whether it should manage its own instance of Zookeeper or not. + export HBASE_MANAGES_ZK=false Next set ensemble locations + and client port, if non-standard, in + hbase-site.xml, or add a suitably + configured zoo.cfg to HBase's + CLASSPATH. HBase will prefer the + configuration found in zoo.cfg over any + settings in hbase-site.xml. + + When HBase manages ZooKeeper, it will start/stop the + ZooKeeper servers as a part of the regular start/stop scripts. + If you would like to run ZooKeeper yourself, independent of + HBase start/stop, you would do the following + + +${HBASE_HOME}/bin/hbase-daemons.sh {start,stop} zookeeper + + + Note that you can use HBase in this manner to spin up a + ZooKeeper cluster, unrelated to HBase. Just make sure to set + HBASE_MANAGES_ZK to false + if you want it to stay up across HBase restarts so that when + HBase shuts down, it doesn't take ZooKeeper down with it. + + For more information about running a distinct ZooKeeper + cluster, see the ZooKeeper Getting + Started Guide. Additionally, see the ZooKeeper Wiki or the + ZooKeeper documentation + for more information on ZooKeeper sizing. + +
+ + +
+ SASL Authentication with ZooKeeper + Newer releases of HBase (>= 0.92) will + support connecting to a ZooKeeper Quorum that supports + SASL authentication (which is available in Zookeeper + versions 3.4.0 or later). + + This describes how to set up HBase to mutually + authenticate with a ZooKeeper Quorum. ZooKeeper/HBase + mutual authentication (HBASE-2418) + is required as part of a complete secure HBase configuration + (HBASE-3025). + + For simplicity of explication, this section ignores + additional configuration required (Secure HDFS and Coprocessor + configuration). It's recommended to begin with an + HBase-managed Zookeeper configuration (as opposed to a + standalone Zookeeper quorum) for ease of learning. + + +
Operating System Prerequisites
+ + + You need to have a working Kerberos KDC setup. For + each $HOST that will run a ZooKeeper + server, you should have a principal + zookeeper/$HOST. For each such host, + add a service key (using the kadmin or + kadmin.local tool's ktadd + command) for zookeeper/$HOST and copy + this file to $HOST, and make it + readable only to the user that will run zookeeper on + $HOST. Note the location of this file, + which we will use below as + $PATH_TO_ZOOKEEPER_KEYTAB. + + + + Similarly, for each $HOST that will run + an HBase server (master or regionserver), you should + have a principal: hbase/$HOST. For each + host, add a keytab file called + hbase.keytab containing a service + key for hbase/$HOST, copy this file to + $HOST, and make it readable only to the + user that will run an HBase service on + $HOST. Note the location of this file, + which we will use below as + $PATH_TO_HBASE_KEYTAB. + + + + Each user who will be an HBase client should also be + given a Kerberos principal. This principal should + usually have a password assigned to it (as opposed to, + as with the HBase servers, a keytab file) which only + this user knows. The client's principal's + maxrenewlife should be set so that it can + be renewed enough so that the user can complete their + HBase client processes. For example, if a user runs a + long-running HBase client process that takes at most 3 + days, we might create this user's principal within + kadmin with: addprinc -maxrenewlife + 3days. The Zookeeper client and server + libraries manage their own ticket refreshment by + running threads that wake up periodically to do the + refreshment. + + + On each host that will run an HBase client + (e.g. hbase shell), add the following + file to the HBase home directory's conf + directory: + + + Client { + com.sun.security.auth.module.Krb5LoginModule required + useKeyTab=false + useTicketCache=true; + }; + + + We'll refer to this JAAS configuration file as + $CLIENT_CONF below. +
+ HBase-managed Zookeeper Configuration + + On each node that will run a zookeeper, a + master, or a regionserver, create a JAAS + configuration file in the conf directory of the node's + HBASE_HOME directory that looks like the + following: + + + Server { + com.sun.security.auth.module.Krb5LoginModule required + useKeyTab=true + keyTab="$PATH_TO_ZOOKEEPER_KEYTAB" + storeKey=true + useTicketCache=false + principal="zookeeper/$HOST"; + }; + Client { + com.sun.security.auth.module.Krb5LoginModule required + useKeyTab=true + useTicketCache=false + keyTab="$PATH_TO_HBASE_KEYTAB" + principal="hbase/$HOST"; + }; + + + where the $PATH_TO_HBASE_KEYTAB and + $PATH_TO_ZOOKEEPER_KEYTAB files are what + you created above, and $HOST is the hostname for that + node. + + The Server section will be used by + the Zookeeper quorum server, while the + Client section will be used by the HBase + master and regionservers. The path to this file should + be substituted for the text $HBASE_SERVER_CONF + in the hbase-env.sh + listing below. + + + The path to this file should be substituted for the + text $CLIENT_CONF in the + hbase-env.sh listing below. + + + Modify your hbase-env.sh to include the + following: + + + export HBASE_OPTS="-Djava.security.auth.login.config=$CLIENT_CONF" + export HBASE_MANAGES_ZK=true + export HBASE_ZOOKEEPER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF" + export HBASE_MASTER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF" + export HBASE_REGIONSERVER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF" + + + where $HBASE_SERVER_CONF and + $CLIENT_CONF are the full paths to the + JAAS configuration files created above. + + Modify your hbase-site.xml on each node + that will run zookeeper, master or regionserver to contain: + + + + hbase.zookeeper.quorum + $ZK_NODES + + + hbase.cluster.distributed + true + + + hbase.zookeeper.property.authProvider.1 + org.apache.zookeeper.server.auth.SASLAuthenticationProvider + + + hbase.zookeeper.property.kerberos.removeHostFromPrincipal + true + + + hbase.zookeeper.property.kerberos.removeRealmFromPrincipal + true + + + ]]> + + where $ZK_NODES is the + comma-separated list of hostnames of the Zookeeper + Quorum hosts. + + Start your hbase cluster by running one or more + of the following set of commands on the appropriate + hosts: + + + + bin/hbase zookeeper start + bin/hbase master start + bin/hbase regionserver start + + +
+ +
External Zookeeper Configuration + Add a JAAS configuration file that looks like: + + + Client { + com.sun.security.auth.module.Krb5LoginModule required + useKeyTab=true + useTicketCache=false + keyTab="$PATH_TO_HBASE_KEYTAB" + principal="hbase/$HOST"; + }; + + + where the $PATH_TO_HBASE_KEYTAB is the keytab + created above for HBase services to run on this host, and $HOST is the + hostname for that node. Put this in the HBase home's + configuration directory. We'll refer to this file's + full pathname as $HBASE_SERVER_CONF below. + + Modify your hbase-env.sh to include the following: + + + export HBASE_OPTS="-Djava.security.auth.login.config=$CLIENT_CONF" + export HBASE_MANAGES_ZK=false + export HBASE_MASTER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF" + export HBASE_REGIONSERVER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF" + + + + Modify your hbase-site.xml on each node + that will run a master or regionserver to contain: + + + + hbase.zookeeper.quorum + $ZK_NODES + + + hbase.cluster.distributed + true + + + ]]> + + + where $ZK_NODES is the + comma-separated list of hostnames of the Zookeeper + Quorum hosts. + + + Add a zoo.cfg for each Zookeeper Quorum host containing: + + authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider + kerberos.removeHostFromPrincipal=true + kerberos.removeRealmFromPrincipal=true + + + Also on each of these hosts, create a JAAS configuration file containing: + + + Server { + com.sun.security.auth.module.Krb5LoginModule required + useKeyTab=true + keyTab="$PATH_TO_ZOOKEEPER_KEYTAB" + storeKey=true + useTicketCache=false + principal="zookeeper/$HOST"; + }; + + + where $HOST is the hostname of each + Quorum host. We will refer to the full pathname of + this file as $ZK_SERVER_CONF below. + + + + + Start your Zookeepers on each Zookeeper Quorum host with: + + + SERVER_JVMFLAGS="-Djava.security.auth.login.config=$ZK_SERVER_CONF" bin/zkServer start + + + + + + Start your HBase cluster by running one or more of the following set of commands on the appropriate nodes: + + + + bin/hbase master start + bin/hbase regionserver start + + + +
+ +
+ Zookeeper Server Authentication Log Output + If the configuration above is successful, + you should see something similar to the following in + your Zookeeper server logs: + +11/12/05 22:43:39 INFO zookeeper.Login: successfully logged in. +11/12/05 22:43:39 INFO server.NIOServerCnxnFactory: binding to port 0.0.0.0/0.0.0.0:2181 +11/12/05 22:43:39 INFO zookeeper.Login: TGT refresh thread started. +11/12/05 22:43:39 INFO zookeeper.Login: TGT valid starting at: Mon Dec 05 22:43:39 UTC 2011 +11/12/05 22:43:39 INFO zookeeper.Login: TGT expires: Tue Dec 06 22:43:39 UTC 2011 +11/12/05 22:43:39 INFO zookeeper.Login: TGT refresh sleeping until: Tue Dec 06 18:36:42 UTC 2011 +.. +11/12/05 22:43:59 INFO auth.SaslServerCallbackHandler: + Successfully authenticated client: authenticationID=hbase/ip-10-166-175-249.us-west-1.compute.internal@HADOOP.LOCALDOMAIN; + authorizationID=hbase/ip-10-166-175-249.us-west-1.compute.internal@HADOOP.LOCALDOMAIN. +11/12/05 22:43:59 INFO auth.SaslServerCallbackHandler: Setting authorizedID: hbase +11/12/05 22:43:59 INFO server.ZooKeeperServer: adding SASL authorization for authorizationID: hbase + + + + +
+ +
+ Zookeeper Client Authentication Log Output + On the Zookeeper client side (HBase master or regionserver), + you should see something similar to the following: + + +11/12/05 22:43:59 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=ip-10-166-175-249.us-west-1.compute.internal:2181 sessionTimeout=180000 watcher=master:60000 +11/12/05 22:43:59 INFO zookeeper.ClientCnxn: Opening socket connection to server /10.166.175.249:2181 +11/12/05 22:43:59 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 14851@ip-10-166-175-249 +11/12/05 22:43:59 INFO zookeeper.Login: successfully logged in. +11/12/05 22:43:59 INFO client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism. +11/12/05 22:43:59 INFO zookeeper.Login: TGT refresh thread started. +11/12/05 22:43:59 INFO zookeeper.ClientCnxn: Socket connection established to ip-10-166-175-249.us-west-1.compute.internal/10.166.175.249:2181, initiating session +11/12/05 22:43:59 INFO zookeeper.Login: TGT valid starting at: Mon Dec 05 22:43:59 UTC 2011 +11/12/05 22:43:59 INFO zookeeper.Login: TGT expires: Tue Dec 06 22:43:59 UTC 2011 +11/12/05 22:43:59 INFO zookeeper.Login: TGT refresh sleeping until: Tue Dec 06 18:30:37 UTC 2011 +11/12/05 22:43:59 INFO zookeeper.ClientCnxn: Session establishment complete on server ip-10-166-175-249.us-west-1.compute.internal/10.166.175.249:2181, sessionid = 0x134106594320000, negotiated timeout = 180000 + + +
+ +
+ Configuration from Scratch + + This has been tested on the current standard Amazon + Linux AMI. First setup KDC and principals as + described above. Next checkout code and run a sanity + check. + + + git clone git://git.apache.org/hbase.git + cd hbase + mvn -PlocalTests clean test -Dtest=TestZooKeeperACL + + + Then configure HBase as described above. + Manually edit target/cached_classpath.txt (see below).. + + + bin/hbase zookeeper & + bin/hbase master & + bin/hbase regionserver & + +
+ + +
+ Future improvements + +
Fix target/cached_classpath.txt + + You must override the standard hadoop-core jar file from the + target/cached_classpath.txt + file with the version containing the HADOOP-7070 fix. You can use the following script to do this: + + + echo `find ~/.m2 -name "*hadoop-core*7070*SNAPSHOT.jar"` ':' `cat target/cached_classpath.txt` | sed 's/ //g' > target/tmp.txt + mv target/tmp.txt target/cached_classpath.txt + + + + +
+ +
+ Set JAAS configuration + programmatically + + + This would avoid the need for a separate Hadoop jar + that fixes HADOOP-7070. +
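[Editor's note: as a purely illustrative sketch of what such an improvement could look like (placeholder class and values, not HBase code), a Client login context can be installed through the standard javax.security.auth.login.Configuration API instead of pointing -Djava.security.auth.login.config at a file:]

    import java.util.HashMap;
    import java.util.Map;
    import javax.security.auth.login.AppConfigurationEntry;
    import javax.security.auth.login.Configuration;

    public class ProgrammaticJaas {
      public static void install(final String keytab, final String principal) {
        Configuration.setConfiguration(new Configuration() {
          public AppConfigurationEntry[] getAppConfigurationEntry(String name) {
            if (!"Client".equals(name)) {
              return null;
            }
            Map<String, String> options = new HashMap<String, String>();
            options.put("useKeyTab", "true");
            options.put("useTicketCache", "false");
            options.put("storeKey", "true");
            options.put("keyTab", keytab);        // e.g. $PATH_TO_HBASE_KEYTAB
            options.put("principal", principal);  // e.g. hbase/$HOST
            return new AppConfigurationEntry[] {
              new AppConfigurationEntry(
                  "com.sun.security.auth.module.Krb5LoginModule",
                  AppConfigurationEntry.LoginModuleControlFlag.REQUIRED,
                  options)
            };
          }
          public void refresh() {
            // Nothing to refresh; entries are built in memory.
          }
        });
      }
    }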
+ +
+ Elimination of + <code>kerberos.removeHostFromPrincipal</code> and + <code>kerberos.removeRealmFromPrincipal</code> +
+ +
+ + +
+ + + + +