hbase-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mi...@apache.org
Subject [28/31] hbase git commit: Changed org structure, tried in vain to keep generated classes from being picked up by Javadoc
Date Fri, 09 Jan 2015 09:35:11 GMT
http://git-wip-us.apache.org/repos/asf/hbase/blob/c0bcf7cc/src/main/asciidoc/chapters/architecture.adoc
----------------------------------------------------------------------
diff --git a/src/main/asciidoc/chapters/architecture.adoc b/src/main/asciidoc/chapters/architecture.adoc
new file mode 100644
index 0000000..3b04f28
--- /dev/null
+++ b/src/main/asciidoc/chapters/architecture.adoc
@@ -0,0 +1,2523 @@
+////
+/**
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+////
+
+= Architecture
+:doctype: book
+:numbered:
+:toc: left
+:icons: font
+:experimental:
+:toc: left
+:source-language: java
+:docinfo1: 
+
+[[arch.overview]]
+== Overview
+
+[[arch.overview.nosql]]
+=== NoSQL?
+
+HBase is a type of "NoSQL" database.
+"NoSQL" is a general term meaning that the database isn't an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL databases:  BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database.
+Technically speaking, HBase is really more a "Data Store" than "Data Base" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages, etc. 
+
+However, HBase has many features which supports both linear and modular scaling.
+HBase clusters expand by adding RegionServers that are hosted on commodity class servers.
+If a cluster expands from 10 to 20 RegionServers, for example, it doubles both in terms of storage and as well as processing capacity.
+RDBMS can scale well, but only up to a point - specifically, the size of a single database server - and for the best performance requires specialized hardware and storage devices.
+HBase features of note are: 
+
+* Strongly consistent reads/writes:  HBase is not an "eventually consistent" DataStore.
+  This makes it very suitable for tasks such as high-speed counter aggregation.
+* Automatic sharding:  HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.
+* Automatic RegionServer failover
+* Hadoop/HDFS Integration:  HBase supports HDFS out of the box as its distributed file system.
+* MapReduce:  HBase supports massively parallelized processing via MapReduce for using HBase as both source and sink.
+* Java Client API:  HBase supports an easy to use Java API for programmatic access.
+* Thrift/REST API:  HBase also supports Thrift and REST for non-Java front-ends.
+* Block Cache and Bloom Filters:  HBase supports a Block Cache and Bloom Filters for high volume query optimization.
+* Operational Management:  HBase provides build-in web-pages for operational insight as well as JMX metrics.	  
+
+[[arch.overview.when]]
+=== When Should I Use HBase?
+
+HBase isn't suitable for every problem.
+
+First, make sure you have enough data.
+If you have hundreds of millions or billions of rows, then HBase is a good candidate.
+If you only have a few thousand/million rows, then using a traditional RDBMS might be a better choice due to the fact that all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle. 
+
+Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.)  An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example.
+Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port. 
+
+Third, make sure you have enough hardware.
+Even HDFS doesn't do well with anything less than 5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode. 
+
+HBase can run quite well stand-alone on a laptop - but this should be considered a development configuration only. 
+
+[[arch.overview.hbasehdfs]]
+=== What Is The Difference Between HBase and Hadoop/HDFS?
+
+link:http://hadoop.apache.org/hdfs/[HDFS] is a distributed file system that is well suited for the storage of large files.
+Its documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files.
+HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables.
+This can sometimes be a point of conceptual confusion.
+HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups.
+See the <<datamodel,datamodel>> and the rest of this chapter for more information on how HBase achieves its goals. 
+
+[[arch.catalog]]
+== Catalog Tables
+
+The catalog table [code]+hbase:meta+ exists as an HBase table and is filtered out of the HBase shell's [code]+list+ command, but is in fact a table just like any other. 
+
+[[arch.catalog.root]]
+=== -ROOT-
+
+NOTE: The [code]+-ROOT-+ table was removed in HBase 0.96.0.
+Information here should be considered historical.
+
+The [code]+-ROOT-+ table kept track of the location of the [code]+.META+ table (the previous name for the table now called [code]+hbase:meta+) prior to HBase 0.96.
+The [code]+-ROOT-+ table structure was as follows: 
+
+* .Key.META.
+  region key ([code]+.META.,,1+)
+
+* .Values[code]+info:regioninfo+ (serialized link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html[HRegionInfo]              instance of hbase:meta)
+* [code]+info:server+ (server:port of the RegionServer holding hbase:meta)
+* [code]+info:serverstartcode+ (start-time of the RegionServer process holding hbase:meta)
+
+[[arch.catalog.meta]]
+=== hbase:meta
+
+The [code]+hbase:meta+ table (previously called [code]+.META.+) keeps a list of all regions in the system.
+The location of [code]+hbase:meta+ was previously tracked within the [code]+-ROOT-+ table, but is now stored in Zookeeper.
+
+The [code]+hbase:meta+ table structure is as follows: 
+
+* .KeyRegion key of the format ([code]+[table],[region start key],[region
+  id]+)
+
+* .Values[code]+info:regioninfo+ (serialized link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html[
+  HRegionInfo] instance for this region)
+* [code]+info:server+ (server:port of the RegionServer containing this region)
+* [code]+info:serverstartcode+ (start-time of the RegionServer process containing this region)
+
+When a table is in the process of splitting, two other columns will be created, called [code]+info:splitA+ and [code]+info:splitB+.
+These columns represent the two daughter regions.
+The values for these columns are also serialized HRegionInfo instances.
+After the region has been split, eventually this row will be deleted. 
+
+.Note on HRegionInfo
+[NOTE]
+====
+The empty key is used to denote table start and table end.
+A region with an empty start key is the first region in a table.
+If a region has both an empty start and an empty end key, it is the only region in the table 
+====
+
+In the (hopefully unlikely) event that programmatic processing of catalog metadata is required, see the link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Writables.html#getHRegionInfo%28byte[]%29[Writables]          utility. 
+
+[[arch.catalog.startup]]
+=== Startup Sequencing
+
+First, the location of [code]+hbase:meta+ is looked up in Zookeeper.
+Next, [code]+hbase:meta+ is updated with server and startcode values.
+
+For information on region-RegionServer assignment, see <<regions.arch.assignment,regions.arch.assignment>>. 
+
+[[architecture.client]]
+== Client
+
+The HBase client finds the RegionServers that are serving the particular row range of interest.
+It does this by querying the [code]+hbase:meta+ table.
+See <<arch.catalog.meta,arch.catalog.meta>> for details.
+After locating the required region(s), the client contacts the RegionServer serving that region, rather than going through the master, and issues the read or write request.
+This information is cached in the client so that subsequent requests need not go through the lookup process.
+Should a region be reassigned either by the master load balancer or because a RegionServer has died, the client will requery the catalog tables to determine the new location of the user region. 
+
+See <<master.runtime,master.runtime>> for more information about the impact of the Master on HBase Client communication. 
+
+Administrative functions are done via an instance of link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html[Admin]      
+
+[[client.connections]]
+=== Cluster Connections
+
+The API changed in HBase 1.0.
+Its been cleaned up and users are returned Interfaces to work against rather than particular types.
+In HBase 1.0, obtain a cluster Connection from ConnectionFactory and thereafter, get from it instances of Table, Admin, and RegionLocator on an as-need basis.
+When done, close obtained instances.
+Finally, be sure to cleanup your Connection instance before exiting.
+Connections are heavyweight objects.
+Create once and keep an instance around.
+Table, Admin and RegionLocator instances are lightweight.
+Create as you go and then let go as soon as you are done by closing them.
+See the link:/Users/stack/checkouts/hbase.git/target/site/apidocs/org/apache/hadoop/hbase/client/package-summary.html[Client Package Javadoc Description] for example usage of the new HBase 1.0 API.
+
+For connection configuration information, see <<client_dependencies,client dependencies>>. 
+
+_link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html[Table]
+            instances are not thread-safe_.
+Only one thread can use an instance of Table at any given time.
+When creating Table instances, it is advisable to use the same link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HBaseConfiguration[HBaseConfiguration]          instance.
+This will ensure sharing of ZooKeeper and socket instances to the RegionServers which is usually what you want.
+For example, this is preferred:
+
+[source,java]
+----
+HBaseConfiguration conf = HBaseConfiguration.create();
+HTable table1 = new HTable(conf, "myTable");
+HTable table2 = new HTable(conf, "myTable");
+----
+
+as opposed to this:
+
+[source,java]
+----
+HBaseConfiguration conf1 = HBaseConfiguration.create();
+HTable table1 = new HTable(conf1, "myTable");
+HBaseConfiguration conf2 = HBaseConfiguration.create();
+HTable table2 = new HTable(conf2, "myTable");
+----
+
+For more information about how connections are handled in the HBase client, see link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html[HConnectionManager]. 
+
+[[client.connection.pooling]]
+==== Connection Pooling
+
+For applications which require high-end multithreaded access (e.g., web-servers or application servers that may serve many application threads in a single JVM), you can pre-create an [class]+HConnection+, as shown in the following example:
+
+.Pre-Creating a [code]+HConnection+
+====
+[source,java]
+----
+// Create a connection to the cluster.
+HConnection connection = HConnectionManager.createConnection(Configuration);
+HTableInterface table = connection.getTable("myTable");
+// use table as needed, the table returned is lightweight
+table.close();
+// use the connection for other access to the cluster
+connection.close();
+----
+====
+
+Constructing HTableInterface implementation is very lightweight and resources are controlled.
+
+.[code]+HTablePool+ is Deprecated
+[WARNING]
+====
+Previous versions of this guide discussed [code]+HTablePool+, which was deprecated in HBase 0.94, 0.95, and 0.96, and removed in 0.98.1, by link:https://issues.apache.org/jira/browse/HBASE-6580[HBASE-6500].
+Please use link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnection.html[HConnection] instead.
+====
+
+[[client.writebuffer]]
+=== WriteBuffer and Batch Methods
+
+If <<perf.hbase.client.autoflush,perf.hbase.client.autoflush>> is turned off on link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html[HTable], [class]+Put+s are sent to RegionServers when the writebuffer is filled.
+The writebuffer is 2MB by default.
+Before an HTable instance is discarded, either [method]+close()+ or [method]+flushCommits()+ should be invoked so Puts will not be lost. 
+
+Note: [code]+htable.delete(Delete);+ does not go in the writebuffer!  This only applies to Puts. 
+
+For additional information on write durability, review the link:../acid-semantics.html[ACID semantics] page. 
+
+For fine-grained control of batching of [class]+Put+s or [class]+Delete+s, see the link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#batch%28java.util.List%29[batch] methods on HTable. 
+
+[[client.external]]
+=== External Clients
+
+Information on non-Java clients and custom protocols is covered in <<external_apis,external apis>>           
+
+[[client.filter]]
+== Client Request Filters
+
+link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html[Get] and link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scan] instances can be optionally configured with link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/Filter.html[filters] which are applied on the RegionServer. 
+
+Filters can be confusing because there are many different types, and it is best to approach them by understanding the groups of Filter functionality. 
+
+[[client.filter.structural]]
+=== Structural
+
+Structural Filters contain other Filters.
+
+[[client.filter.structural.fl]]
+==== FilterList
+
+link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FilterList.html[FilterList]          represents a list of Filters with a relationship of [code]+FilterList.Operator.MUST_PASS_ALL+ or [code]+FilterList.Operator.MUST_PASS_ONE+ between the Filters.
+The following example shows an 'or' between two Filters (checking for either 'my value' or 'my other value' on the same attribute).
+
+[source,java]
+----
+
+FilterList list = new FilterList(FilterList.Operator.MUST_PASS_ONE);
+SingleColumnValueFilter filter1 = new SingleColumnValueFilter(
+	cf,
+	column,
+	CompareOp.EQUAL,
+	Bytes.toBytes("my value")
+	);
+list.add(filter1);
+SingleColumnValueFilter filter2 = new SingleColumnValueFilter(
+	cf,
+	column,
+	CompareOp.EQUAL,
+	Bytes.toBytes("my other value")
+	);
+list.add(filter2);
+scan.setFilter(list);
+----
+
+[[client.filter.cv]]
+=== Column Value
+
+[[client.filter.cv.scvf]]
+==== SingleColumnValueFilter
+
+link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.html[SingleColumnValueFilter]            can be used to test column values for equivalence ([code]+CompareOp.EQUAL
+            +), inequality ([code]+CompareOp.NOT_EQUAL+), or ranges (e.g., [code]+CompareOp.GREATER+). The following is example of testing equivalence a column to a String value "my value"...
+
+[source,java]
+----
+
+SingleColumnValueFilter filter = new SingleColumnValueFilter(
+	cf,
+	column,
+	CompareOp.EQUAL,
+	Bytes.toBytes("my value")
+	);
+scan.setFilter(filter);
+----
+
+[[client.filter.cvp]]
+=== Column Value Comparators
+
+There are several Comparator classes in the Filter package that deserve special mention.
+These Comparators are used in concert with other Filters, such as <<client.filter.cv.scvf,client.filter.cv.scvf>>. 
+
+[[client.filter.cvp.rcs]]
+==== RegexStringComparator
+
+link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/RegexStringComparator.html[RegexStringComparator]            supports regular expressions for value comparisons.
+
+[source,java]
+----
+
+RegexStringComparator comp = new RegexStringComparator("my.");   // any value that starts with 'my'
+SingleColumnValueFilter filter = new SingleColumnValueFilter(
+	cf,
+	column,
+	CompareOp.EQUAL,
+	comp
+	);
+scan.setFilter(filter);
+----
+
+See the Oracle JavaDoc for link:http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html[supported
+              RegEx patterns in Java]. 
+
+[[client.filter.cvp.substringcomparator]]
+==== SubstringComparator
+
+link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/SubstringComparator.html[SubstringComparator]            can be used to determine if a given substring exists in a value.
+The comparison is case-insensitive. 
+
+[source,java]
+----
+
+SubstringComparator comp = new SubstringComparator("y val");   // looking for 'my value'
+SingleColumnValueFilter filter = new SingleColumnValueFilter(
+	cf,
+	column,
+	CompareOp.EQUAL,
+	comp
+	);
+scan.setFilter(filter);
+----
+
+[[client.filter.cvp.bfp]]
+==== BinaryPrefixComparator
+
+See link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/BinaryPrefixComparator.html[BinaryPrefixComparator].
+
+[[client.filter.cvp.bc]]
+==== BinaryComparator
+
+See link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/BinaryComparator.html[BinaryComparator].
+
+[[client.filter.kvm]]
+=== KeyValue Metadata
+
+As HBase stores data internally as KeyValue pairs, KeyValue Metadata Filters evaluate the existence of keys (i.e., ColumnFamily:Column qualifiers) for a row, as opposed to values the previous section. 
+
+[[client.filter.kvm.ff]]
+==== FamilyFilter
+
+link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FamilyFilter.html[FamilyFilter]            can be used to filter on the ColumnFamily.
+It is generally a better idea to select ColumnFamilies in the Scan than to do it with a Filter.
+
+[[client.filter.kvm.qf]]
+==== QualifierFilter
+
+link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/QualifierFilter.html[QualifierFilter]            can be used to filter based on Column (aka Qualifier) name. 
+
+[[client.filter.kvm.cpf]]
+==== ColumnPrefixFilter
+
+link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnPrefixFilter.html[ColumnPrefixFilter]            can be used to filter based on the lead portion of Column (aka Qualifier) names. 
+
+A ColumnPrefixFilter seeks ahead to the first column matching the prefix in each row and for each involved column family.
+It can be used to efficiently get a subset of the columns in very wide rows. 
+
+Note: The same column qualifier can be used in different column families.
+This filter returns all matching columns. 
+
+Example: Find all columns in a row and family that start with "abc"
+
+[source,java]
+----
+
+HTableInterface t = ...;
+byte[] row = ...;
+byte[] family = ...;
+byte[] prefix = Bytes.toBytes("abc");
+Scan scan = new Scan(row, row); // (optional) limit to one row
+scan.addFamily(family); // (optional) limit to one family
+Filter f = new ColumnPrefixFilter(prefix);
+scan.setFilter(f);
+scan.setBatch(10); // set this if there could be many columns returned
+ResultScanner rs = t.getScanner(scan);
+for (Result r = rs.next(); r != null; r = rs.next()) {
+  for (KeyValue kv : r.raw()) {
+    // each kv represents a column
+  }
+}
+rs.close();
+----
+
+[[client.filter.kvm.mcpf]]
+==== MultipleColumnPrefixFilter
+
+link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/MultipleColumnPrefixFilter.html[MultipleColumnPrefixFilter]            behaves like ColumnPrefixFilter but allows specifying multiple prefixes. 
+
+Like ColumnPrefixFilter, MultipleColumnPrefixFilter efficiently seeks ahead to the first column matching the lowest prefix and also seeks past ranges of columns between prefixes.
+It can be used to efficiently get discontinuous sets of columns from very wide rows. 
+
+Example: Find all columns in a row and family that start with "abc" or "xyz"
+
+[source,java]
+----
+HTableInterface t = ...;
+byte[] row = ...;
+byte[] family = ...;
+byte[][] prefixes = new byte[][] {Bytes.toBytes("abc"), Bytes.toBytes("xyz")};
+Scan scan = new Scan(row, row); // (optional) limit to one row
+scan.addFamily(family); // (optional) limit to one family
+Filter f = new MultipleColumnPrefixFilter(prefixes);
+scan.setFilter(f);
+scan.setBatch(10); // set this if there could be many columns returned
+ResultScanner rs = t.getScanner(scan);
+for (Result r = rs.next(); r != null; r = rs.next()) {
+  for (KeyValue kv : r.raw()) {
+    // each kv represents a column
+  }
+}
+rs.close();
+----
+
+[[client.filter.kvm.crf]]
+==== ColumnRangeFilter
+
+A link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnRangeFilter.html[ColumnRangeFilter] allows efficient intra row scanning. 
+
+A ColumnRangeFilter can seek ahead to the first matching column for each involved column family.
+It can be used to efficiently get a 'slice' of the columns of a very wide row.
+i.e.
+you have a million columns in a row but you only want to look at columns bbbb-bbdd. 
+
+Note: The same column qualifier can be used in different column families.
+This filter returns all matching columns. 
+
+Example: Find all columns in a row and family between "bbbb" (inclusive) and "bbdd" (inclusive)
+
+[source,java]
+----
+
+HTableInterface t = ...;
+byte[] row = ...;
+byte[] family = ...;
+byte[] startColumn = Bytes.toBytes("bbbb");
+byte[] endColumn = Bytes.toBytes("bbdd");
+Scan scan = new Scan(row, row); // (optional) limit to one row
+scan.addFamily(family); // (optional) limit to one family
+Filter f = new ColumnRangeFilter(startColumn, true, endColumn, true);
+scan.setFilter(f);
+scan.setBatch(10); // set this if there could be many columns returned
+ResultScanner rs = t.getScanner(scan);
+for (Result r = rs.next(); r != null; r = rs.next()) {
+  for (KeyValue kv : r.raw()) {
+    // each kv represents a column
+  }
+}
+rs.close();
+----
+
+Note:  Introduced in HBase 0.92
+
+[[client.filter.row]]
+=== RowKey
+
+[[client.filter.row.rf]]
+==== RowFilter
+
+It is generally a better idea to use the startRow/stopRow methods on Scan for row selection, however link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/RowFilter.html[RowFilter] can also be used.
+
+[[client.filter.utility]]
+=== Utility
+
+[[client.filter.utility.fkof]]
+==== FirstKeyOnlyFilter
+
+This is primarily used for rowcount jobs.
+See link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FirstKeyOnlyFilter.html[FirstKeyOnlyFilter].
+
+== Master
+
+[code]+HMaster+ is the implementation of the Master Server.
+The Master server is responsible for monitoring all RegionServer instances in the cluster, and is the interface for all metadata changes.
+In a distributed cluster, the Master typically runs on the <<arch.hdfs.nn,arch.hdfs.nn>>.
+J Mohamed Zahoor goes into some more detail on the Master Architecture in this blog posting, link:http://blog.zahoor.in/2012/08/hbase-hmaster-architecture/[HBase HMaster
+          Architecture ].
+
+[[master.startup]]
+=== Startup Behavior
+
+If run in a multi-Master environment, all Masters compete to run the cluster.
+If the active Master loses its lease in ZooKeeper (or the Master shuts down), then then the remaining Masters jostle to take over the Master role. 
+
+[[master.runtime]]
+=== Runtime Impact
+
+A common dist-list question involves what happens to an HBase cluster when the Master goes down.
+Because the HBase client talks directly to the RegionServers, the cluster can still function in a "steady state." Additionally, per <<arch.catalog,arch.catalog>>, [code]+hbase:meta+ exists as an HBase table and is not resident in the Master.
+However, the Master controls critical functions such as RegionServer failover and completing region splits.
+So while the cluster can still run for a short time without the Master, the Master should be restarted as soon as possible. 
+
+[[master.api]]
+=== Interface
+
+The methods exposed by [code]+HMasterInterface+ are primarily metadata-oriented methods: 
+
+* Table (createTable, modifyTable, removeTable, enable, disable) 
+* ColumnFamily (addColumn, modifyColumn, removeColumn) 
+* Region (move, assign, unassign)          For example, when the [code]+HBaseAdmin+ method [code]+disableTable+ is invoked, it is serviced by the Master server. 
+
+[[master.processes]]
+=== Processes
+
+The Master runs several background threads: 
+
+[[master.processes.loadbalancer]]
+==== LoadBalancer
+
+Periodically, and when there are no regions in transition, a load balancer will run and move regions around to balance the cluster's load.
+See <<balancer_config,balancer config>> for configuring this property.
+
+See <<regions.arch.assignment,regions.arch.assignment>> for more information on region assignment. 
+
+[[master.processes.catalog]]
+==== CatalogJanitor
+
+Periodically checks and cleans up the hbase:meta table.
+See <<arch.catalog.meta,arch.catalog.meta>> for more information on META.
+
+[[regionserver.arch]]
+== RegionServer
+
+[code]+HRegionServer+ is the RegionServer implementation.
+It is responsible for serving and managing regions.
+In a distributed cluster, a RegionServer runs on a <<arch.hdfs.dn,arch.hdfs.dn>>. 
+
+[[regionserver.arch.api]]
+=== Interface
+
+The methods exposed by [code]+HRegionRegionInterface+ contain both data-oriented and region-maintenance methods: 
+
+* Data (get, put, delete, next, etc.)
+* Region (splitRegion, compactRegion, etc.) For example, when the [code]+HBaseAdmin+ method [code]+majorCompact+ is invoked on a table, the client is actually iterating through all regions for the specified table and requesting a major compaction directly to each region. 
+
+[[regionserver.arch.processes]]
+=== Processes
+
+The RegionServer runs a variety of background threads:
+
+[[regionserver.arch.processes.compactsplit]]
+==== CompactSplitThread
+
+Checks for splits and handle minor compactions.
+
+[[regionserver.arch.processes.majorcompact]]
+==== MajorCompactionChecker
+
+Checks for major compactions.
+
+[[regionserver.arch.processes.memstore]]
+==== MemStoreFlusher
+
+Periodically flushes in-memory writes in the MemStore to StoreFiles.
+
+[[regionserver.arch.processes.log]]
+==== LogRoller
+
+Periodically checks the RegionServer's WAL.
+
+=== Coprocessors
+
+Coprocessors were added in 0.92.
+There is a thorough link:https://blogs.apache.org/hbase/entry/coprocessor_introduction[Blog Overview
+            of CoProcessors] posted.
+Documentation will eventually move to this reference guide, but the blog is the most current information available at this time. 
+
+[[block.cache]]
+=== Block Cache
+
+HBase provides two different BlockCache implementations: the default onheap LruBlockCache and BucketCache, which is (usually) offheap.
+This section discusses benefits and drawbacks of each implementation, how to choose the appropriate option, and configuration options for each.
+
+.Block Cache Reporting: UI
+[NOTE]
+====
+See the RegionServer UI for detail on caching deploy.
+Since HBase-0.98.4, the Block Cache detail has been significantly extended showing configurations, sizings, current usage, time-in-the-cache, and even detail on block counts and types.
+====
+
+==== Cache Choices
+
+[class]+LruBlockCache+ is the original implementation, and is entirely within the Java heap. [class]+BucketCache+ is mainly intended for keeping blockcache data offheap, although BucketCache can also keep data onheap and serve from a file-backed cache. 
+
+.BucketCache is production ready as of hbase-0.98.6
+[NOTE]
+====
+To run with BucketCache, you need HBASE-11678.
+This was included in hbase-0.98.6. 
+====          
+
+Fetching will always be slower when fetching from BucketCache, as compared to the native onheap LruBlockCache.
+However, latencies tend to be less erratic across time, because there is less garbage collection when you use BucketCache since it is managing BlockCache allocations, not the GC.
+If the BucketCache is deployed in offheap mode, this memory is not managed by the GC at all.
+This is why you'd use BucketCache, so your latencies are less erratic and to mitigate GCs and heap fragmentation.
+See Nick Dimiduk's link:http://www.n10k.com/blog/blockcache-101/[BlockCache 101] for comparisons running onheap vs offheap tests.
+Also see link:http://people.apache.org/~stack/bc/[Comparing BlockCache Deploys]            which finds that if your dataset fits inside your LruBlockCache deploy, use it otherwise if you are experiencing cache churn (or you want your cache to exist beyond the vagaries of java GC), use BucketCache. 
+
+When you enable BucketCache, you are enabling a two tier caching system, an L1 cache which is implemented by an instance of LruBlockCache and an offheap L2 cache which is implemented by BucketCache.
+Management of these two tiers and the policy that dictates how blocks move between them is done by [class]+CombinedBlockCache+.
+It keeps all DATA blocks in the L2 BucketCache and meta blocks -- INDEX and BLOOM blocks -- onheap in the L1 [class]+LruBlockCache+.
+See <<offheap.blockcache,offheap.blockcache>> for more detail on going offheap.
+
+[[cache.configurations]]
+==== General Cache Configurations
+
+Apart from the cache implementation itself, you can set some general configuration options to control how the cache performs.
+See link:http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html.
+After setting any of these options, restart or rolling restart your cluster for the configuration to take effect.
+Check logs for errors or unexpected behavior.
+
+See also <<blockcache.prefetch,blockcache.prefetch>>, which discusses a new option introduced in link:https://issues.apache.org/jira/browse/HBASE-9857[HBASE-9857].
+
+[[block.cache.design]]
+==== LruBlockCache Design
+
+The LruBlockCache is an LRU cache that contains three levels of block priority to allow for scan-resistance and in-memory ColumnFamilies: 
+
+* Single access priority: The first time a block is loaded from HDFS it normally has this priority and it will be part of the first group to be considered during evictions.
+  The advantage is that scanned blocks are more likely to get evicted than blocks that are getting more usage.
+* Mutli access priority: If a block in the previous priority group is accessed again, it upgrades to this priority.
+  It is thus part of the second group considered during evictions.
+* In-memory access priority: If the block's family was configured to be "in-memory", it will be part of this priority disregarding the number of times it was accessed.
+  Catalog tables are configured like this.
+  This group is the last one considered during evictions.
++
+To mark a column family as in-memory, call 
+
+[source,java]
+----
+HColumnDescriptor.setInMemory(true);
+---- 
+
+if creating a table from java, or set +IN_MEMORY => true+ when creating or altering a table in the shell: e.g.
+ 
+[source]
+----
+hbase(main):003:0> create  't', {NAME => 'f', IN_MEMORY => 'true'}
+----
+
+For more information, see the link:http://hbase.apache.org/xref/org/apache/hadoop/hbase/io/hfile/LruBlockCache.html[LruBlockCache
+              source]          
+
+[[block.cache.usage]]
+==== LruBlockCache Usage
+
+Block caching is enabled by default for all the user tables which means that any read operation will load the LRU cache.
+This might be good for a large number of use cases, but further tunings are usually required in order to achieve better performance.
+An important concept is the link:http://en.wikipedia.org/wiki/Working_set_size[working set size], or WSS, which is: "the amount of memory needed to compute the answer to a problem". For a website, this would be the data that's needed to answer the queries over a short amount of time. 
+
+The way to calculate how much memory is available in HBase for caching is: 
+
+[source]
+----
+
+            number of region servers * heap size * hfile.block.cache.size * 0.99
+----
+
+The default value for the block cache is 0.25 which represents 25% of the available heap.
+The last value (99%) is the default acceptable loading factor in the LRU cache after which eviction is started.
+The reason it is included in this equation is that it would be unrealistic to say that it is possible to use 100% of the available memory since this would make the process blocking from the point where it loads new blocks.
+Here are some examples: 
+
+* One region server with the default heap size (1 GB) and the default block cache size will have 253 MB of block cache available.
+* 20 region servers with the heap size set to 8 GB and a default block cache size will have 39.6 of block cache.
+* 100 region servers with the heap size set to 24 GB and a block cache size of 0.5 will have about 1.16 TB of block cache.
+
+Your data is not the only resident of the block cache.
+Here are others that you may have to take into account: 
+
+Catalog Tables::
+  The [code]+-ROOT-+ (prior to HBase 0.96.
+  See <<arch.catalog.root,arch.catalog.root>>) and [code]+hbase:meta+ tables are forced into the block cache and have the in-memory priority which means that they are harder to evict.
+  The former never uses more than a few hundreds of bytes while the latter can occupy a few MBs (depending on the number of regions).
+
+HFiles Indexes::
+  An [firstterm]_hfile_ is the file format that HBase uses to store data in HDFS.
+  It contains a multi-layered index which allows HBase to seek to the data without having to read the whole file.
+  The size of those indexes is a factor of the block size (64KB by default), the size of your keys and the amount of data you are storing.
+  For big data sets it's not unusual to see numbers around 1GB per region server, although not all of it will be in cache because the LRU will evict indexes that aren't used.
+
+Keys::
+  The values that are stored are only half the picture, since each value is stored along with its keys (row key, family qualifier, and timestamp). See <<keysize,keysize>>.
+
+Bloom Filters::
+  Just like the HFile indexes, those data structures (when enabled) are stored in the LRU.
+
+Currently the recommended way to measure HFile indexes and bloom filters sizes is to look at the region server web UI and checkout the relevant metrics.
+For keys, sampling can be done by using the HFile command line tool and look for the average key size metric.
+Since HBase 0.98.3, you can view detail on BlockCache stats and metrics in a special Block Cache section in the UI.
+
+It's generally bad to use block caching when the WSS doesn't fit in memory.
+This is the case when you have for example 40GB available across all your region servers' block caches but you need to process 1TB of data.
+One of the reasons is that the churn generated by the evictions will trigger more garbage collections unnecessarily.
+Here are two use cases: 
+
+* Fully random reading pattern: This is a case where you almost never access the same row twice within a short amount of time such that the chance of hitting a cached block is close to 0.
+  Setting block caching on such a table is a waste of memory and CPU cycles, more so that it will generate more garbage to pick up by the JVM.
+  For more information on monitoring GC, see <<trouble.log.gc,trouble.log.gc>>.
+* Mapping a table: In a typical MapReduce job that takes a table in input, every row will be read only once so there's no need to put them into the block cache.
+  The Scan object has the option of turning this off via the setCaching method (set it to false). You can still keep block caching turned on on this table if you need fast random read access.
+  An example would be counting the number of rows in a table that serves live traffic, caching every block of that table would create massive churn and would surely evict data that's currently in use. 
+
+[[data.blocks.in.fscache]]
+===== Caching META blocks only (DATA blocks in fscache)
+
+An interesting setup is one where we cache META blocks only and we read DATA blocks in on each access.
+If the DATA blocks fit inside fscache, this alternative may make sense when access is completely random across a very large dataset.
+To enable this setup, alter your table and for each column family set [var]+BLOCKCACHE => 'false'+.
+You are 'disabling' the BlockCache for this column family only you can never disable the caching of META blocks.
+Since link:https://issues.apache.org/jira/browse/HBASE-4683[HBASE-4683 Always cache index and bloom blocks], we will cache META blocks even if the BlockCache is disabled. 
+
+[[offheap.blockcache]]
+==== Offheap Block Cache
+
+[[enable.bucketcache]]
+===== How to Enable BucketCache
+
+The usual deploy of BucketCache is via a managing class that sets up two caching tiers: an L1 onheap cache implemented by LruBlockCache and a second L2 cache implemented with BucketCache.
+The managing class is link:http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CombinedBlockCache.html[CombinedBlockCache] by default.
+The just-previous link describes the caching 'policy' implemented by CombinedBlockCache.
+In short, it works by keeping meta blocks -- INDEX and BLOOM in the L1, onheap LruBlockCache tier -- and DATA blocks are kept in the L2, BucketCache tier.
+It is possible to amend this behavior in HBase since version 1.0 and ask that a column family have both its meta and DATA blocks hosted onheap in the L1 tier by setting [var]+cacheDataInL1+ via [code]+(HColumnDescriptor.setCacheDataInL1(true)+            or in the shell, creating or amending column families setting [var]+CACHE_DATA_IN_L1+            to true: e.g. 
+[source]
+----
+hbase(main):003:0> create 't', {NAME => 't', CONFIGURATION => {CACHE_DATA_IN_L1 => 'true'}}
+----
+
+The BucketCache Block Cache can be deployed onheap, offheap, or file based.
+You set which via the [var]+hbase.bucketcache.ioengine+ setting.
+Setting it to [var]+heap+ will have BucketCache deployed inside the  allocated java heap.
+Setting it to [var]+offheap+ will have BucketCache make its allocations offheap, and an ioengine setting of [var]+file:PATH_TO_FILE+ will direct BucketCache to use a file caching (Useful in particular if you have some fast i/o attached to the box such as SSDs). 
+
+It is possible to deploy an L1+L2 setup where we bypass the CombinedBlockCache policy and have BucketCache working as a strict L2 cache to the L1 LruBlockCache.
+For such a setup, set [var]+CacheConfig.BUCKET_CACHE_COMBINED_KEY+ to [literal]+false+.
+In this mode, on eviction from L1, blocks go to L2.
+When a block is cached, it is cached first in L1.
+When we go to look for a cached block, we look first in L1 and if none found, then search L2.
+Let us call this deploy format, 
+_(((Raw L1+L2)))_.
+
+Other BucketCache configs include: specifying a location to persist cache to across restarts, how many threads to use writing the cache, etc.
+See the link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html[CacheConfig.html]              class for configuration options and descriptions.
+
+
+
+====== BucketCache Example Configuration
+This sample provides a configuration for a 4 GB offheap BucketCache with a 1 GB onheap cache.
+
+Configuration is performed on the RegionServer.
+
+Setting [var]+hbase.bucketcache.ioengine+ and  [var]+hbase.bucketcache.size+ > 0 enables CombinedBlockCache.
+Let us presume that the RegionServer has been set to run with a 5G heap: i.e.
+HBASE_HEAPSIZE=5g. 
+
+
+. First, edit the RegionServer's [path]_hbase-env.sh_ and set [var]+HBASE_OFFHEAPSIZE+ to a value greater than the offheap size wanted, in this case, 4 GB (expressed as 4G).  Lets set it to 5G.
+  That'll be 4G for our offheap cache and 1G for any other uses of offheap memory (there are other users of offheap memory other than BlockCache; e.g.
+  DFSClient  in RegionServer can make use of offheap memory). See <<direct.memory,direct.memory>>.
+  +
+[source]
+----
+HBASE_OFFHEAPSIZE=5G
+----
+
+. Next, add the following configuration to the RegionServer's [path]_hbase-site.xml_.
++
+[source,xml]
+----
+<property>
+  <name>hbase.bucketcache.ioengine</name>
+  <value>offheap</value>
+</property>
+<property>
+  <name>hfile.block.cache.size</name>
+  <value>0.2</value>
+</property>
+<property>
+  <name>hbase.bucketcache.size</name>
+  <value>4196</value>
+</property>
+----
+
+. Restart or rolling restart your cluster, and check the logs for any issues.
+
+
+In the above, we set bucketcache to be 4G.
+The onheap lrublockcache we configured to have 0.2 of the RegionServer's heap size (0.2 * 5G = 1G). In other words, you configure the L1 LruBlockCache as you would normally, as you would when there is no L2 BucketCache present. 
+
+link:https://issues.apache.org/jira/browse/HBASE-10641[HBASE-10641] introduced the ability to configure multiple sizes for the buckets of the bucketcache, in HBase 0.98 and newer.
+To configurable multiple bucket sizes, configure the new property +hfile.block.cache.sizes+ (instead of +hfile.block.cache.size+) to a comma-separated list of block sizes, ordered from smallest to largest, with no spaces.
+The goal is to optimize the bucket sizes based on your data access patterns.
+The following example configures buckets of size 4096 and 8192.
+
+----
+
+<property>
+  <name>hfile.block.cache.sizes</name>
+  <value>4096,8192</value>
+</property>
+----
+
+.Direct Memory Usage In HBase
+[NOTE]
+====
+The default maximum direct memory varies by JVM.
+Traditionally it is 64M or some relation to allocated heap size (-Xmx) or no limit at all (JDK7 apparently). HBase servers use direct memory, in particular short-circuit reading, the hosted DFSClient will allocate direct memory buffers.
+If you do offheap block caching, you'll be making use of direct memory.
+Starting your JVM, make sure the [var]+-XX:MaxDirectMemorySize+ setting in [path]_conf/hbase-env.sh_ is set to some value that is higher than what you have allocated to your offheap blockcache ([var]+hbase.bucketcache.size+).  It should be larger than your offheap block cache and then some for DFSClient usage (How much the DFSClient uses is not easy to quantify; it is the number of open hfiles * [var]+hbase.dfs.client.read.shortcircuit.buffer.size+                    where hbase.dfs.client.read.shortcircuit.buffer.size is set to 128k in HBase -- see [path]_hbase-default.xml_                    default configurations). Direct memory, which is part of the Java process heap, is separate from the object heap allocated by -Xmx.
+The value allocated by MaxDirectMemorySize must not exceed physical RAM, and is likely to be less than the total available RAM due to other memory requirements and system constraints. 
+
+You can see how much memory -- onheap and offheap/direct -- a RegionServer is configured to use and how much it is using at any one time by looking at the _Server Metrics: Memory_ tab in the UI.
+It can also be gotten via JMX.
+In particular the direct memory currently used by the server can be found on the [var]+java.nio.type=BufferPool,name=direct+ bean.
+Terracotta has a link:http://terracotta.org/documentation/4.0/bigmemorygo/configuration/storage-options[good write up] on using offheap memory in java.
+It is for their product BigMemory but alot of the issues noted apply in general to any attempt at going offheap.
+Check it out.
+====
+
+.hbase.bucketcache.percentage.in.combinedcache
+[NOTE]
+====
+This is a pre-HBase 1.0 configuration removed because it was confusing.
+It was a float that you would set to some value between 0.0 and 1.0.
+Its default was 0.9.
+If the deploy was using CombinedBlockCache, then the LruBlockCache L1 size was calculated to be (1 - [var]+hbase.bucketcache.percentage.in.combinedcache+) * [var]+size-of-bucketcache+  and the BucketCache size was [var]+hbase.bucketcache.percentage.in.combinedcache+ * size-of-bucket-cache.
+where size-of-bucket-cache itself is EITHER the value of the configuration hbase.bucketcache.size IF it was specified as megabytes OR [var]+hbase.bucketcache.size+ * [var]+-XX:MaxDirectMemorySize+ if [var]+hbase.bucketcache.size+ between 0 and 1.0. 
+
+In 1.0, it should be more straight-forward.
+L1 LruBlockCache size is set as a fraction of java heap using hfile.block.cache.size setting (not the best name) and L2 is set as above either in absolute megabytes or as a fraction of allocated maximum direct memory. 
+====
+
+==== Comprewssed Blockcache
+
+link:https://issues.apache.org/jira/browse/HBASE-11331[HBASE-11331] introduced lazy blockcache decompression, more simply referred to as compressed blockcache.
+When compressed blockcache is enabled.
+data and encoded data blocks are cached in the blockcache in their on-disk format, rather than being decompressed and decrypted before caching.
+
+For a RegionServer hosting more data than can fit into cache, enabling this feature with SNAPPY compression has been shown to result in 50% increase in throughput and 30% improvement in mean latency while, increasing garbage collection by 80% and increasing overall CPU load by 2%. See HBASE-11331 for more details about how performance was measured and achieved.
+For a RegionServer hosting data that can comfortably fit into cache, or if your workload is sensitive to extra CPU or garbage-collection load, you may receive less benefit.
+
+Compressed blockcache is disabled by default.
+To enable it, set [code]+hbase.block.data.cachecompressed+ to [code]+true+ in [path]_hbase-site.xml_ on all RegionServers.
+
+[[wal]]
+=== Write Ahead Log (WAL)
+
+[[purpose.wal]]
+==== Purpose
+
+The [firstterm]_Write Ahead Log (WAL)_ records all changes to data in HBase, to file-based storage.
+Under normal operations, the WAL is not needed because data changes move from the MemStore to StoreFiles.
+However, if a RegionServer crashes or becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to the data can be replayed.
+If writing to the WAL fails, the entire operation to modify the data fails.
+
+HBase uses an implementation of the link:http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/wal/WAL.html[WAL] interface.
+Usually, there is only one instance of a WAL per RegionServer.
+The RegionServer records Puts and Deletes to it, before recording them to the <<store.memstore,store.memstore>> for the affected <<store,store>>. 
+
+.The HLog
+[NOTE]
+====
+Prior to 2.0, the interface for WALs in HBase was named [class]+HLog+.
+In 0.94, HLog was the name of the implementation of the WAL.
+You will likely find references to the HLog in documentation tailored to these older versions. 
+====
+
+The WAL resides in HDFS in the [path]_/hbase/WALs/_ directory (prior to HBase 0.94, they were stored in [path]_/hbase/.logs/_), with subdirectories per region.
+
+For more general information about the concept of write ahead logs, see the Wikipedia link:http://en.wikipedia.org/wiki/Write-ahead_logging[Write-Ahead Log]            article. 
+
+[[wal_flush]]
+==== WAL Flushing
+
+TODO (describe). 
+
+==== WAL Splitting
+
+A RegionServer serves many regions.
+All of the regions in a region server share the same active WAL file.
+Each edit in the WAL file includes information about which region it belongs to.
+When a region is opened, the edits in the WAL file which belong to that region need to be replayed.
+Therefore, edits in the WAL file must be grouped by region so that particular sets can be replayed to regenerate the data in a particular region.
+The process of grouping the WAL edits by region is called [firstterm]_log
+              splitting_.
+It is a critical process for recovering data if a region server fails.
+
+Log splitting is done by the HMaster during cluster start-up or by the ServerShutdownHandler as a region server shuts down.
+So that consistency is guaranteed, affected regions are unavailable until data is restored.
+All WAL edits need to be recovered and replayed before a given region can become available again.
+As a result, regions affected by log splitting are unavailable until the process completes.
+
+.Procedure: Log Splitting, Step by Step
+. The [path]_/hbase/WALs/<host>,<port>,<startcode>_ directory is renamed.
++
+Renaming the directory is important because a RegionServer may still be up and accepting requests even if the HMaster thinks it is down.
+If the RegionServer does not respond immediately and does not heartbeat its ZooKeeper session, the HMaster may interpret this as a RegionServer failure.
+Renaming the logs directory ensures that existing, valid WAL files which are still in use by an active but busy RegionServer are not written to by accident.
++
+The new directory is named according to the following pattern:
++
+----
+/hbase/WALs/<host>,<port>,<startcode>-splitting
+----
++
+An example of such a renamed directory might look like the following:
++
+----
+/hbase/WALs/srv.example.com,60020,1254173957298-splitting
+----
+
+. Each log file is split, one at a time.
++
+The log splitter reads the log file one edit entry at a time and puts each edit entry into the buffer corresponding to the edit's region.
+At the same time, the splitter starts several writer threads.
+Writer threads pick up a corresponding buffer and write the edit entries in the buffer to a temporary recovered edit file.
+The temporary edit file is stored to disk with the following naming pattern:
++
+----
+/hbase/<table_name>/<region_id>/recovered.edits/.temp
+----
++
+This file is used to store all the edits in the WAL log for this region.
+After log splitting completes, the [path]_.temp_ file is renamed to the sequence ID of the first log written to the file.
++
+To determine whether all edits have been written, the sequence ID is compared to the sequence of the last edit that was written to the HFile.
+If the sequence of the last edit is greater than or equal to the sequence ID included in the file name, it is clear that all writes from the edit file have been completed.
+
+. After log splitting is complete, each affected region is assigned to a
+  RegionServer.
++
+When the region is opened, the [path]_recovered.edits_ folder is checked for recovered edits files.
+If any such files are present, they are replayed by reading the edits and saving them to the MemStore.
+After all edit files are replayed, the contents of the MemStore are written to disk (HFile) and the edit files are deleted.
+
+
+===== Handling of Errors During Log Splitting
+
+If you set the [var]+hbase.hlog.split.skip.errors+ option to [constant]+true+, errors are treated as follows:
+
+* Any error encountered during splitting will be logged.
+* The problematic WAL log will be moved into the [path]_.corrupt_                  directory under the hbase [var]+rootdir+,
+* Processing of the WAL will continue
+
+If the [var]+hbase.hlog.split.skip.errors+ optionset to [literal]+false+, the default, the exception will be propagated and the split will be logged as failed.
+See link:https://issues.apache.org/jira/browse/HBASE-2958[HBASE-2958 When
+                    hbase.hlog.split.skip.errors is set to false, we fail the split but thats
+                    it].
+We need to do more than just fail split if this flag is set.
+
+====== How EOFExceptions are treated when splitting a crashed RegionServers'WALs
+
+If an EOFException occurs while splitting logs, the split proceeds even when [var]+hbase.hlog.split.skip.errors+ is set to [literal]+false+.
+An EOFException while reading the last log in the set of files to split is likely, because the RegionServer is likely to be in the process of writing a record at the time of a crash.
+For background, see link:https://issues.apache.org/jira/browse/HBASE-2643[HBASE-2643
+                      Figure how to deal with eof splitting logs]
+
+===== Performance Improvements during Log Splitting
+
+WAL log splitting and recovery can be resource intensive and take a long time, depending on the number of RegionServers involved in the crash and the size of the regions. <<distributed.log.splitting,distributed.log.splitting>> and <<distributed.log.replay,distributed.log.replay>> were developed to improve performance during log splitting. 
+
+[[distributed.log.splitting]]
+====== Distributed Log Splitting
+
+[firstterm]_Distributed Log Splitting_ was added in HBase version 0.92 (link:https://issues.apache.org/jira/browse/HBASE-1364[HBASE-1364])  by Prakash Khemani from Facebook.
+It reduces the time to complete log splitting dramatically, improving the availability of regions and tables.
+For example, recovering a crashed cluster took around 9 hours with single-threaded log splitting, but only about six minutes with distributed log splitting.
+
+The information in this section is sourced from Jimmy Xiang's blog post at link:http://blog.cloudera.com/blog/2012/07/hbase-log-splitting/.
+
+.Enabling or Disabling Distributed Log Splitting
+
+Distributed log processing is enabled by default since HBase 0.92.
+The setting is controlled by the +hbase.master.distributed.log.splitting+                  property, which can be set to [literal]+true+ or [literal]+false+, but defaults to [literal]+true+. 
+
+[[log.splitting.step.by.step]]
+.Distributed Log Splitting, Step by Step
+
+After configuring distributed log splitting, the HMaster controls the process.
+The HMaster enrolls each RegionServer in the log splitting process, and the actual work of splitting the logs is done by the RegionServers.
+The general process for log splitting, as described in <<log.splitting.step.by.step,log.splitting.step.by.step>> still applies here.
+
+. If distributed log processing is enabled, the HMaster creates a [firstterm]_split log manager_ instance when the cluster is started.
+  .. The split log manager manages all log files which need to be scanned and split.
+  .. The split log manager places all the logs into the ZooKeeper splitlog node ([path]_/hbase/splitlog_) as tasks.
+  .. You can view the contents of the splitlog by issuing the following +zkcli+ command. Example output is shown.
++
+----
+ls /hbase/splitlog
+[hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost8.sample.com%2C57020%2C1340474893275-splitting%2Fhost8.sample.com%253A57020.1340474893900, 
+hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost3.sample.com%2C57020%2C1340474893299-splitting%2Fhost3.sample.com%253A57020.1340474893931, 
+hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost4.sample.com%2C57020%2C1340474893287-splitting%2Fhost4.sample.com%253A57020.1340474893946]
+----
++
+The output contains some non-ASCII characters.
+When decoded, it looks much more simple:
++
+----
+[hdfs://host2.sample.com:56020/hbase/.logs
+/host8.sample.com,57020,1340474893275-splitting
+/host8.sample.com%3A57020.1340474893900, 
+hdfs://host2.sample.com:56020/hbase/.logs
+/host3.sample.com,57020,1340474893299-splitting
+/host3.sample.com%3A57020.1340474893931, 
+hdfs://host2.sample.com:56020/hbase/.logs
+/host4.sample.com,57020,1340474893287-splitting
+/host4.sample.com%3A57020.1340474893946]
+----
++
+The listing represents WAL file names to be scanned and split, which is a list of log splitting tasks.
+
+. The split log manager monitors the log-splitting tasks and workers.
++
+The split log manager is responsible for the following ongoing tasks:
++
+* Once the split log manager publishes all the tasks to the splitlog znode, it monitors these task nodes and waits for them to be processed.
+* Checks to see if there are any dead split log workers queued up.
+  If it finds tasks claimed by unresponsive workers, it will resubmit those tasks.
+  If the resubmit fails due to some ZooKeeper exception, the dead worker is queued up again for retry.
+* Checks to see if there are any unassigned tasks.
+  If it finds any, it create an ephemeral rescan node so that each split log worker is notified to re-scan unassigned tasks via the [code]+nodeChildrenChanged+ ZooKeeper event.
+* Checks for tasks which are assigned but expired.
+  If any are found, they are moved back to [code]+TASK_UNASSIGNED+ state again so that they can be retried.
+  It is possible that these tasks are assigned to slow workers, or they may already be finished.
+  This is not a problem, because log splitting tasks have the property of idempotence.
+  In other words, the same log splitting task can be processed many times without causing any problem.
+* The split log manager watches the HBase split log znodes constantly.
+  If any split log task node data is changed, the split log manager retrieves the node data.
+  The node data contains the current state of the task.
+  You can use the +zkcli+ +get+ command to retrieve the current state of a task.
+  In the example output below, the first line of the output shows that the task is currently unassigned.
++
+----
+get /hbase/splitlog/hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost6.sample.com%2C57020%2C1340474893287-splitting%2Fhost6.sample.com%253A57020.1340474893945
+ 
+unassigned host2.sample.com:57000
+cZxid = 0×7115
+ctime = Sat Jun 23 11:13:40 PDT 2012
+...
+----
++
+Based on the state of the task whose data is changed, the split log manager does one of the following:
++
+* Resubmit the task if it is unassigned
+* Heartbeat the task if it is assigned
+* Resubmit or fail the task if it is resigned (see <<distributed.log.replay.failure.reasons,distributed.log.replay.failure.reasons>>)
+* Resubmit or fail the task if it is completed with errors (see <<distributed.log.replay.failure.reasons,distributed.log.replay.failure.reasons>>)
+* Resubmit or fail the task if it could not complete due to errors (see <<distributed.log.replay.failure.reasons,distributed.log.replay.failure.reasons>>)
+* Delete the task if it is successfully completed or failed
++
+* .Reasons a Task Will FailThe task has been deleted.
+* The node no longer exists.
+* The log status manager failed to move the state of the task to TASK_UNASSIGNED.
+* The number of resubmits is over the resubmit threshold.
+
+
+. Each RegionServer's split log worker performs the log-splitting tasks.
++
+Each RegionServer runs a daemon thread called the [firstterm]_split log
+                      worker_, which does the work to split the logs.
+The daemon thread starts when the RegionServer starts, and registers itself to watch HBase znodes.
+If any splitlog znode children change, it notifies a sleeping worker thread to wake up and grab more tasks.
+If if a worker's current task's node data is changed, the worker checks to see if the task has been taken by another worker.
+If so, the worker thread stops work on the current task.
++
+The worker monitors the splitlog znode constantly.
+When a new task appears, the split log worker retrieves  the task paths and checks each one until it finds an unclaimed task, which it attempts to claim.
+If the claim was successful, it attempts to perform the task and updates the task's +state+ property based on the splitting outcome.
+At this point, the split log worker scans for another unclaimed task.
++
+* .How the Split Log Worker Approaches a TaskIt queries the task state and only takes action if the task is in [literal]+TASK_UNASSIGNED +state.
+* If the task is is in [literal]+TASK_UNASSIGNED+ state, the worker attempts to set the state to [literal]+TASK_OWNED+ by itself.
+  If it fails to set the state, another worker will try to grab it.
+  The split log manager will also ask all workers to rescan later if the task remains unassigned.
+* If the worker succeeds in taking ownership of the task, it tries to get the task state again to make sure it really gets it asynchronously.
+  In the meantime, it starts a split task executor to do the actual work: 
++
+* Get the HBase root folder, create a temp folder under the root, and split the log file to the temp folder.
+* If the split was successful, the task executor sets the task to state [literal]+TASK_DONE+.
+* If the worker catches an unexpected IOException, the task is set to state [literal]+TASK_ERR+.
+* If the worker is shutting down, set the the task to state [literal]+TASK_RESIGNED+.
+* If the task is taken by another worker, just log it.
+
+
+. The split log manager monitors for uncompleted tasks.
++
+The split log manager returns when all tasks are completed successfully.
+If all tasks are completed with some failures, the split log manager throws an exception so that the log splitting can be retried.
+Due to an asynchronous implementation, in very rare cases, the split log manager loses track of some completed tasks.
+For that reason, it periodically checks for remaining uncompleted task in its task map or ZooKeeper.
+If none are found, it throws an exception so that the log splitting can be retried right away instead of hanging there waiting for something that won't happen.
+
+
+[[distributed.log.replay]]
+====== Distributed Log Replay
+
+After a RegionServer fails, its failed region is assigned to another RegionServer, which is marked as "recovering" in ZooKeeper.
+A split log worker directly replays edits from the WAL of the failed region server to the region at its new location.
+When a region is in "recovering" state, it can accept writes but no reads (including Append and Increment), region splits or merges. 
+
+Distributed Log Replay extends the <<distributed.log.splitting,distributed.log.splitting>> framework.
+It works by directly replaying WAL edits to another RegionServer instead of creating [path]_recovered.edits_ files.
+It provides the following advantages over distributed log splitting alone:
+
+* It eliminates the overhead of writing and reading a large number of [path]_recovered.edits_ files.
+  It is not unusual for thousands of [path]_recovered.edits_ files to be created and written concurrently during a RegionServer recovery.
+  Many small random writes can degrade overall system performance.
+* It allows writes even when a region is in recovering state.
+  It only takes seconds for a recovering region to accept writes again.
+
+.Enabling Distributed Log Replay
+To enable distributed log replay, set [var]+hbase.master.distributed.log.replay+ to true.
+This will be the default for HBase 0.99 (link:https://issues.apache.org/jira/browse/HBASE-10888[HBASE-10888]).
+
+You must also enable HFile version 3 (which is the default HFile format starting in HBase 0.99.
+See link:https://issues.apache.org/jira/browse/HBASE-10855[HBASE-10855]). Distributed log replay is unsafe for rolling upgrades.
+
+[[wal.disable]]
+==== Disabling the WAL
+
+It is possible to disable the WAL, to improve performace in certain specific situations.
+However, disabling the WAL puts your data at risk.
+The only situation where this is recommended is during a bulk load.
+This is because, in the event of a problem, the bulk load can be re-run with no risk of data loss.
+
+The WAL is disabled by calling the HBase client field [code]+Mutation.writeToWAL(false)+.
+Use the [code]+Mutation.setDurability(Durability.SKIP_WAL)+ and Mutation.getDurability() methods to set and get the field's value.
+There is no way to disable the WAL for only a specific table.
+
+WARNING: If you disable the WAL for anything other than bulk loads, your data is at risk.
+
+[[regions.arch]]
+== Regions
+
+Regions are the basic element of availability and distribution for tables, and are comprised of a Store per Column Family.
+The heirarchy of objects is as follows: 
+
+[source]
+----
+Table       (HBase table)
+    Region       (Regions for the table)
+         Store          (Store per ColumnFamily for each Region for the table)
+              MemStore           (MemStore for each Store for each Region for the table)
+              StoreFile          (StoreFiles for each Store for each Region for the table)
+                    Block             (Blocks within a StoreFile within a Store for each Region for the table)
+----     
+
+For a description of what HBase files look like when written to HDFS, see <<trouble.namenode.hbase.objects,trouble.namenode.hbase.objects>>. 
+
+[[arch.regions.size]]
+=== Considerations for Number of Regions
+
+In general, HBase is designed to run with a small (20-200) number of relatively large (5-20Gb) regions per server.
+The considerations for this are as follows:
+
+[[too_many_regions]]
+==== Why cannot I have too many regions?
+
+Typically you want to keep your region count low on HBase for numerous reasons.
+Usually right around 100 regions per RegionServer has yielded the best results.
+Here are some of the reasons below for keeping region count low:
+
+. MSLAB requires 2mb per memstore (that's 2mb per family per region). 1000 regions that have 2 families each is 3.9GB of heap used, and it's not even storing data yet.
+  NB: the 2MB value is configurable. 
+. If you fill all the regions at somewhat the same rate, the global memory usage makes it that it forces tiny flushes when you have too many regions which in turn generates compactions.
+  Rewriting the same data tens of times is the last thing you want.
+  An example is filling 1000 regions (with one family) equally and let's consider a lower bound for global memstore usage of 5GB (the region server would have a big heap). Once it reaches 5GB it will force flush the biggest region, at that point they should almost all have about 5MB of data so it would flush that amount.
+  5MB inserted later, it would flush another region that will now have a bit over 5MB of data, and so on.
+  This is currently the main limiting factor for the number of regions; see <<ops.capacity.regions.count,ops.capacity.regions.count>>                          for detailed formula. 
+. The master as is is allergic to tons of regions, and will take a lot of time assigning them and moving them around in batches.
+  The reason is that it's heavy on ZK usage, and it's not very async at the moment (could really be improved -- and has been imporoved a bunch in 0.96 hbase). 
+. In older versions of HBase (pre-v2 hfile, 0.90 and previous), tons of regions on a few RS can cause the store file index to rise, increasing heap usage and potentially creating memory pressure or OOME on the RSs 
+
+Another issue is the effect of the number of regions on mapreduce jobs; it is typical to have one mapper per HBase region.
+Thus, hosting only 5 regions per RS may not be enough to get sufficient number of tasks for a mapreduce job, while 1000 regions will generate far too many tasks. 
+
+See <<ops.capacity.regions,ops.capacity.regions>> for configuration guidelines.
+
+[[regions.arch.assignment]]
+=== Region-RegionServer Assignment
+
+This section describes how Regions are assigned to RegionServers. 
+
+[[regions.arch.assignment.startup]]
+==== Startup
+
+When HBase starts regions are assigned as follows (short version): 
+
+. The Master invokes the [code]+AssignmentManager+ upon startup.
+. The [code]+AssignmentManager+ looks at the existing region assignments in META.
+. If the region assignment is still valid (i.e., if the RegionServer is still online) then the assignment is kept.
+. If the assignment is invalid, then the [code]+LoadBalancerFactory+ is invoked to assign the region.
+  The [code]+DefaultLoadBalancer+ will randomly assign the region to a RegionServer.
+. META is updated with the RegionServer assignment (if needed) and the RegionServer start codes (start time of the RegionServer process) upon region opening by the RegionServer.          
+
+[[regions.arch.assignment.failover]]
+==== Failover
+
+When a RegionServer fails: 
+
+. The regions immediately become unavailable because the RegionServer is down.
+. The Master will detect that the RegionServer has failed.
+. The region assignments will be considered invalid and will be re-assigned just like the startup sequence.
+. In-flight queries are re-tried, and not lost.
+. Operations are switched to a new RegionServer within the following amount of time:
++
+[source]
+----
+ZooKeeper session timeout + split time + assignment/replay time
+----
+          
+
+[[regions.arch.balancer]]
+==== Region Load Balancing
+
+Regions can be periodically moved by the <<master.processes.loadbalancer,master.processes.loadbalancer>>. 
+
+[[regions.arch.states]]
+==== Region State Transition
+
+HBase maintains a state for each region and persists the state in META.
+The state of the META region itself is persisted in ZooKeeper.
+You can see the states of regions in transition in the Master web UI.
+Following is the list of possible region states.
+
+* .Possible Region StatesOFFLINE: the region is offline and not opening
+* OPENING: the region is in the process of being opened
+* OPEN: the region is open and the region server has notified the master
+* FAILED_OPEN: the region server failed to open the region
+* CLOSING: the region is in the process of being closed
+* CLOSED: the region server has closed the region and notified the master
+* FAILED_CLOSE: the region server failed to close the region
+* SPLITTING: the region server notified the master that the region is splitting
+* SPLIT: the region server notified the master that the region has finished splitting
+* SPLITTING_NEW: this region is being created by a split which is in progress
+* MERGING: the region server notified the master that this region is being merged with another region
+* MERGED: the region server notified the master that this region has been merged
+* MERGING_NEW: this region is being created by a merge of two regions
+
+.Region State Transitions
+image::region_states.png[]
+
+.Graph Legend
+* Brown: Offline state, a special state that can be transient (after closed before opening), terminal (regions of disabled tables), or initial (regions of newly created tables)
+* Palegreen: Online state that regions can serve requests
+* Lightblue: Transient states
+* Red: Failure states that need OPS attention
+* Gold: Terminal states of regions split/merged
+* Grey: Initial states of regions created through split/merge
+
+.Transition State Descriptions
+. The master moves a region from [literal]+OFFLINE+ to [literal]+OPENING+ state and tries to assign the region to a region server.
+  The region server may or may not have received the open region request.
+  The master retries sending the open region request to the region server until the RPC goes through or the master runs out of retries.
+  After the region server receives the open region request, the region server begins opening the region.
+. If the master is running out of retries, the master prevents the region server from opening the region by moving the region to [literal]+CLOSING+ state and trying to close it, even if the region server is starting to open the region.
+. After the region server opens the region, it continues to try to notify the master until the master moves the region to [literal]+OPEN+ state and notifies the region server.
+  The region is now open.
+. If the region server cannot open the region, it notifies the master.
+  The master moves the region to [literal]+CLOSED+ state and tries to open the region on a different region server.
+. If the master cannot open the region on any of a certain number of regions, it moves the region to [literal]+FAILED_OPEN+ state, and takes no further action until an operator intervenes from the HBase shell, or the server is dead.
+. The master moves a region from [literal]+OPEN+ to [literal]+CLOSING+ state.
+  The region server holding the region may or may not have received the close region request.
+  The master retries sending the close request to the server until the RPC goes through or the master runs out of retries.
+. If the region server is not online, or throws [code]+NotServingRegionException+, the master moves the region to [literal]+OFFLINE+ state and re-assigns it to a different region server.
+. If the region server is online, but not reachable after the master runs out of retries, the master moves the region to [literal]+FAILED_CLOSE+ state and takes no further action until an operator intervenes from the HBase shell, or the server is dead.
+. If the region server gets the close region request, it closes the region and notifies the master.
+  The master moves the region to [literal]+CLOSED+ state and re-assigns it to a different region server.
+. Before assigning a region, the master moves the region to [literal]+OFFLINE+ state automatically if it is in [literal]+CLOSED+ state.
+. When a region server is about to split a region, it notifies the master.
+  The master moves the region to be split from [literal]+OPEN+ to [literal]+SPLITTING+ state and add the two new regions to be created to the region server.
+  These two regions are in [literal]+SPLITING_NEW+ state initially.
+. After notifying the master, the region server starts to split the region.
+  Once past the point of no return, the region server notifies the master again so the master can update the META.
+  However, the master does not update the region states until it is notified by the server that the split is done.
+  If the split is successful, the splitting region is moved from [literal]+SPLITTING+ to [literal]+SPLIT+ state and the two new regions are moved from [literal]+SPLITTING_NEW+ to [literal]+OPEN+ state.
+. If the split fails, the splitting region is moved from [literal]+SPLITTING+ back to [literal]+OPEN+ state, and the two new regions which were created are moved from [literal]+SPLITTING_NEW+ to [literal]+OFFLINE+ state.
+. When a region server is about to merge two regions, it notifies the master first.
+  The master moves the two regions to be merged from [literal]+OPEN+ to [literal]+MERGING+state, and adds the new region which will hold the contents of the merged regions region to the region server.
+  The new region is in [literal]+MERGING_NEW+ state initially.
+. After notifying the master, the region server starts to merge the two regions.
+  Once past the point of no return, the region server notifies the master again so the master can update the META.
+  However, the master does not update the region states until it is notified by the region server that the merge has completed.
+  If the merge is successful, the two merging regions are moved from [literal]+MERGING+ to [literal]+MERGED+ state and the new region is moved from [literal]+MERGING_NEW+ to [literal]+OPEN+ state.
+. If the merge fails, the two merging regions are moved from [literal]+MERGING+ back to [literal]+OPEN+ state, and the new region which was created to hold the contents of the merged regions is moved from [literal]+MERGING_NEW+ to [literal]+OFFLINE+ state.
+. For regions in [literal]+FAILED_OPEN+ or [literal]+FAILED_CLOSE+                states , the master tries to close them again when they are reassigned by an operator via HBase Shell. 
+
+[[regions.arch.locality]]
+=== Region-RegionServer Locality
+
+Over time, Region-RegionServer locality is achieved via HDFS block replication.
+The HDFS client does the following by default when choosing locations to write replicas:
+
+. First replica is written to local node
+. Second replica is written to a random node on another rack
+. Third replica is written on the same rack as the second, but on a different node chosen randomly
+. Subsequent replicas are written on random nodes on the cluster.
+  See _Replica Placement: The First Baby Steps_ on this page: link:http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html[HDFS Architecture]
+
+Thus, HBase eventually achieves locality for a region after a flush or a compaction.
+In a RegionServer failover situation a RegionServer may be assigned regions with non-local StoreFiles (because none of the replicas are local), however as new data is written in the region, or the table is compacted and StoreFiles are re-written, they will become "local" to the RegionServer. 
+
+For more information, see _Replica Placement: The First Baby Steps_ on this page: link:http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html[HDFS Architecture]        and also Lars George's blog on link:http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html[HBase and HDFS locality]. 
+
+[[arch.region.splits]]
+=== Region Splits
+
+Regions split when they reach a configured threshold.
+Below we treat the topic in short.
+For a longer exposition, see link:http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/[Apache HBase Region Splitting and Merging]        by our Enis Soztutar. 
+
+Splits run unaided on the RegionServer; i.e.
+the Master does not participate.
+The RegionServer splits a region, offlines the split region and then adds the daughter regions to META, opens daughters on the parent's hosting RegionServer and then reports the split to the Master.
+See <<disable.splitting,disable.splitting>> for how to manually manage splits (and for why you might do this)
+
+==== Custom Split Policies
+
+The default split policy can be overwritten using a custom link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/RegionSplitPolicy.html[RegionSplitPolicy] (HBase 0.94+). Typically a custom split policy should extend HBase's default split policy: link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/ConstantSizeRegionSplitPolicy.html[ConstantSizeRegionSplitPolicy]. 
+
+The policy can set globally through the HBaseConfiguration used or on a per table basis: 
+[source,java]
+----
+
+HTableDescriptor myHtd = ...;
+myHtd.setValue(HTableDescriptor.SPLIT_POLICY, MyCustomSplitPolicy.class.getName());
+----          
+
+[[manual_region_splitting_decisions]]
+=== Manual Region Splitting
+
+It is possible to manually split your table, either at table creation (pre-splitting), or at a later time as an administrative action.
+You might choose to split your region for one or more of the following reasons.
+There may be other valid reasons, but the need to manually split your table might also point to problems with your schema design.
+
+* .Reasons to Manually Split Your TableYour data is sorted by timeseries or another similar algorithm that sorts new data at the end of the table.
+  This means that the Region Server holding the last region is always under load, and the other Region Servers are idle, or mostly idle.
+  See also <<timeseries,timeseries>>.
+* You have developed an unexpected hotspot in one region of your table.
+  For instance, an application which tracks web searches might be inundated by a lot of searches for a celebrity in the event of news about that celebrity.
+  See <<perf.one.region,perf.one.region>> for more discussion about this particular scenario.
+* After a big increase to the number of Region Servers in your cluster, to get the load spread out quickly.
+* Before a bulk-load which is likely to cause unusual and uneven load across regions.
+
+See <<disable.splitting,disable.splitting>> for a discussion about the dangers and possible benefits of managing splitting completely manually.
+
+==== Determining Split Points
+
+The goal of splitting your table manually is to improve the chances of balancing the load across the cluster in situations where good rowkey design alone won't get you there.
+Keeping that in mind, the way you split your regions is very dependent upon the characteristics of your data.
+It may be that you already know the best way to split your table.
+If not, the way you split your table depends on what your keys are like.
+
+Alphanumeric Rowkeys::
+  If your rowkeys start with a letter or number, you can split your table at letter or number boundaries.
+  For instance, the following command creates a table with regions that split at each vowel, so the first region has A-D, the second region has E-H, the third region has I-N, the fourth region has O-V, and the fifth region has U-Z.
+
+Using a Custom Algorithm::
+  The RegionSplitter tool is provided with HBase, and uses a [firstterm]_SplitAlgorithm_ to determine split points for you.
+  As parameters, you give it the algorithm, desired number of regions, and column families.
+  It includes two split algorithms.
+  The first is the [code]+HexStringSplit+ algorithm, which assumes the row keys are hexadecimal strings.
+  The second, link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/RegionSplitter.UniformSplit.html[UniformSplit], assumes the row keys are random byte arrays.
+  You will probably need to develop your own SplitAlgorithm, using the provided ones as models. 
+
+=== Online Region Merges
+
+Both Master and Regionserver participate in the event of online region merges.
+Client sends merge RPC to master, then master moves the regions together to the same regionserver where the more heavily loaded region resided, finally master send merge request to this regionserver and regionserver run the region merges.
+Similar with process of region splits, region merges run as a local transaction on the regionserver, offlines the regions and then merges two regions on the file system, atomically delete merging regions from META and add merged region to the META, opens merged region on the regionserver and reports the merge to Master at last. 
+
+An example of region merges in the hbase shell 
+[source,bourne]
+----
+$ hbase> merge_region 'ENCODED_REGIONNAME', 'ENCODED_REGIONNAME'
+          hbase> merge_region 'ENCODED_REGIONNAME', 'ENCODED_REGIONNAME', true
+----          
+It's an asynchronous operation and call returns immediately without waiting merge completed.
+Passing 'true' as the optional third parameter will force a merge ('force' merges regardless else merge will fail unless passed adjacent regions.
+'force' is for expert use only) 
+
+=== Store
+
+A Store hosts a MemStore and 0 or more StoreFiles (HFiles). A Store corresponds to a column family for a table for a given region. 
+
+[[store.memstore]]
+==== MemStore
+
+The MemStore holds in-memory modifications to the Store.
+Modifications are Cells/KeyValues.
+When a flush is requested, the current memstore is moved to a snapshot and is cleared.
+HBase continues to serve edits from the new memstore and backing snapshot until the flusher reports that the flush succeeded.
+At this point, the snapshot is discarded.
+Note that when the flush happens, Memstores that belong to the same region will all be flushed.
+
+==== MemStoreFlush
+
+A MemStore flush can be triggered under any of the conditions listed below.
+The minimum flush unit is per region, not at individual MemStore level.
+
+. When a MemStore reaches the value specified by [var]+hbase.hregion.memstore.flush.size+, all MemStores that belong to its region will be flushed out to disk.
+. When overall memstore usage reaches the value specified by [var]+hbase.regionserver.global.memstore.upperLimit+, MemStores from various regions will be flushed out to disk to reduce overall MemStore usage in a Region Server.
+  The flush order is based on the descending order of a region's MemStore usage.
+  Regions will have their MemStores flushed until the overall MemStore usage drops to or slightly below [var]+hbase.regionserver.global.memstore.lowerLimit+. 
+. When the number of WAL per region server reaches the value specified in [var]+hbase.regionserver.max.logs+, MemStores from various regions will be flushed out to disk to reduce WAL count.
+  The flush order is based on time.
+  Regions with the oldest MemStores are flushed first until WAL count drops below [var]+hbase.regionserver.max.logs+. 
+
+[[hregion.scans]]
+==== Scans
+
+* When a client issues a scan against a table, HBase generates [code]+RegionScanner+ objects, one per region, to serve the scan request. 
+* The [code]+RegionScanner+ object contains a list of [code]+StoreScanner+ objects, one per column family. 
+* Each [code]+StoreScanner+ object further contains a list of [code]+StoreFileScanner+ objects, corresponding to each StoreFile and HFile of the corresponding column family, and a list of [code]+KeyValueScanner+ objects for the MemStore. 
+* The two lists are merge into one, which is sorted in ascending order with the scan object for the MemStore at the end of the list.
+* When a [code]+StoreFileScanner+ object is constructed, it is associated with a [code]+MultiVersionConsistencyControl+ read point, which is the current [code]+memstoreTS+, filtering out any new updates beyond the read point. 
+
+[[hfile]]
+==== StoreFile (HFile)
+
+StoreFiles are where your data lives. 
+
+===== HFile Format
+
+The _hfile_ file format is based on the SSTable file described in the link:http://research.google.com/archive/bigtable.html[BigTable [2006]] paper and on Hadoop's link:http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/file/tfile/TFile.html[tfile]              (The unit test suite and the compression harness were taken directly from tfile). Schubert Zhang's blog post on link:http://cloudepr.blogspot.com/2009/09/hfile-block-indexed-file-format-to.html[HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs] makes for a thorough introduction to HBase's hfile.
+Matteo Bertozzi has also put up a helpful description, link:http://th30z.blogspot.com/2011/02/hbase-io-hfile.html?spref=tw[HBase I/O: HFile]. 
+
+For more information, see the link:http://hbase.apache.org/xref/org/apache/hadoop/hbase/io/hfile/HFile.html[HFile source code].
+Also see <<hfilev2,hfilev2>> for information about the HFile v2 format that was included in 0.92. 
+
+===== HFile Tool
+
+To view a textualized version of hfile content, you can do use the [class]+org.apache.hadoop.hbase.io.hfile.HFile
+        +tool.
+Type the following to see usage:
+[source,bourne]
+----
+$ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile
+----
+For example, to view the content of the file [path]_hdfs://10.81.47.41:8020/hbase/TEST/1418428042/DSMP/4759508618286845475_, type the following:
+[source,bourne]
+----
+ $ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile -v -f hdfs://10.81.47.41:8020/hbase/TEST/1418428042/DSMP/4759508618286845475
+----
+If you leave off the option -v to see just a summary on the hfile.
+See usage for other things to do with the [class]+HFile+        tool.
+
+[[store.file.dir]]
+===== StoreFile Directory Structure on HDFS
+
+For more information of what StoreFiles look like on HDFS with respect to the directory structure, see <<trouble.namenode.hbase.objects,trouble.namenode.hbase.objects>>. 
+
+[[hfile.blocks]]
+==== Blocks
+
+StoreFiles are composed of blocks.
+The blocksize is configured on a per-ColumnFamily basis. 
+
+Compression happens at the block level within StoreFiles.
+For more information on compression, see <<compression,compression>>. 
+
+For more information on blocks, see the link:http://hbase.apache.org/xref/org/apache/hadoop/hbase/io/hfile/HFileBlock.html[HFileBlock source code]. 
+
+==== KeyValue
+
+The KeyValue class is the heart of data storage in HBase.
+KeyValue wraps a byte array and takes offsets and lengths into passed array at where to start interpreting the content as KeyValue. 
+
+The KeyValue format inside a byte array is: 
+
+* keylength
+* valuelength
+* key
+* value        
+
+The Key is further decomposed as: 
+
+* rowlength
+* row (i.e., the rowkey)
+* columnfamilylength
+* columnfamily
+* columnqualifier
+* timestamp
+* keytype (e.g., Put, Delete, DeleteColumn, DeleteFamily)        
+
+KeyValue instances are _not_ split across blocks.
+For example, if there is an 8 MB KeyValue, even if the block-size is 64kb this KeyValue will be read in as a coherent block.
+For more information, see the link:http://hbase.apache.org/xref/org/apache/hadoop/hbase/KeyValue.html[KeyValue source code]. 
+
+[[keyvalue.example]]
+===== Example
+
+To emphasize the points above, examine what happens with two Puts for two different columns for the same row:
+
+* Put #1:  [code]+rowkey=row1, cf:attr1=value1+
+* Put #2:  [code]+rowkey=row1, cf:attr2=value2+
+
+Even though these are for the same row, a KeyValue is created for each column:
+
+Key portion for Put #1: 
+
+* rowlength [code]+------------> 4+
+* row [code]+-----------------> row1+
+* columnfamilylength [code]+---> 2+
+* columnfamily [code]+--------> cf+
+* columnqualifier [code]+------> attr1+
+* timestamp [code]+-----------> server time of Put+
+* keytype [code]+-------------> Put+          
+
+Key portion for Put #2: 
+
+* rowlength [code]+------------> 4+
+* row [code]+-----------------> row1+
+* columnfamilylength [code]+---> 2+
+* columnfamily [code]+--------> cf+
+* columnqualifier [code]+------> attr2+
+* timestamp [code]+-----------> server time of Put+
+* keytype [code]+-------------> Put+                     
+
+It is critical to understand that the rowkey, ColumnFamily, and column (aka columnqualifier) are embedded within the KeyValue instance.
+The longer these identifiers are, the bigger the KeyValue is.
+
+==== Compaction
+
+* .Ambiguous TerminologyA [firstterm]_StoreFile_ is a facade of HFile.
+  In terms of compaction, use of StoreFile seems to have prevailed in the past.
+* A [firstterm]_Store_ is the same thing as a ColumnFamily.
+  StoreFiles are related to a Store, or C

<TRUNCATED>

Mime
View raw message