Return-Path: X-Original-To: apmail-hbase-commits-archive@www.apache.org Delivered-To: apmail-hbase-commits-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 11BEDEDCF for ; Wed, 13 Mar 2013 15:21:28 +0000 (UTC) Received: (qmail 55512 invoked by uid 500); 13 Mar 2013 15:21:27 -0000 Delivered-To: apmail-hbase-commits-archive@hbase.apache.org Received: (qmail 55388 invoked by uid 500); 13 Mar 2013 15:21:27 -0000 Mailing-List: contact commits-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hbase.apache.org Delivered-To: mailing list commits@hbase.apache.org Received: (qmail 55379 invoked by uid 99); 13 Mar 2013 15:21:27 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 13 Mar 2013 15:21:27 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=5.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.4] (HELO eris.apache.org) (140.211.11.4) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 13 Mar 2013 15:21:23 +0000 Received: from eris.apache.org (localhost [127.0.0.1]) by eris.apache.org (Postfix) with ESMTP id E08532388993; Wed, 13 Mar 2013 15:20:21 +0000 (UTC) Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: svn commit: r1455996 [2/7] - in /hbase/branches/0.94/src: docbkx/ site/ site/resources/css/ site/resources/images/ site/xdoc/ Date: Wed, 13 Mar 2013 15:20:20 -0000 To: commits@hbase.apache.org From: stack@apache.org X-Mailer: svnmailer-1.0.8-patched Message-Id: <20130313152021.E08532388993@eris.apache.org> X-Virus-Checked: Checked by ClamAV on apache.org Modified: hbase/branches/0.94/src/docbkx/book.xml URL: http://svn.apache.org/viewvc/hbase/branches/0.94/src/docbkx/book.xml?rev=1455996&r1=1455995&r2=1455996&view=diff ============================================================================== --- hbase/branches/0.94/src/docbkx/book.xml (original) +++ hbase/branches/0.94/src/docbkx/book.xml Wed Mar 13 15:20:19 2013 @@ -1,7 +1,6 @@ HBase and Schema Design A good general introduction on the strength and weaknesses modelling on - the various non-rdbms datastores is Ian Varleys' Master thesis, + the various non-rdbms datastores is Ian Varley's Master thesis, No Relation: The Mixed Blessings of Non-Relational Databases. Recommended. Also, read for how HBase stores data internally. @@ -575,31 +581,31 @@ htable.put(put); Tables must be disabled when making ColumnFamily modifications, for example.. -Configuration config = HBaseConfiguration.create(); -HBaseAdmin admin = new HBaseAdmin(conf); +Configuration config = HBaseConfiguration.create(); +HBaseAdmin admin = new HBaseAdmin(conf); String table = "myTable"; -admin.disableTable(table); +admin.disableTable(table); HColumnDescriptor cf1 = ...; admin.addColumn(table, cf1); // adding new ColumnFamily HColumnDescriptor cf2 = ...; admin.modifyColumn(table, cf2); // modifying existing ColumnFamily -admin.enableTable(table); +admin.enableTable(table); See for more information about configuring client connections. Note: online schema changes are supported in the 0.92.x codebase, but the 0.90.x codebase requires the table to be disabled. -
Schema Updates +
Schema Updates When changes are made to either Tables or ColumnFamilies (e.g., region size, block size), these changes take effect the next time there is a major compaction and the StoreFiles get re-written. See for more information on StoreFiles.
-
+
On the number of column families @@ -610,7 +616,7 @@ admin.enableTable(table); if one column family is carrying the bulk of the data bringing on flushes, the adjacent families will also be flushed though the amount of data they carry is small. When many column families the flushing and compaction interaction can make for a bunch of needless i/o loading (To be addressed by - changing flushing and compaction to work on a per column family basis). For more information + changing flushing and compaction to work on a per column family basis). For more information on compactions, see <xref linkend="compaction"/>. </para> <para>Try to make do with one column family if you can in your schemas. Only introduce a @@ -618,9 +624,9 @@ admin.enableTable(table); i.e. you query one column family or the other but usually not both at the one time. </para> <section xml:id="number.of.cfs.card"><title>Cardinality of ColumnFamilies - Where multiple ColumnFamilies exist in a single table, be aware of the cardinality (i.e., number of rows). - If ColumnFamilyA has 1 million rows and ColumnFamilyB has 1 billion rows, ColumnFamilyA's data will likely be spread - across many, many regions (and RegionServers). This makes mass scans for ColumnFamilyA less efficient. + Where multiple ColumnFamilies exist in a single table, be aware of the cardinality (i.e., number of rows). + If ColumnFamilyA has 1 million rows and ColumnFamilyB has 1 billion rows, ColumnFamilyA's data will likely be spread + across many, many regions (and RegionServers). This makes mass scans for ColumnFamilyA less efficient.
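Following the advice above to make do with a single column family where possible, the following is a minimal sketch of creating such a table; the table name and the one-character family name are illustrative assumptions.

Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config);
HTableDescriptor desc = new HTableDescriptor("myTable");  // hypothetical table name
desc.addFamily(new HColumnDescriptor("d"));               // one short-named family carries all the data
admin.createTable(desc);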
@@ -632,7 +638,7 @@ admin.enableTable(table); In the HBase chapter of Tom White's book Hadoop: The Definitive Guide (O'Reilly) there is a an optimization note on watching out for a phenomenon where an import process walks in lock-step with all clients in concert pounding one of the table's regions (and thus, a single node), then moving onto the next region, etc. With monotonically increasing row-keys (i.e., using a timestamp), this will happen. See this comic by IKai Lan on why monotonically increasing row keys are problematic in BigTable-like datastores: monotonically increasing values are bad. The pile-up on a single region brought on - by monotonically increasing keys can be mitigated by randomizing the input records to not be in sorted order, but in general its best to avoid using a timestamp or a sequence (e.g. 1, 2, 3) as the row-key. + by monotonically increasing keys can be mitigated by randomizing the input records to not be in sorted order, but in general it's best to avoid using a timestamp or a sequence (e.g. 1, 2, 3) as the row-key. @@ -670,20 +676,20 @@ admin.enableTable(table); See for more information on HBase stores data internally to see why this is important.
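Relating back to the note above on monotonically increasing row-keys, the sketch below shows one common mitigation: prefixing the key with a small salt so writes spread across regions. The 16-bucket salt, the "d" family, and the htable instance are assumptions for illustration; note that reads must then fan out across all salt buckets.

long timestamp = System.currentTimeMillis();
byte[] naturalKey = Bytes.toBytes(timestamp);                   // monotonically increasing on its own
int bucket = (Arrays.hashCode(naturalKey) & 0x7fffffff) % 16;   // stable salt derived from the key
byte[] rowkey = Bytes.add(new byte[] { (byte) bucket }, naturalKey);

Put put = new Put(rowkey);
put.add(Bytes.toBytes("d"), Bytes.toBytes("attr1"), Bytes.toBytes("value"));
htable.put(put);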
Column Families Try to keep the ColumnFamily names as small as possible, preferably one character (e.g. "d" for data/default). - + See the section on how HBase stores data internally for why this is important.
Attributes Although verbose attribute names (e.g., "myVeryImportantAttribute") are easier to read, prefer shorter attribute names (e.g., "via") to store in HBase. - + See the section on how HBase stores data internally for why this is important.
Rowkey Length - Keep them as short as is reasonable such that they can still be useful for required data access (e.g., Get vs. Scan). + Keep them as short as is reasonable such that they can still be useful for required data access (e.g., Get vs. Scan). A short key that is useless for data access is not better than a longer key with better get/scan properties. Expect tradeoffs when designing rowkeys. - +
Byte Patterns A long is 8 bytes. You can store an unsigned number up to 18,446,744,073,709,551,615 in those eight bytes. @@ -696,28 +702,28 @@ admin.enableTable(table); long l = 1234567890L; byte[] lb = Bytes.toBytes(l); System.out.println("long bytes length: " + lb.length); // returns 8 - + String s = "" + l; byte[] sb = Bytes.toBytes(s); System.out.println("long as string length: " + sb.length); // returns 10 - -// hash + +// hash // MessageDigest md = MessageDigest.getInstance("MD5"); byte[] digest = md.digest(Bytes.toBytes(s)); System.out.println("md5 digest bytes length: " + digest.length); // returns 16 - + String sDigest = new String(digest); byte[] sbDigest = Bytes.toBytes(sDigest); -System.out.println("md5 digest as string length: " + sbDigest.length); // returns 26 - +System.out.println("md5 digest as string length: " + sbDigest.length); // returns 26 +
- +
Reverse Timestamps A common problem in database processing is quickly finding the most recent version of a value. A technique using reverse timestamps - as a part of the key can help greatly with a special case of this problem. Also found in the HBase chapter of Tom White's book Hadoop: The Definitive Guide (O'Reilly), + as a part of the key can help greatly with a special case of this problem. Also found in the HBase chapter of Tom White's book Hadoop: The Definitive Guide (O'Reilly), the technique involves appending (Long.MAX_VALUE - timestamp) to the end of any key, e.g., [key][reverse_timestamp]. The most recent value for [key] in a table can be found by performing a Scan for [key] and obtaining the first record. Since HBase keys @@ -734,11 +740,76 @@ System.out.println("md5 digest as string
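As a sketch of the reverse-timestamp pattern just described (the key "logkey", the family "d", and the htable instance are illustrative assumptions):

long reverseTs = Long.MAX_VALUE - System.currentTimeMillis();
byte[] rowkey = Bytes.add(Bytes.toBytes("logkey"), Bytes.toBytes(reverseTs));

Put put = new Put(rowkey);
put.add(Bytes.toBytes("d"), Bytes.toBytes("attr1"), Bytes.toBytes("latest value"));
htable.put(put);

// The newest entry for "logkey" sorts first, so a Scan starting at the bare key
// returns it as the first row.
Scan scan = new Scan(Bytes.toBytes("logkey"));
ResultScanner scanner = htable.getScanner(scan);
Result mostRecent = scanner.next();
scanner.close();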
Immutability of Rowkeys Rowkeys cannot be changed. The only way they can be "changed" in a table is if the row is deleted and then re-inserted. - This is a fairly common question on the HBase dist-list so it pays to get the rowkeys right the first time (and/or before you've + This is a fairly common question on the HBase dist-list so it pays to get the rowkeys right the first time (and/or before you've inserted a lot of data).
- +
Relationship Between RowKeys and Region Splits + If you pre-split your table, it is critical to understand how your rowkey will be distributed across + the region boundaries. As an example of why this is important, consider the example of using displayable hex characters as the + lead position of the key (e.g., ""0000000000000000" to "ffffffffffffffff"). Running those key ranges through Bytes.split + (which is the split strategy used when creating regions in HBaseAdmin.createTable(byte[] startKey, byte[] endKey, numRegions) + for 10 regions will generate the following splits... + + + +48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 // 0 +54 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 // 6 +61 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -68 // = +68 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -126 // D +75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 72 // K +82 18 18 18 18 18 18 18 18 18 18 18 18 18 18 14 // R +88 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -44 // X +95 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -102 // _ +102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 // f + + ... (note: the lead byte is listed to the right as a comment.) Given that the first split is a '0' and the last split is an 'f', + everything is great, right? Not so fast. + + The problem is that all the data is going to pile up in the first 2 regions and the last region thus creating a "lumpy" (and + possibly "hot") region problem. To understand why, refer to an ASCII Table. + '0' is byte 48, and 'f' is byte 102, but there is a huge gap in byte values (bytes 58 to 96) that will never appear in this + keyspace because the only values are [0-9] and [a-f]. Thus, the middle regions regions will + never be used. To make pre-spliting work with this example keyspace, a custom definition of splits (i.e., and not relying on the + built-in split method) is required. + + Lesson #1: Pre-splitting tables is generally a best practice, but you need to pre-split them in such a way that all the + regions are accessible in the keyspace. While this example demonstrated the problem with a hex-key keyspace, the same problem can happen + with any keyspace. Know your data. + + Lesson #2: While generally not advisable, using hex-keys (and more generally, displayable data) can still work with pre-split + tables as long as all the created regions are accessible in the keyspace. + + To conclude this example, the following is an example of how appropriate splits can be pre-created for hex-keys:. + +public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, byte[][] splits) +throws IOException { + try { + admin.createTable( table, splits ); + return true; + } catch (TableExistsException e) { + logger.info("table " + table.getNameAsString() + " already exists"); + // the table already exists... 
+ return false; + } +} + +public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) { + byte[][] splits = new byte[numRegions-1][]; + BigInteger lowestKey = new BigInteger(startKey, 16); + BigInteger highestKey = new BigInteger(endKey, 16); + BigInteger range = highestKey.subtract(lowestKey); + BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions)); + lowestKey = lowestKey.add(regionIncrement); + for(int i=0; i < numRegions-1;i++) { + BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i))); + byte[] b = String.format("%016x", key).getBytes(); + splits[i] = b; + } + return splits; +} +
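The helpers above could then be used along the following lines; the table and family names are illustrative assumptions, and getHexSplits is given the same 16-character hex keyspace as the example.

HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
HTableDescriptor table = new HTableDescriptor("myHexKeyedTable");   // hypothetical table name
table.addFamily(new HColumnDescriptor("d"));

byte[][] splits = getHexSplits("0000000000000000", "ffffffffffffffff", 10);  // 10 regions
createTable(admin, table, splits);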
+
Number of Versions @@ -752,8 +823,8 @@ System.out.println("md5 digest as string stores different values per row by time (and qualifier). Excess versions are removed during major compactions. The number of max versions may need to be increased or decreased depending on application needs. </para> - <para>It is not recommended setting the number of max versions to an exceedingly high level (e.g., hundreds or more) unless those old values are - very dear to you because this will greatly increase StoreFile size. + <para>It is not recommended setting the number of max versions to an exceedingly high level (e.g., hundreds or more) unless those old values are + very dear to you because this will greatly increase StoreFile size. </para> </section> <section xml:id="schema.minversions"> @@ -778,24 +849,24 @@ System.out.println("md5 digest as string HBase supports a "bytes-in/bytes-out" interface via Put and Result, so anything that can be - converted to an array of bytes can be stored as a value. Input could be strings, numbers, complex objects, or even images as long as they can rendered as bytes. + converted to an array of bytes can be stored as a value. Input could be strings, numbers, complex objects, or even images as long as they can rendered as bytes. There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); - search the mailling list for conversations on this topic. All rows in HBase conform to the datamodel, and - that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily. + search the mailling list for conversations on this topic. All rows in HBase conform to the datamodel, and + that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily.
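A minimal sketch of the bytes-in/bytes-out interface described above; the table, family and qualifier names are illustrative assumptions.

HTable htable = new HTable(HBaseConfiguration.create(), "myTable");

Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("d"), Bytes.toBytes("count"), Bytes.toBytes(1234L));  // a long rendered as bytes
htable.put(put);

Get get = new Get(Bytes.toBytes("row1"));
Result result = htable.get(get);
long stored = Bytes.toLong(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("count")));  // back to a long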
Counters One supported datatype that deserves special mention is "counters" (i.e., the ability to do atomic increments of numbers). See Increment in HTable. Synchronization on counters is done on the RegionServer, not in the client.
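A short sketch of using counters from the client; the row, family and qualifier names are illustrative assumptions, and htable is an existing HTable instance.

// Atomically add 1 to a cell; the read-modify-write happens on the RegionServer.
long newValue = htable.incrementColumnValue(
    Bytes.toBytes("row1"),     // row
    Bytes.toBytes("d"),        // column family
    Bytes.toBytes("hits"),     // qualifier
    1L);                       // amount to add

// The Increment class can batch several counter updates against one row.
Increment incr = new Increment(Bytes.toBytes("row1"));
incr.addColumn(Bytes.toBytes("d"), Bytes.toBytes("hits"), 1L);
incr.addColumn(Bytes.toBytes("d"), Bytes.toBytes("misses"), 1L);
Result counters = htable.increment(incr);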
+
Joins If you have multiple tables, don't forget to factor in the potential for joins into the schema design.
@@ -828,22 +899,22 @@ System.out.println("md5 digest as string Secondary Indexes and Alternate Query Paths This section could also be titled "what if my table rowkey looks like this but I also want to query my table like that." - A common example on the dist-list is where a row-key is of the format "user-timestamp" but there are are reporting requirements on activity across users for certain + A common example on the dist-list is where a row-key is of the format "user-timestamp" but there are reporting requirements on activity across users for certain time ranges. Thus, selecting by user is easy because it is in the lead position of the key, but time is not. There is no single answer on the best way to handle this because it depends on... - Number of users + Number of users Data size and data arrival rate - Flexibility of reporting requirements (e.g., completely ad-hoc date selection vs. pre-configured ranges) - Desired execution speed of query (e.g., 90 seconds may be reasonable to some for an ad-hoc report, whereas it may be too long for others) + Flexibility of reporting requirements (e.g., completely ad-hoc date selection vs. pre-configured ranges) + Desired execution speed of query (e.g., 90 seconds may be reasonable to some for an ad-hoc report, whereas it may be too long for others) - ... and solutions are also influenced by the size of the cluster and how much processing power you have to throw at the solution. - Common techniques are in sub-sections below. This is a comprehensive, but not exhaustive, list of approaches. + ... and solutions are also influenced by the size of the cluster and how much processing power you have to throw at the solution. + Common techniques are in sub-sections below. This is a comprehensive, but not exhaustive, list of approaches. - It should not be a surprise that secondary indexes require additional cluster space and processing. + It should not be a surprise that secondary indexes require additional cluster space and processing. This is precisely what happens in an RDBMS because the act of creating an alternate index requires both space and processing cycles to update. RBDMS products - are more advanced in this regard to handle alternative index management out of the box. However, HBase scales better at larger data volumes, so this is a feature trade-off. + are more advanced in this regard to handle alternative index management out of the box. However, HBase scales better at larger data volumes, so this is a feature trade-off. Pay attention to when implementing any of these approaches. Additionally, see the David Butler response in this dist-list thread HBase, mail # user - Stargate+hbase @@ -860,7 +931,7 @@ System.out.println("md5 digest as string Periodic-Update Secondary Index - A secondary index could be created in an other table which is periodically updated via a MapReduce job. The job could be executed intra-day, but depending on + A secondary index could be created in an other table which is periodically updated via a MapReduce job. The job could be executed intra-day, but depending on load-strategy it could still potentially be out of sync with the main data table. See for more information.
@@ -868,7 +939,7 @@ System.out.println("md5 digest as string Dual-Write Secondary Index - Another strategy is to build the secondary index while publishing data to the cluster (e.g., write to data table, write to index table). + Another strategy is to build the secondary index while publishing data to the cluster (e.g., write to data table, write to index table). If this is approach is taken after a data table already exists, then bootstrapping will be needed for the secondary index with a MapReduce job (see ).
@@ -888,12 +959,12 @@ System.out.println("md5 digest as string
Schema Design Smackdown - This section will describe common schema design questions that appear on the dist-list. These are - general guidelines and not laws - each application must consider it's own needs. + This section will describe common schema design questions that appear on the dist-list. These are + general guidelines and not laws - each application must consider its own needs.
Rows vs. Versions A common question is whether one should prefer rows or HBase's built-in-versioning. The context is typically where there are - "a lot" of versions of a row to be retained (e.g., where it is significantly above the HBase default of 3 max versions). The + "a lot" of versions of a row to be retained (e.g., where it is significantly above the HBase default of 3 max versions). The rows-approach would require storing a timstamp in some portion of the rowkey so that they would not overwite with each successive update. Preference: Rows (generally speaking). @@ -901,18 +972,29 @@ System.out.println("md5 digest as string
Rows vs. Columns Another common question is whether one should prefer rows or columns. The context is typically in extreme cases of wide - tables, such as having 1 row with 1 million attributes, or 1 million rows with 1 columns apiece. + tables, such as having 1 row with 1 million attributes, or 1 million rows with 1 columns apiece. + + Preference: Rows (generally speaking). To be clear, this guideline is in the context is in extremely wide cases, not in the + standard use-case where one needs to store a few dozen or hundred columns. But there is also a middle path between these two + options, and that is "Rows as Columns." - Preference: Rows (generally speaking). To be clear, this guideline is in the context is in extremely wide cases, not in the - standard use-case where one needs to store a few dozen or hundred columns. +
+
Rows as Columns + The middle path between Rows vs. Columns is packing data that would be a separate row into columns, for certain rows. + OpenTSDB is the best example of this case, where a single row represents a defined time-range and discrete events are treated as + columns. This approach is often more complex and may require re-writing your data, but it has the + advantage of being I/O efficient. For an overview of this approach, see + Lessons Learned from OpenTSDB + from HBaseCon2012.
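As a sketch of the rows-as-columns layout, in the spirit of OpenTSDB: the metric name, the hour-sized bucket, the family "t" and the htable instance are illustrative assumptions.

long now = System.currentTimeMillis();
long hourBucket = now - (now % 3600000L);      // start of the hour containing the event
int offset = (int) (now - hourBucket);         // milliseconds into that hour

byte[] rowkey = Bytes.add(Bytes.toBytes("metric1-"), Bytes.toBytes(hourBucket));
Put put = new Put(rowkey);
put.add(Bytes.toBytes("t"), Bytes.toBytes(offset), Bytes.toBytes(42L));   // event stored as a column
htable.put(put);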
+
Operational and Performance Configuration Options See the Performance section for more information operational and performance schema design options, such as Bloom Filters, Table-configured regionsizes, compression, and blocksizes. -
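As an illustration of where these options are set, the sketch below configures compression, block size, a row Bloom Filter, and a per-table region size on a new table. The specific values are assumptions rather than recommendations, and admin is an existing HBaseAdmin instance.

HColumnDescriptor cf = new HColumnDescriptor("d");
cf.setCompressionType(Compression.Algorithm.GZ);     // compression for this family's StoreFiles
cf.setBlocksize(64 * 1024);                          // HFile block size in bytes
cf.setBloomFilterType(StoreFile.BloomType.ROW);      // row-level Bloom Filter

HTableDescriptor desc = new HTableDescriptor("myTable");
desc.addFamily(cf);
desc.setMaxFileSize(1024L * 1024L * 1024L);          // region size before splitting, per table
admin.createTable(desc);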
+
Constraints HBase currently supports 'constraints' in traditional (SQL) database parlance. The advised usage for Constraints is in enforcing business rules for attributes in the table (eg. make sure values are in the range 1-10). @@ -942,9 +1024,9 @@ System.out.println("md5 digest as string
Custom Splitters - For those interested in implementing custom splitters, see the method getSplits in + For those interested in implementing custom splitters, see the method getSplits in TableInputFormatBase. - That is where the logic for map-task assignment resides. + That is where the logic for map-task assignment resides.
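A minimal sketch of a custom splitter, assuming the usual route of extending TableInputFormat and adjusting what the stock getSplits returns; the job is then pointed at it with job.setInputFormatClass(MyTableInputFormat.class).

public class MyTableInputFormat extends TableInputFormat {

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    List<InputSplit> splits = super.getSplits(context);
    // Inspect, filter, or re-order the splits here to change map-task assignment.
    return splits;
  }
}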
@@ -959,22 +1041,22 @@ System.out.println("md5 digest as string Configuration config = HBaseConfiguration.create(); Job job = new Job(config, "ExampleRead"); job.setJarByClass(MyReadJob.class); // class that contains mapper - + Scan scan = new Scan(); scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs scan.setCacheBlocks(false); // don't set to true for MR jobs // set other scan attrs ... - + TableMapReduceUtil.initTableMapperJob( tableName, // input HBase table name scan, // Scan instance to control CF and attribute selection MyMapper.class, // mapper - null, // mapper output key + null, // mapper output key null, // mapper output value job); job.setOutputFormatClass(NullOutputFormat.class); // because we aren't emitting anything from mapper - + boolean b = job.waitForCompletion(true); if (!b) { throw new IOException("error with job!"); @@ -987,24 +1069,24 @@ public static class MyMapper extends Tab public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException { // process data for the row from the Result instance. } -} +}
HBase MapReduce Read/Write Example - The following is an example of using HBase both as a source and as a sink with MapReduce. + The following is an example of using HBase both as a source and as a sink with MapReduce. This example will simply copy data from one table to another. Configuration config = HBaseConfiguration.create(); Job job = new Job(config,"ExampleReadWrite"); job.setJarByClass(MyReadWriteJob.class); // class that contains mapper - + Scan scan = new Scan(); scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs scan.setCacheBlocks(false); // don't set to true for MR jobs // set other scan attrs - + TableMapReduceUtil.initTableMapperJob( sourceTable, // input table scan, // Scan instance to control CF and attribute selection @@ -1017,17 +1099,17 @@ TableMapReduceUtil.initTableReducerJob( null, // reducer class job); job.setNumReduceTasks(0); - + boolean b = job.waitForCompletion(true); if (!b) { throw new IOException("error with job!"); } - An explanation is required of what TableMapReduceUtil is doing, especially with the reducer. + An explanation is required of what TableMapReduceUtil is doing, especially with the reducer. TableOutputFormat is being used as the outputFormat class, and several parameters are being set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key to ImmutableBytesWritable and reducer value to Writable. - These could be set by the programmer on the job and conf, but TableMapReduceUtil tries to make things easier. + These could be set by the programmer on the job and conf, but TableMapReduceUtil tries to make things easier. The following is the example mapper, which will create a Put and matching the input Result and emit it. Note: this is what the CopyTable utility does. @@ -1038,7 +1120,7 @@ public static class MyMapper extends Tab // this example is just copying the data from the source table... context.write(row, resultToPut(row,value)); } - + private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException { Put put = new Put(key.get()); for (KeyValue kv : result.raw()) { @@ -1049,9 +1131,9 @@ public static class MyMapper extends Tab } There isn't actually a reducer step, so TableOutputFormat takes care of sending the Put - to the target table. + to the target table. - This is just an example, developers could choose not to use TableOutputFormat and connect to the + This is just an example, developers could choose not to use TableOutputFormat and connect to the target table themselves. @@ -1063,18 +1145,18 @@ public static class MyMapper extends Tab
HBase MapReduce Summary to HBase Example - The following example uses HBase as a MapReduce source and sink with a summarization step. This example will + The following example uses HBase as a MapReduce source and sink with a summarization step. This example will count the number of distinct instances of a value in a table and write those summarized counts in another table. Configuration config = HBaseConfiguration.create(); Job job = new Job(config,"ExampleSummary"); job.setJarByClass(MySummaryJob.class); // class that contains mapper and reducer - + Scan scan = new Scan(); scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs scan.setCacheBlocks(false); // don't set to true for MR jobs // set other scan attrs - + TableMapReduceUtil.initTableMapperJob( sourceTable, // input table scan, // Scan instance to control CF and attribute selection @@ -1087,20 +1169,20 @@ TableMapReduceUtil.initTableReducerJob( MyTableReducer.class, // reducer class job); job.setNumReduceTasks(1); // at least one, adjust as required - + boolean b = job.waitForCompletion(true); if (!b) { throw new IOException("error with job!"); -} +} - In this example mapper a column with a String-value is chosen as the value to summarize upon. + In this example mapper a column with a String-value is chosen as the value to summarize upon. This value is used as the key to emit from the mapper, and an IntWritable represents an instance counter. public static class MyMapper extends TableMapper<Text, IntWritable> { private final IntWritable ONE = new IntWritable(1); private Text text = new Text(); - + public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException { String val = new String(value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr1"))); text.set(val); // we can only emit Writables... @@ -1112,7 +1194,7 @@ public static class MyMapper extends Tab In the reducer, the "ones" are counted (just like any other MR example that does this), and then emits a Put. public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> { - + public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int i = 0; for (IntWritable val : values) { @@ -1131,17 +1213,17 @@ public static class MyTableReducer exten HBase MapReduce Summary to File Example This very similar to the summary example above, with exception that this is using HBase as a MapReduce source but HDFS as the sink. The differences are in the job setup and in the reducer. The mapper remains the same. 
- + Configuration config = HBaseConfiguration.create(); Job job = new Job(config,"ExampleSummaryToFile"); job.setJarByClass(MySummaryFileJob.class); // class that contains mapper and reducer - + Scan scan = new Scan(); scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs scan.setCacheBlocks(false); // don't set to true for MR jobs // set other scan attrs - + TableMapReduceUtil.initTableMapperJob( sourceTable, // input table scan, // Scan instance to control CF and attribute selection @@ -1152,22 +1234,22 @@ TableMapReduceUtil.initTableMapperJob( job.setReducerClass(MyReducer.class); // reducer class job.setNumReduceTasks(1); // at least one, adjust as required FileOutputFormat.setOutputPath(job, new Path("/tmp/mr/mySummaryFile")); // adjust directories as required - + boolean b = job.waitForCompletion(true); if (!b) { throw new IOException("error with job!"); -} +} - As stated above, the previous Mapper can run unchanged with this example. + As stated above, the previous Mapper can run unchanged with this example. As for the Reducer, it is a "generic" Reducer instead of extending TableMapper and emitting Puts. public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> { - + public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int i = 0; for (IntWritable val : values) { i += val.get(); - } + } context.write(key, new IntWritable(i)); } } @@ -1176,11 +1258,11 @@ if (!b) {
HBase MapReduce Summary to HBase Without Reducer It is also possible to perform summaries without a reducer - if you use HBase as the reducer. - + An HBase target table would need to exist for the job summary. The HTable method incrementColumnValue - would be used to atomically increment values. From a performance perspective, it might make sense to keep a Map + would be used to atomically increment values. From a performance perspective, it might make sense to keep a Map of values with their values to be incremeneted for each map-task, and make one update per key at during the - cleanup method of the mapper. However, your milage may vary depending on the number of rows to be processed and + cleanup method of the mapper. However, your milage may vary depending on the number of rows to be processed and unique keys. In the end, the summary results are in HBase. @@ -1192,41 +1274,41 @@ if (!b) { to generate summaries directly to an RDBMS via a custom reducer. The setup method can connect to an RDBMS (the connection information can be passed via custom parameters in the context) and the cleanup method can close the connection. - + It is critical to understand that number of reducers for the job affects the summarization implementation, and you'll have to design this into your reducer. Specifically, whether it is designed to run as a singleton (one reducer) or multiple reducers. Neither is right or wrong, it depends on your use-case. Recognize that the more reducers that - are assigned to the job, the more simultaneous connections to the RDBMS will be created - this will scale, but only to a point. + are assigned to the job, the more simultaneous connections to the RDBMS will be created - this will scale, but only to a point. public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private Connection c = null; - + public void setup(Context context) { // create DB connection... } - + public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { // do summarization // in this example the keys are Text, but this is just an example } - + public void cleanup(Context context) { // close db connection } - + } In the end, the summary results are written to your RDBMS table/s.
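Returning to the reducerless summary pattern at the start of this subsection, the following is a sketch of a mapper that accumulates counts in a Map and flushes one increment per distinct key from cleanup(); the summary table, family and qualifier names are illustrative assumptions.

public static class MySummaryMapper extends TableMapper<Text, IntWritable> {

  private HTable summaryTable;
  private Map<String, Long> counts = new HashMap<String, Long>();

  public void setup(Context context) throws IOException {
    summaryTable = new HTable(HBaseConfiguration.create(), "summaryTable");  // hypothetical target table
  }

  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    String val = new String(value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr1")));
    Long current = counts.get(val);
    counts.put(val, current == null ? 1L : current + 1L);   // accumulate in memory per map task
  }

  public void cleanup(Context context) throws IOException {
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      summaryTable.incrementColumnValue(Bytes.toBytes(e.getKey()),
          Bytes.toBytes("cf"), Bytes.toBytes("count"), e.getValue());   // one update per distinct key
    }
    summaryTable.close();
  }
}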
- +
Accessing Other HBase Tables in a MapReduce Job Although the framework currently allows one HBase table as input to a - MapReduce job, other HBase tables can + MapReduce job, other HBase tables can be accessed as lookup tables, etc., in a MapReduce job via creating an HTable instance in the setup method of the Mapper. public class MyMapper extends TableMapper<Text, LongWritable> { @@ -1235,12 +1317,12 @@ if (!b) { public void setup(Context context) { myOtherTable = new HTable("myOtherTable"); } - + public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException { // process Result... // use 'myOtherTable' for lookups } - +
@@ -1253,10 +1335,13 @@ if (!b) { map-tasks which will double-write your data to HBase; this is probably not what you want. + See for more information. +
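One way to switch speculative execution off for a particular job, sketched against the same kind of Job object used in the examples above:

Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setSpeculativeExecution(false);          // covers both phases
// or per phase:
job.setMapSpeculativeExecution(false);
job.setReduceSpeculativeExecution(false);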
- + + Architecture
@@ -1264,24 +1349,24 @@ if (!b) {
NoSQL? HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isn't an RDBMS which - supports SQL as it's primary access language, but there are many types of NoSQL databases: BerkeleyDB is an + supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a "Data Store" than "Data Base" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages, etc. However, HBase has many features which supports both linear and modular scaling. HBase clusters expand - by adding RegionServers that are hosted on commodity class servers. If a cluster expands from 10 to 20 + by adding RegionServers that are hosted on commodity class servers. If a cluster expands from 10 to 20 RegionServers, for example, it doubles both in terms of storage and as well as processing capacity. RDBMS can scale well, but only up to a point - specifically, the size of a single database server - and for the best performance requires specialized hardware and storage devices. HBase features of note are: - Strongly consistent reads/writes: HBase is not an "eventually consistent" DataStore. This + Strongly consistent reads/writes: HBase is not an "eventually consistent" DataStore. This makes it very suitable for tasks such as high-speed counter aggregation. Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows. Automatic RegionServer failover - Hadoop/HDFS Integration: HBase supports HDFS out of the box as it's distributed file system. - MapReduce: HBase supports massively parallelized processing via MapReduce for using HBase as both + Hadoop/HDFS Integration: HBase supports HDFS out of the box as its distributed file system. + MapReduce: HBase supports massively parallelized processing via MapReduce for using HBase as both source and sink. Java Client API: HBase supports an easy to use Java API for programmatic access. Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends. @@ -1289,12 +1374,12 @@ if (!b) { Operational Management: HBase provides build-in web-pages for operational insight as well as JMX metrics. -
- +
+
When Should I Use HBase? HBase isn't suitable for every problem. - First, make sure you have enough data. If you have hundreds of millions or billions of rows, then + First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS might be a better choice due to the fact that all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle. @@ -1302,7 +1387,7 @@ if (!b) { Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.) An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a - complete redesign as opposed to a port. + complete redesign as opposed to a port. Third, make sure you have enough hardware. Even HDFS doesn't do well with anything less than 5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode. @@ -1313,9 +1398,9 @@ if (!b) {
What Is The Difference Between HBase and Hadoop/HDFS? - HDFS is a distributed file system that is well suited for the storage of large files. - It's documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files. - HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. + HDFS is a distributed file system that is well suited for the storage of large files. + It's documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files. + HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups. See the and the rest of this chapter for more information on how HBase achieves its goals. @@ -1324,19 +1409,19 @@ if (!b) {
Catalog Tables - The catalog tables -ROOT- and .META. exist as HBase tables. They are are filtered out + The catalog tables -ROOT- and .META. exist as HBase tables. They are filtered out of the HBase shell's list command, but they are in fact tables just like any other.
ROOT - -ROOT- keeps track of where the .META. table is. The -ROOT- table structure is as follows: + -ROOT- keeps track of where the .META. table is. The -ROOT- table structure is as follows: - Key: + Key: .META. region key (.META.,,1) - Values: + Values: info:regioninfo (serialized HRegionInfo instance of .META.) @@ -1347,14 +1432,14 @@ if (!b) {
META - The .META. table keeps a list of all regions in the system. The .META. table structure is as follows: + The .META. table keeps a list of all regions in the system. The .META. table structure is as follows: - Key: + Key: Region key of the format ([table],[region start key],[region id]) - Values: + Values: info:regioninfo (serialized HRegionInfo instance for this region) @@ -1363,12 +1448,12 @@ if (!b) { info:serverstartcode (start-time of the RegionServer process containing this region) - When a table is in the process of splitting two other columns will be created, info:splitA and info:splitB + When a table is in the process of splitting two other columns will be created, info:splitA and info:splitB which represent the two daughter regions. The values for these columns are also serialized HRegionInfo instances. After the region has been split eventually this row will be deleted. Notes on HRegionInfo: the empty key is used to denote table start and table end. A region with an empty start key - is the first region in a table. If region has both an empty start and an empty end key, its the only region in the table + is the first region in a table. If region has both an empty start and an empty end key, it's the only region in the table In the (hopefully unlikely) event that programmatic processing of catalog metadata is required, see the Writables utility. @@ -1380,9 +1465,9 @@ if (!b) { For information on region-RegionServer assignment, see . -
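Relating to the note above on programmatic processing of catalog metadata, the following read-only sketch scans .META. and deserializes each region's HRegionInfo with the Writables utility; this is rarely needed in practice.

HTable meta = new HTable(HBaseConfiguration.create(), ".META.");
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("info"), Bytes.toBytes("regioninfo"));
ResultScanner rs = meta.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    HRegionInfo hri = Writables.getHRegionInfo(
        r.getValue(Bytes.toBytes("info"), Bytes.toBytes("regioninfo")));
    System.out.println(hri.getRegionNameAsString());
  }
} finally {
  rs.close();
}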
+
- +
Client The HBase client @@ -1398,7 +1483,7 @@ if (!b) { need not go through the lookup process. Should a region be reassigned either by the master load balancer or because a RegionServer has died, the client will requery the catalog tables to determine the new - location of the user region. + location of the user region. See for more information about the impact of the Master on HBase Client communication. @@ -1406,10 +1491,11 @@ if (!b) { Administrative functions are handled through HBaseAdmin
Connections - For connection configuration information, see . + For connection configuration information, see . - HTable -instances are not thread-safe. When creating HTable instances, it is advisable to use the same HBaseConfiguration + HTable + instances are not thread-safe. Only one thread use an instance of HTable at any given + time. When creating HTable instances, it is advisable to use the same HBaseConfiguration instance. This will ensure sharing of ZooKeeper and socket instances to the RegionServers which is usually what you want. For example, this is preferred: HBaseConfiguration conf = HBaseConfiguration.create(); @@ -1425,7 +1511,16 @@ HTable table2 = new HTable(conf2, "myTab
Connection Pooling For applications which require high-end multithreaded access (e.g., web-servers or application servers that may serve many application threads - in a single JVM), see HTablePool. + in a single JVM), one solution is HTablePool. + But as written currently, it is difficult to control client resource consumption when using HTablePool. + + + Another solution is to precreate an HConnection using + HConnectionManager.createConnection(Configuration) as + well as an ExecutorService; then use the + HTable(byte[], HConnection, ExecutorService) + constructor to create HTable instances on demand. + This construction is very lightweight and resources are controlled/shared if you go this route.
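A sketch of the HConnection/ExecutorService route just described; the table name, pool size and column names are illustrative assumptions.

HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create());
ExecutorService pool = Executors.newFixedThreadPool(10);    // sized to the application's needs

// Each application thread can cheaply create and close its own HTable against the shared connection.
HTable table = new HTable(Bytes.toBytes("myTable"), connection, pool);
try {
  Put put = new Put(Bytes.toBytes("row1"));
  put.add(Bytes.toBytes("d"), Bytes.toBytes("attr1"), Bytes.toBytes("value"));
  table.put(put);
} finally {
  table.close();
}

// When the application shuts down:
pool.shutdown();
connection.close();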
@@ -1436,9 +1531,9 @@ HTable table2 = new HTable(conf2, "myTab is filled. The writebuffer is 2MB by default. Before an HTable instance is discarded, either close() or flushCommits() should be invoked so Puts - will not be lost. -
- Note: htable.delete(Delete); does not go in the writebuffer! This only applies to Puts. + will not be lost. + + Note: htable.delete(Delete); does not go in the writebuffer! This only applies to Puts. For additional information on write durability, review the ACID semantics page. @@ -1456,15 +1551,15 @@ HTable table2 = new HTable(conf2, "myTab in the client API however they are discouraged because if not managed properly these can lock up the RegionServers. - There is an oustanding ticket HBASE-2332 to + There is an oustanding ticket HBASE-2332 to remove this feature from the client.
- +
Client Request Filters Get and Scan instances can be - optionally configured with filters which are applied on the RegionServer. + optionally configured with filters which are applied on the RegionServer. Filters can be confusing because there are many different types, and it is best to approach them by understanding the groups of Filter functionality. @@ -1473,8 +1568,8 @@ HTable table2 = new HTable(conf2, "myTab Structural Filters contain other Filters.
FilterList FilterList - represents a list of Filters with a relationship of FilterList.Operator.MUST_PASS_ALL or - FilterList.Operator.MUST_PASS_ONE between the Filters. The following example shows an 'or' between two + represents a list of Filters with a relationship of FilterList.Operator.MUST_PASS_ALL or + FilterList.Operator.MUST_PASS_ONE between the Filters. The following example shows an 'or' between two Filters (checking for either 'my value' or 'my other value' on the same attribute). FilterList list = new FilterList(FilterList.Operator.MUST_PASS_ONE); @@ -1521,7 +1616,7 @@ scan.setFilter(filter);
RegexStringComparator RegexStringComparator - supports regular expressions for value comparisons. + supports regular expressions for value comparisons. RegexStringComparator comp = new RegexStringComparator("my."); // any value that starts with 'my' SingleColumnValueFilter filter = new SingleColumnValueFilter( @@ -1532,7 +1627,7 @@ SingleColumnValueFilter filter = new Sin ); scan.setFilter(filter); - See the Oracle JavaDoc for supported RegEx patterns in Java. + See the Oracle JavaDoc for supported RegEx patterns in Java.
SubstringComparator @@ -1663,36 +1758,40 @@ rs.close();
RowKey
RowFilter - It is generally a better idea to use the startRow/stopRow methods on Scan for row selection, however + It is generally a better idea to use the startRow/stopRow methods on Scan for row selection, however RowFilter can also be used.
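For example, a range selection with startRow/stopRow rather than a RowFilter; the row values and the htable instance are illustrative assumptions.

Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("row100"));   // inclusive
scan.setStopRow(Bytes.toBytes("row200"));    // exclusive
ResultScanner rs = htable.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // process rows in [row100, row200)
  }
} finally {
  rs.close();
}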
Utility
FirstKeyOnlyFilter - This is primarily used for rowcount jobs. + This is primarily used for rowcount jobs. See FirstKeyOnlyFilter.
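A sketch of a client-side rowcount using this filter, assuming htable is an existing HTable instance; only the first KeyValue of each row is returned to the client.

Scan scan = new Scan();
scan.setFilter(new FirstKeyOnlyFilter());
ResultScanner rs = htable.getScanner(scan);
long rowCount = 0;
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    rowCount++;                              // only row presence matters here
  }
} finally {
  rs.close();
}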
- +
Master HMaster is the implementation of the Master Server. The Master server is responsible for monitoring all RegionServer instances in the cluster, and is - the interface for all metadata changes. In a distributed cluster, the Master typically runs on the . + the interface for all metadata changes. In a distributed cluster, the Master typically runs on the + J Mohamed Zahoor goes into some more detail on the Master Architecture in this blog posting, HBase HMaster Architecture + . +
Startup Behavior If run in a multi-Master environment, all Masters compete to run the cluster. If the active - Master loses it's lease in ZooKeeper (or the Master shuts down), then then the remaining Masters jostle to + Master loses its lease in ZooKeeper (or the Master shuts down), then then the remaining Masters jostle to take over the Master role.
Runtime Impact A common dist-list question is what happens to an HBase cluster when the Master goes down. Because the - HBase client talks directly to the RegionServers, the cluster can still function in a "steady + HBase client talks directly to the RegionServers, the cluster can still function in a "steady state." Additionally, per ROOT and META exist as HBase tables (i.e., are - not resident in the Master). However, the Master controls critical functions such as RegionServer failover and - completing region splits. So while the cluster can still run for a time without the Master, - the Master should be restarted as soon as possible. + not resident in the Master). However, the Master controls critical functions such as RegionServer failover and + completing region splits. So while the cluster can still run for a time without the Master, + the Master should be restarted as soon as possible.
Interface @@ -1700,20 +1799,20 @@ rs.close(); Table (createTable, modifyTable, removeTable, enable, disable) - ColumnFamily (addColumn, modifyColumn, removeColumn) + ColumnFamily (addColumn, modifyColumn, removeColumn) Region (move, assign, unassign) - For example, when the HBaseAdmin method disableTable is invoked, it is serviced by the Master server. + For example, when the HBaseAdmin method disableTable is invoked, it is serviced by the Master server.
Processes The Master runs several background threads:
LoadBalancer - Periodically, and when there are not any regions in transition, - a load balancer will run and move regions around to balance cluster load. + Periodically, and when there are no regions in transition, + a load balancer will run and move regions around to balance the cluster's load. See for configuring this property. See for more information on region assignment. @@ -1726,18 +1825,18 @@ rs.close();
RegionServer HRegionServer is the RegionServer implementation. It is responsible for serving and managing regions. - In a distributed cluster, a RegionServer runs on a . + In a distributed cluster, a RegionServer runs on a .
Interface The methods exposed by HRegionRegionInterface contain both data-oriented and region-maintenance methods: Data (get, put, delete, next, etc.) - Region (splitRegion, compactRegion, etc.) + Region (splitRegion, compactRegion, etc.) For example, when the HBaseAdmin method majorCompact is invoked on a table, the client is actually iterating through - all regions for the specified table and requesting a major compaction directly to each region. + all regions for the specified table and requesting a major compaction directly to each region.
Processes @@ -1761,7 +1860,7 @@ rs.close(); posted. Documentation will eventually move to this reference guide, but the blog is the most current information available at this time.
- +
Block Cache
@@ -1849,9 +1948,9 @@ rs.close(); Purpose Each RegionServer adds updates (Puts, Deletes) to its write-ahead log (WAL) - first, and then to the for the affected . - This ensures that HBase has durable writes. Without WAL, there is the possibility of data loss in the case of a RegionServer failure - before each MemStore is flushed and new StoreFiles are written. HLog + first, and then to the for the affected . + This ensures that HBase has durable writes. Without WAL, there is the possibility of data loss in the case of a RegionServer failure + before each MemStore is flushed and new StoreFiles are written. HLog is the HBase WAL implementation, and there is one HLog instance per RegionServer. The WAL is in HDFS in /hbase/.logs/ with subdirectories per region. @@ -1875,11 +1974,11 @@ rs.close();
<varname>hbase.hlog.split.skip.errors</varname> - When set to true, the default, any error + When set to true, any error encountered splitting will be logged, the problematic WAL will be moved into the .corrupt directory under the hbase rootdir, and processing will continue. If set to - false, the exception will be propagated and the + false, the default, the exception will be propagated and the split logged as failed. See HBASE-2958 @@ -1912,10 +2011,10 @@ rs.close();
Regions Regions are the basic element of availability and - distribution for tables, and are comprised of a Store per Column Family. The heirarchy of objects + distribution for tables, and are comprised of a Store per Column Family. The heirarchy of objects is as follows: -Table (HBase table) +Table (HBase table) Region (Regions for the table) Store (Store per ColumnFamily for each Region for the table) MemStore (MemStore for each Store for each Region for the table) @@ -1924,7 +2023,7 @@ rs.close(); For a description of what HBase files look like when written to HDFS, see . - +
Region Size @@ -1936,13 +2035,13 @@ rs.close(); HBase scales by having regions across many servers. Thus if you have 2 regions for 16GB data, on a 20 node machine your data will be concentrated on just a few machines - nearly the entire - cluster will be idle. This really cant be stressed enough, since a - common problem is loading 200MB data into HBase then wondering why + cluster will be idle. This really cant be stressed enough, since a + common problem is loading 200MB data into HBase then wondering why your awesome 10 node cluster isn't doing anything. - On the other hand, high region count has been known to make things slow. + On the other hand, high region count has been known to make things slow. This is getting better with each release of HBase, but it is probably better to have 700 regions than 3000 for the same amount of data. @@ -1953,7 +2052,7 @@ rs.close(); - When starting off, its probably best to stick to the default region-size, perhaps going + When starting off, it's probably best to stick to the default region-size, perhaps going smaller for hot tables (or manually split hot regions to spread the load over the cluster), or go with larger region sizes if your cell sizes tend to be largish (100k and up). @@ -1977,10 +2076,10 @@ rs.close(); If the region assignment is still valid (i.e., if the RegionServer is still online) then the assignment is kept. - If the assignment is invalid, then the LoadBalancerFactory is invoked to assign the + If the assignment is invalid, then the LoadBalancerFactory is invoked to assign the region. The DefaultLoadBalancer will randomly assign the region to a RegionServer. - META is updated with the RegionServer assignment (if needed) and the RegionServer start codes + META is updated with the RegionServer assignment (if needed) and the RegionServer start codes (start time of the RegionServer process) upon region opening by the RegionServer. @@ -1996,7 +2095,7 @@ rs.close(); The Master will detect that the RegionServer has failed. The region assignments will be considered invalid and will be re-assigned just - like the startup sequence. + like the startup sequence. @@ -2023,14 +2122,14 @@ rs.close(); Third replica is written to a node in another rack (if sufficient nodes) - Thus, HBase eventually achieves locality for a region after a flush or a compaction. [... 884 lines stripped ...]