From: busbey@apache.org
To: commits@hbase.apache.org
Date: Tue, 14 Jul 2015 02:49:31 -0000
Subject: [06/15] hbase git commit: HBASE-14066 clean out old docbook docs from branch-1.

http://git-wip-us.apache.org/repos/asf/hbase/blob/fdd2692f/src/main/docbkx/rpc.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/rpc.xml b/src/main/docbkx/rpc.xml
deleted file mode 100644
index 2e5dd5f..0000000
--- a/src/main/docbkx/rpc.xml
+++ /dev/null
@@ -1,301 +0,0 @@

0.95 RPC Specification
In 0.95, all client/server communication is done with protobuf'ed Messages rather than with Hadoop Writables. Our RPC wire format therefore changes. This document describes the client/server request/response protocol and our new RPC wire format.
For what RPC is like in 0.94 and earlier, see Benoît/Tsuna's Unofficial Hadoop / HBase RPC protocol documentation. For more background on how we arrived at this spec, see HBase RPC: WIP.
Goals

- A wire-format we can evolve.
- A format that does not require rewriting the server core or radically changing its current architecture (for later).
-
TODO

- List of problems with the currently specified format and where we would like to go in a version 2, etc. For example, what would we have to change, if anything, to move the server async or to support streaming/chunking?
- Diagram of how it works.
- A grammar that succinctly describes the wire-format. Currently we have these words and the content of the rpc protobuf idl, but a grammar for the back and forth would help with grokking rpc. Also, a little state machine on client/server interactions would help with understanding (and ensuring correct implementation).
-
RPC
The client sends setup information on connection establishment. Thereafter, the client invokes methods against the remote server, sending a protobuf Message and receiving a protobuf Message in response. Communication is synchronous. All back and forth is preceded by an int that has the total length of the request/response. Optionally, Cells (KeyValues) can be passed outside of protobufs in follow-behind Cell blocks (because we can't protobuf megabytes of KeyValues or Cells). These CellBlocks are encoded and optionally compressed.
For more detail on the protobufs involved, see the RPC.proto file in trunk.
- Connection Setup - Client initiates connection. -
- Client - On connection setup, client sends a preamble followed by a connection header. - -
<preamble>
<MAGIC 4 byte integer> <1 byte RPC Format Version> <1 byte auth type>
We need the auth method spec here so the connection header is encoded appropriately if auth is enabled. E.g.: 'HBas' 0x00 0x50 -- 4 bytes of MAGIC -- 'HBas' -- plus one byte of version, 0 in this case, and one byte, 0x50 (SIMPLE), for the auth type.
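A minimal sketch of emitting the preamble, assuming a plain java.net.Socket and DataOutputStream (serverHost and serverPort are placeholders; this is not the actual HBase client code):

import java.io.DataOutputStream;
import java.net.Socket;

Socket socket = new Socket(serverHost, serverPort);
DataOutputStream out = new DataOutputStream(socket.getOutputStream());
out.write(new byte[] {'H', 'B', 'a', 's'});  // 4 bytes of MAGIC -- 'HBas'
out.writeByte(0);                            // 1 byte RPC format version
out.writeByte(0x50);                         // 1 byte auth type; 0x50 is the SIMPLE example above
out.flush();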
- -
<Protobuf ConnectionHeader Message>
Has user info and "protocol", as well as the encoders and compression the client will use when sending CellBlocks. CellBlock encoders and compressors are for the life of the connection. CellBlock encoders implement org.apache.hadoop.hbase.codec.Codec. CellBlocks may then also be compressed. Compressors implement org.apache.hadoop.io.compress.CompressionCodec. This protobuf is written using writeDelimited, so it is prefaced by a pb varint with its serialized length.
-
- - -
Server
After the client sends the preamble and connection header, the server does NOT respond if connection setup succeeds. No response means the server is READY to accept requests and to give out responses. If the version or authentication in the preamble is not agreeable or the server has trouble parsing the preamble, it will throw an org.apache.hadoop.hbase.ipc.FatalConnectionException explaining the error and will then disconnect. If the client in the connection header -- i.e. the protobuf'd Message that comes after the connection preamble -- asks for a Service the server does not support or a codec the server does not have, again we throw a FatalConnectionException with an explanation.
-
- -
Request
After a Connection has been set up, the client makes requests and the server responds. A request is made up of a protobuf RequestHeader followed by a protobuf Message parameter. The header includes the method name and, optionally, metadata on the optional CellBlock that may follow. The parameter type suits the method being invoked: i.e. if we are doing a getRegionInfo request, the protobuf Message param will be an instance of GetRegionInfoRequest. The response will be a GetRegionInfoResponse. The CellBlock is optionally used for ferrying the bulk of the RPC data: i.e. Cells/KeyValues.
- Request Parts -
- <Total Length> - The request is prefaced by an int that holds the total length of what - follows. -
-
<Protobuf RequestHeader Message>
Will have call.id, trace.id, method name, etc., including optional metadata on the CellBlock IFF one is following. Data is protobuf'd inline in this pb Message or optionally comes in the following CellBlock.
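As a sketch only (assuming the generated RPCProtos classes and protobuf's delimited encoding; the call id and method name are placeholders, and this is not the actual client implementation), a header could be built and written like so:

import java.io.ByteArrayOutputStream;
import org.apache.hadoop.hbase.protobuf.generated.RPCProtos.RequestHeader;

RequestHeader header = RequestHeader.newBuilder()
    .setCallId(1)                    // client-chosen id, echoed back in the ResponseHeader
    .setMethodName("GetRegionInfo")  // method on the requested Service
    .setRequestParam(true)           // a param Message follows this header
    .build();
ByteArrayOutputStream buf = new ByteArrayOutputStream();
// writeDelimitedTo prefixes the serialized Message with a pb varint holding its length.
header.writeDelimitedTo(buf);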
-
<Protobuf Param Message>
If the method being invoked is getRegionInfo, then, per the Service descriptor for the client-to-regionserver protocol, the request sends a GetRegionInfoRequest protobuf Message param in this position.
-
- <CellBlock> - An encoded and optionally compressed Cell block. -
-
- -
- - -
Response
Same as the Request: a protobuf ResponseHeader followed by a protobuf Message response, where the Message response type suits the method invoked. The bulk of the data may come in a following CellBlock.
- Response Parts -
- <Total Length> - The response is prefaced by an int that holds the total length of what - follows. -
-
<Protobuf ResponseHeader Message>
Will have call.id, etc. Will include the exception if processing failed. Optionally includes metadata on the CellBlock, IFF one is following.
- -
<Protobuf Response Message>
The return value, or nothing if an exception was thrown. If the method being invoked is getRegionInfo, then, per the Service descriptor for the client-to-regionserver protocol, the response sends a GetRegionInfoResponse protobuf Message in this position.
-
- <CellBlock> - An encoded and optionally compressed Cell block. -
-
- -
- - -
Exceptions
There are two distinct types. The first is a request failure, which is encapsulated inside the response header for the response; the connection stays open to receive new requests. The second type, the FatalConnectionException, kills the connection.
Exceptions can carry extra information. See the ExceptionResponse protobuf type. It has a flag to indicate do-no-retry as well as other miscellaneous payload to help improve client responsiveness.
-
CellBlocks
These are not versioned. The server either can do the codec or it cannot. If a new version of a codec is needed, with, say, tighter encoding, give it a new class name. Codecs will live on the server for all time so old clients can connect.
-
- - -
- Notes -
Constraints
In some part, the current wire-format -- i.e. all requests and responses preceded by a length -- has been dictated by the current, non-async server architecture.
-
One fat pb request or header+param
We went with a pb header followed by a pb param making a request, and a pb header followed by a pb response, for now. Doing header+param rather than a single protobuf Message with both header and param content:

- Is closer to what we currently have.
- Having a single fat pb requires extra copying, putting the already pb'd param into the body of the fat request pb (and the same when making the result).
- We can decide whether to accept the request or not before we read the param; for example, the request might be low priority. As is, we read header+param in one go as the server is currently implemented, so this is a TODO.

The advantages are minor. If, later, the fat request shows a clear advantage, we can roll out a v2.
-
- RPC Configurations -
CellBlock Codecs
To enable a codec other than the default KeyValueCodec, set hbase.client.rpc.codec to the name of the Codec class to use. The Codec must implement HBase's Codec interface. After connection setup, all passed cellblocks will be sent with this codec. The server will return cellblocks using this same codec as long as the codec is on the server's CLASSPATH (else you will get UnsupportedCellCodecException).
To change the default codec, set hbase.client.default.rpc.codec.
To disable cellblocks completely and go pure protobuf, set the default to the empty String and do not specify a codec in your Configuration. That is, set hbase.client.default.rpc.codec to the empty string and do not set hbase.client.rpc.codec. This will cause the client to connect to the server with no codec specified. If a server sees no codec, it will return all responses in pure protobuf. Running pure protobuf all the time will be slower than running with cellblocks.
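For example (a sketch; the codec class named here ships with recent HBase versions and must be present on both client and server, but any implementation of the Codec interface could be substituted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

Configuration conf = HBaseConfiguration.create();
// Request a non-default cellblock codec for this client's connections.
conf.set("hbase.client.rpc.codec", "org.apache.hadoop.hbase.codec.KeyValueCodecWithTags");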
-
Compression
Uses Hadoop's compression codecs. To enable compression of passed CellBlocks, set hbase.client.rpc.compressor to the name of the Compressor to use. The Compressor must implement Hadoop's CompressionCodec interface. After connection setup, all passed cellblocks will be sent compressed. The server will return cellblocks compressed using this same compressor as long as the compressor is on its CLASSPATH (else you will get UnsupportedCompressionCodecException).
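For example (a sketch; GzipCodec is used purely for illustration, and any Hadoop CompressionCodec available on both classpaths would do):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

Configuration conf = HBaseConfiguration.create();
// Compress cellblocks passed on this client's connections.
conf.set("hbase.client.rpc.compressor", "org.apache.hadoop.io.compress.GzipCodec");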
-
-
-
http://git-wip-us.apache.org/repos/asf/hbase/blob/fdd2692f/src/main/docbkx/schema_design.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/schema_design.xml b/src/main/docbkx/schema_design.xml
deleted file mode 100644
index 65e64b0..0000000
--- a/src/main/docbkx/schema_design.xml
+++ /dev/null
@@ -1,1247 +0,0 @@

HBase and Schema Design
A good general introduction to the strengths and weaknesses of modelling on the various non-RDBMS datastores is Ian Varley's Master's thesis, No Relation: The Mixed Blessings of Non-Relational Databases. Recommended. Also, read the section on how HBase stores data internally.
Schema Creation
HBase schemas can be created or updated with the HBase Shell or by using HBaseAdmin in the Java API.
Tables must be disabled when making ColumnFamily modifications, for example:

Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config);
String table = "myTable";

admin.disableTable(table);

HColumnDescriptor cf1 = ...;
admin.addColumn(table, cf1);      // adding new ColumnFamily
HColumnDescriptor cf2 = ...;
admin.modifyColumn(table, cf2);   // modifying existing ColumnFamily

admin.enableTable(table);

See the client configuration section for more information about configuring client connections.
Note: online schema changes are supported in the 0.92.x codebase, but the 0.90.x codebase requires the table to be disabled.
- Schema Updates - When changes are made to either Tables or ColumnFamilies (e.g., region size, block - size), these changes take effect the next time there is a major compaction and the - StoreFiles get re-written. - See for more information on StoreFiles. -
-
-
On the number of column families
HBase currently does not do well with anything above two or three column families, so keep the number of column families in your schema low. Currently, flushing and compactions are done on a per-Region basis, so if one column family is carrying the bulk of the data and bringing on flushes, the adjacent families will also be flushed even though the amount of data they carry is small. When there are many column families, the flushing and compaction interaction can make for a bunch of needless I/O (to be addressed by changing flushing and compaction to work on a per-column-family basis). For more information on compactions, see the compactions section.
Try to make do with one column family if you can in your schemas. Only introduce a second and third column family in the case where data access is usually column scoped; i.e. you query one column family or the other, but usually not both at the same time.
- Cardinality of ColumnFamilies - Where multiple ColumnFamilies exist in a single table, be aware of the cardinality - (i.e., number of rows). If ColumnFamilyA has 1 million rows and ColumnFamilyB has 1 billion - rows, ColumnFamilyA's data will likely be spread across many, many regions (and - RegionServers). This makes mass scans for ColumnFamilyA less efficient. -
-
-
- Rowkey Design -
Hotspotting
Rows in HBase are sorted lexicographically by row key. This design optimizes for scans, allowing you to store related rows, or rows that will be read together, near each other. However, poorly designed row keys are a common source of hotspotting. Hotspotting occurs when a large amount of client traffic is directed at one node, or only a few nodes, of a cluster. This traffic may represent reads, writes, or other operations. The traffic overwhelms the single machine responsible for hosting that region, causing performance degradation and potentially leading to region unavailability. This can also have adverse effects on other regions hosted by the same region server, as that host is unable to service the requested load. It is important to design data access patterns such that the cluster is fully and evenly utilized.
To prevent hotspotting on writes, design your row keys such that rows that truly do need to be in the same region are, but in the bigger picture, data is being written to multiple regions across the cluster, rather than one at a time. Some common techniques for avoiding hotspotting are described below, along with some of their advantages and drawbacks; a short code sketch contrasting salting and hashing appears at the end of this section.

Salting
Salting in this sense has nothing to do with cryptography, but refers to adding random data to the start of a row key. In this case, salting refers to adding a randomly-assigned prefix to the row key to cause it to sort differently than it otherwise would. The number of possible prefixes corresponds to the number of regions you want to spread the data across. Salting can be helpful if you have a few "hot" row key patterns which come up over and over amongst other more evenly-distributed rows. Consider the following example, which shows that salting can spread write load across multiple regionservers, and illustrates some of the negative implications for reads.

Salting Example
Suppose you have the following list of row keys, and your table is split such that there is one region for each letter of the alphabet. Prefix 'a' is one region, prefix 'b' is another. In this table, all rows starting with 'f' are in the same region. This example focuses on rows with keys like the following:

foo0001
foo0002
foo0003
foo0004

Now, imagine that you would like to spread these across four different regions. You decide to use four different salts: a, b, c, and d. In this scenario, each of these letter prefixes will be on a different region. After applying the salts, you have the following rowkeys instead. Since you can now write to four separate regions, you theoretically have four times the throughput when writing that you would have if all the writes were going to the same region.

a-foo0003
b-foo0001
c-foo0004
d-foo0002

Then, if you add another row, it will randomly be assigned one of the four possible salt values and end up near one of the existing rows.

a-foo0003
b-foo0001
c-foo0003
c-foo0004
d-foo0002

Since this assignment will be random, you will need to do more work if you want to retrieve the rows in lexicographic order. In this way, salting attempts to increase throughput on writes, but has a cost during reads.

Hashing
Instead of a random assignment, you could use a one-way hash that would cause a given row to always be "salted" with the same prefix, in a way that would spread the load across the regionservers, but allow for predictability during reads.
Using a deterministic hash allows the client to reconstruct the complete rowkey and use a Get operation to retrieve that row as normal.

Hashing Example
Given the same situation in the salting example above, you could instead apply a one-way hash that would cause the row with key foo0003 to always, and predictably, receive the a prefix. Then, to retrieve that row, you would already know the key. You could also optimize things so that certain pairs of keys were always in the same region, for instance.

Reversing the Key
A third common trick for preventing hotspotting is to reverse a fixed-width or numeric row key so that the part that changes the most often (the least significant digit) is first. This effectively randomizes row keys, but sacrifices row ordering properties.

See https://communities.intel.com/community/itpeernetwork/datastack/blog/2013/11/10/discussion-on-designing-hbase-tables, the article on Salted Tables from the Phoenix project, and the discussion in the comments of HBASE-11682 for more information about avoiding hotspotting.
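A minimal sketch contrasting the two prefixing approaches (illustrative only; the four-bucket count, the letter prefixes, and the use of MD5 are assumptions, not part of any HBase API):

import java.security.MessageDigest;
import java.util.Random;
import org.apache.hadoop.hbase.util.Bytes;

int numBuckets = 4;

// Salting: a random prefix from a small fixed set of buckets. Spreads writes,
// but reads must fan out across every bucket because the prefix is not derivable.
char salt = (char) ('a' + new Random().nextInt(numBuckets));
String saltedKey = salt + "-" + "foo0003";      // e.g. "c-foo0003"

// Hashing: a deterministic prefix derived from the key itself, so a client
// can recompute it later and issue a plain Get for the complete rowkey.
byte[] digest = MessageDigest.getInstance("MD5").digest(Bytes.toBytes("foo0003"));
char prefix = (char) ('a' + ((digest[0] & 0xff) % numBuckets));
String hashedKey = prefix + "-" + "foo0003";    // "foo0003" always receives the same prefix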
-
Monotonically Increasing Row Keys/Timeseries Data
In the HBase chapter of Tom White's book Hadoop: The Definitive Guide (O'Reilly) there is an optimization note on watching out for a phenomenon where an import process walks in lock-step with all clients in concert pounding one of the table's regions (and thus, a single node), then moving onto the next region, and so on. With monotonically increasing row-keys (i.e., using a timestamp), this will happen. See this comic by IKai Lan on why monotonically increasing row keys are problematic in BigTable-like datastores: monotonically increasing values are bad. The pile-up on a single region brought on by monotonically increasing keys can be mitigated by randomizing the input records to not be in sorted order, but in general it's best to avoid using a timestamp or a sequence (e.g. 1, 2, 3) as the row-key.
If you do need to upload time series data into HBase, you should study OpenTSDB as a successful example. It has a page describing the schema it uses in HBase. The key format in OpenTSDB is effectively [metric_type][event_timestamp], which would appear at first glance to contradict the previous advice about not using a timestamp as the key. However, the difference is that the timestamp is not in the lead position of the key, and the design assumption is that there are dozens or hundreds (or more) of different metric types. Thus, even with a continual stream of input data with a mix of metric types, the Puts are distributed across various points of regions in the table.
See the schema design case studies later in this chapter for some rowkey design examples.
-
Try to minimize row and column sizes
Or why are my StoreFile indices large?
In HBase, values are always freighted with their coordinates; as a cell value passes through the system, it'll be accompanied by its row, column name, and timestamp - always. If your rows and column names are large, especially compared to the size of the cell value, then you may run up against some interesting scenarios. One such is the case described by Marc Limotte at the tail of HBASE-3551 (recommended!). Therein, the indices that are kept on HBase storefiles to facilitate random access may end up occupying large chunks of the HBase-allotted RAM because the cell value coordinates are large. Marc in the above-cited comment suggests upping the block size so entries in the store file index happen at a larger interval, or modifying the table schema so it makes for smaller rows and column names. Compression will also make for larger indices. See the thread a question storefileIndexSize up on the user mailing list.
Most of the time small inefficiencies don't matter all that much. Unfortunately, this is a case where they do. Whatever patterns are selected for ColumnFamilies, attributes, and rowkeys, they could be repeated several billion times in your data.
See the section on how HBase stores data internally to see why this is important.
Column Families
Try to keep the ColumnFamily names as small as possible, preferably one character (e.g. "d" for data/default).
See the section on how HBase stores data internally to see why this is important.
-
Attributes
Although verbose attribute names (e.g., "myVeryImportantAttribute") are easier to read, prefer shorter attribute names (e.g., "via") to store in HBase.
See the section on how HBase stores data internally to see why this is important.
-
- Rowkey Length - Keep them as short as is reasonable such that they can still be useful for required - data access (e.g., Get vs. Scan). A short key that is useless for data access is not - better than a longer key with better get/scan properties. Expect tradeoffs when designing - rowkeys. -
-
Byte Patterns
A long is 8 bytes. You can store an unsigned number up to 18,446,744,073,709,551,615 in those eight bytes. If you stored this number as a String -- presuming a byte per character -- you need nearly 3x the bytes.
Not convinced? Below is some sample code that you can run on your own.

// long
//
long l = 1234567890L;
byte[] lb = Bytes.toBytes(l);
System.out.println("long bytes length: " + lb.length);   // returns 8

String s = "" + l;
byte[] sb = Bytes.toBytes(s);
System.out.println("long as string length: " + sb.length);    // returns 10

// hash
//
MessageDigest md = MessageDigest.getInstance("MD5");
byte[] digest = md.digest(Bytes.toBytes(s));
System.out.println("md5 digest bytes length: " + digest.length);    // returns 16

String sDigest = new String(digest);
byte[] sbDigest = Bytes.toBytes(sDigest);
System.out.println("md5 digest as string length: " + sbDigest.length);    // returns 26

Unfortunately, using a binary representation of a type will make your data harder to read outside of your code. For example, this is what you will see in the shell when you increment a value:

hbase(main):001:0> incr 't', 'r', 'f:q', 1
COUNTER VALUE = 1

hbase(main):002:0> get 't', 'r'
COLUMN      CELL
 f:q        timestamp=1369163040570, value=\x00\x00\x00\x00\x00\x00\x00\x01
1 row(s) in 0.0310 seconds

The shell makes a best effort to print a string, and in this case it decided to just print the hex. The same will happen to your row keys inside the region names. It can be okay if you know what's being stored, but it might also be unreadable if arbitrary data can be put in the same cells. This is the main trade-off.
- -
-
Reverse Timestamps

Reverse Scan API
HBASE-4811 implements an API to scan a table or a range within a table in reverse, reducing the need to optimize your schema for forward or reverse scanning. This feature is available in HBase 0.98 and later. See the Scan documentation for more information.

A common problem in database processing is quickly finding the most recent version of a value. A technique using reverse timestamps as a part of the key can help greatly with a special case of this problem. Also found in the HBase chapter of Tom White's book Hadoop: The Definitive Guide (O'Reilly), the technique involves appending (Long.MAX_VALUE - timestamp) to the end of any key, e.g., [key][reverse_timestamp].
The most recent value for [key] in a table can be found by performing a Scan for [key] and obtaining the first record. Since HBase keys are in sorted order, this key sorts before any older row-keys for [key] and thus is first.
This technique would be used instead of relying on versioning where the intent is to hold onto all versions "forever" (or a very long time) and at the same time quickly obtain access to any other version by using the same Scan technique.
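A sketch of composing such a key ("somekey" is a placeholder; Bytes is org.apache.hadoop.hbase.util.Bytes):

import org.apache.hadoop.hbase.util.Bytes;

long reverseTs = Long.MAX_VALUE - System.currentTimeMillis();
// [key][reverse_timestamp]: for the same [key], the newest write sorts first.
byte[] rowkey = Bytes.add(Bytes.toBytes("somekey"), Bytes.toBytes(reverseTs));

A Scan whose start row is Bytes.toBytes("somekey") then returns the most recently written version as its first result.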
-
- Rowkeys and ColumnFamilies - Rowkeys are scoped to ColumnFamilies. Thus, the same rowkey could exist in each - ColumnFamily that exists in a table without collision. -
-
- Immutability of Rowkeys - Rowkeys cannot be changed. The only way they can be "changed" in a table is if the row - is deleted and then re-inserted. This is a fairly common question on the HBase dist-list so - it pays to get the rowkeys right the first time (and/or before you've inserted a lot of - data). -
-
Relationship Between RowKeys and Region Splits
If you pre-split your table, it is critical to understand how your rowkey will be distributed across the region boundaries. As an example of why this is important, consider the example of using displayable hex characters as the lead position of the key (e.g., "0000000000000000" to "ffffffffffffffff"). Running those key ranges through Bytes.split (which is the split strategy used when creating regions in HBaseAdmin.createTable(byte[] startKey, byte[] endKey, numRegions)) for 10 regions will generate the following splits...

48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48                                // 0
54 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10                 // 6
61 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -68                 // =
68 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -126  // D
75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 72                                // K
82 18 18 18 18 18 18 18 18 18 18 18 18 18 18 14                                // R
88 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -44                 // X
95 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -102                // _
102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 102                // f

... (note: the lead byte is listed to the right as a comment.) Given that the first split is a '0' and the last split is an 'f', everything is great, right? Not so fast.
The problem is that all the data is going to pile up in the first 2 regions and the last region, thus creating a "lumpy" (and possibly "hot") region problem. To understand why, refer to an ASCII Table. '0' is byte 48, and 'f' is byte 102, but there is a huge gap in byte values (bytes 58 to 96) that will never appear in this keyspace because the only values are [0-9] and [a-f]. Thus, the middle regions will never be used. To make pre-splitting work with this example keyspace, a custom definition of splits (i.e., not relying on the built-in split method) is required.
Lesson #1: Pre-splitting tables is generally a best practice, but you need to pre-split them in such a way that all the regions are accessible in the keyspace. While this example demonstrated the problem with a hex-key keyspace, the same problem can happen with any keyspace. Know your data.
Lesson #2: While generally not advisable, using hex-keys (and more generally, displayable data) can still work with pre-split tables as long as all the created regions are accessible in the keyspace.
To conclude this example, the following is a sketch of how appropriate splits can be pre-created for hex keys:
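This helper is a sketch, not shipped HBase code; it spaces the split points evenly through the printable hex keyspace so that every created region actually receives keys drawn from [0-9a-f]:

import java.math.BigInteger;
import org.apache.hadoop.hbase.util.Bytes;

public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
  byte[][] splits = new byte[numRegions - 1][];
  BigInteger lowestKey = new BigInteger(startKey, 16);
  BigInteger highestKey = new BigInteger(endKey, 16);
  BigInteger range = highestKey.subtract(lowestKey);
  BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
  lowestKey = lowestKey.add(regionIncrement);
  for (int i = 0; i < numRegions - 1; i++) {
    BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
    splits[i] = Bytes.toBytes(String.format("%016x", key));   // 16-char hex split point
  }
  return splits;
}

The resulting array can be handed to HBaseAdmin.createTable(tableDescriptor, splits) in place of the built-in Bytes.split strategy.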
-
- -
- Number of Versions -
Maximum Number of Versions
The maximum number of row versions to store is configured per column family via HColumnDescriptor. The default for max versions is 1. This is an important parameter because, as described in the data model section, HBase does not overwrite row values, but rather stores different values per row by time (and qualifier). Excess versions are removed during major compactions. The number of max versions may need to be increased or decreased depending on application needs.
It is not recommended to set the number of max versions to an exceedingly high level (e.g., hundreds or more) unless those old values are very dear to you, because this will greatly increase StoreFile size.
-
- Minimum Number of Versions - Like maximum number of row versions, the minimum number of row versions to keep is - configured per column family via HColumnDescriptor. - The default for min versions is 0, which means the feature is disabled. The minimum number - of row versions parameter is used together with the time-to-live parameter and can be - combined with the number of row versions parameter to allow configurations such as "keep the - last T minutes worth of data, at most N versions, but keep at least M versions - around" (where M is the value for minimum number of row versions, M<N). This - parameter should only be set when time-to-live is enabled for a column family and must be - less than the number of row versions. -
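For example, both limits, and the TTL that minimum versions depends on, can be set on the column family (a sketch using the HColumnDescriptor API used elsewhere in this chapter; the family name and values are illustrative):

import org.apache.hadoop.hbase.HColumnDescriptor;

HColumnDescriptor cf = new HColumnDescriptor("d");
cf.setMaxVersions(5);             // keep at most five versions per cell
cf.setMinVersions(1);             // but always retain at least one, even past the TTL
cf.setTimeToLive(24 * 60 * 60);   // TTL in seconds; min versions is only meaningful with a TTL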
-
-
Supported Datatypes
HBase supports a "bytes-in/bytes-out" interface via Put and Result, so anything that can be converted to an array of bytes can be stored as a value. Input could be strings, numbers, complex objects, or even images, as long as they can be rendered as bytes.
There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); search the mailing list for conversations on this topic. All rows in HBase conform to the data model, and that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily.
Counters
One supported datatype that deserves special mention is "counters" (i.e., the ability to do atomic increments of numbers). See Increment in HTable.
Synchronization on counters is done on the RegionServer, not in the client.
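For example (a sketch; the table, family, and qualifier names are placeholders, and conf is an HBase Configuration as in the earlier examples):

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.util.Bytes;

HTable table = new HTable(conf, "counters");
Increment inc = new Increment(Bytes.toBytes("rowkey"));
inc.addColumn(Bytes.toBytes("f"), Bytes.toBytes("hits"), 1L);  // atomic +1, applied on the RegionServer
table.increment(inc);

// Single-column shortcut:
long current = table.incrementColumnValue(
    Bytes.toBytes("rowkey"), Bytes.toBytes("f"), Bytes.toBytes("hits"), 1L);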
-
-
Joins
If you have multiple tables, don't forget to factor the potential need for joins into the schema design.
-
Time To Live (TTL)
ColumnFamilies can set a TTL length in seconds, and HBase will automatically delete rows once the expiration time is reached. This applies to all versions of a row - even the current one. The TTL time encoded in HBase for the row is specified in UTC.
Store files which contain only expired rows are deleted on minor compaction. Setting hbase.store.delete.expired.storefile to false disables this feature. Setting the minimum number of versions to a value other than 0 also disables this.
See HColumnDescriptor for more information.
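The store-file knob mentioned above is an ordinary configuration property (a sketch; it would normally be set in hbase-site.xml on the cluster rather than in client code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

Configuration conf = HBaseConfiguration.create();
// Expired store files are dropped on compaction by default; false disables that behavior.
conf.setBoolean("hbase.store.delete.expired.storefile", false);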
-
Keeping Deleted Cells
By default, delete markers extend back to the beginning of time. Therefore, Get or Scan operations will not see a deleted cell (row or column), even when the Get or Scan operation indicates a time range before the delete marker was placed.
ColumnFamilies can optionally keep deleted cells. In this case, deleted cells can still be retrieved, as long as these operations specify a time range that ends before the timestamp of any delete that would affect the cells. This allows for point-in-time queries even in the presence of deletes.
Deleted cells are still subject to TTL and there will never be more than "maximum number of versions" deleted cells. A new "raw" scan option returns all deleted rows and the delete markers.

Change the Value of KEEP_DELETED_CELLS Using HBase Shell
hbase> alter 't1', NAME => 'f1', KEEP_DELETED_CELLS => true

Change the Value of KEEP_DELETED_CELLS Using the API
...
HColumnDescriptor.setKeepDeletedCells(true);
...

See the API documentation for KEEP_DELETED_CELLS for more information.
-
Secondary Indexes and Alternate Query Paths
This section could also be titled "what if my table rowkey looks like this but I also want to query my table like that." A common example on the dist-list is where a row-key is of the format "user-timestamp" but there are reporting requirements on activity across users for certain time ranges. Thus, selecting by user is easy because it is in the lead position of the key, but time is not.
There is no single answer on the best way to handle this because it depends on...

- Number of users
- Data size and data arrival rate
- Flexibility of reporting requirements (e.g., completely ad-hoc date selection vs. pre-configured ranges)
- Desired execution speed of query (e.g., 90 seconds may be reasonable to some for an ad-hoc report, whereas it may be too long for others)

... and solutions are also influenced by the size of the cluster and how much processing power you have to throw at the solution. Common techniques are in the sub-sections below. This is a comprehensive, but not exhaustive, list of approaches.
It should not be a surprise that secondary indexes require additional cluster space and processing. This is precisely what happens in an RDBMS because the act of creating an alternate index requires both space and processing cycles to update. RDBMS products are more advanced in this regard and handle alternative index management out of the box. However, HBase scales better at larger data volumes, so this is a feature trade-off.
Pay attention to the consistency implications when implementing any of these approaches.
Additionally, see the David Butler response in the dist-list thread HBase, mail # user - Stargate+hbase.
Filter Query
Depending on the case, it may be appropriate to use a filter on the scan. In this case, no secondary index is created. However, don't try a full-scan on a large table like this from an application (i.e., a single-threaded client).
-
Periodic-Update Secondary Index
A secondary index could be created in another table which is periodically updated via a MapReduce job. The job could be executed intra-day but, depending on load-strategy, it could still potentially be out of sync with the main data table.
See the MapReduce examples for more information.
-
Dual-Write Secondary Index
Another strategy is to build the secondary index while publishing data to the cluster (e.g., write to the data table, write to the index table). If this approach is taken after a data table already exists, then bootstrapping will be needed for the secondary index with a MapReduce job (see the previous section).
-
Summary Tables
Where time-ranges are very wide (e.g., a year-long report) and where the data is voluminous, summary tables are a common approach. These would be generated with MapReduce jobs into another table.
See the MapReduce examples for more information.
-
Coprocessor Secondary Index
Coprocessors act like RDBMS triggers. These were added in 0.92. For more information, see the coprocessors section.
-
-
Constraints
HBase currently supports 'constraints' in traditional (SQL) database parlance. The advised usage for Constraints is in enforcing business rules for attributes in the table (e.g., make sure values are in the range 1-10). Constraints could also be used to enforce referential integrity, but this is strongly discouraged as it will dramatically decrease the write throughput of the tables where integrity checking is enabled. Extensive documentation on using Constraints can be found in the Constraint documentation, available since version 0.94.
-
Schema Design Case Studies
The following will describe some typical data ingestion use-cases with HBase, and how the rowkey design and construction can be approached. Note: this is just an illustration of potential approaches, not an exhaustive list. Know your data, and know your processing requirements.
It is highly recommended that you read the rest of this chapter first, before reading these case studies.
The following case studies are described:

- Log Data / Timeseries Data
- Log Data / Timeseries on Steroids
- Customer/Order
- Tall/Wide/Middle Schema Design
- List Data
- Case Study - Log Data and Timeseries Data - Assume that the following data elements are being collected. - - - Hostname - - - Timestamp - - - Log event - - - Value/message - - - We can store them in an HBase table called LOG_DATA, but what will the rowkey be? From - these attributes the rowkey will be some combination of hostname, timestamp, and log-event - - but what specifically? -
Timestamp In The Rowkey Lead Position
The rowkey [timestamp][hostname][log-event] suffers from the monotonically increasing rowkey problem described earlier in this chapter.
There is another pattern frequently mentioned in the dist-lists about "bucketing" timestamps, by performing a mod operation on the timestamp. If time-oriented scans are important, this could be a useful approach. Attention must be paid to the number of buckets, because this will require the same number of scans to return results.

long bucket = timestamp % numBuckets;

... to construct:

[bucket][timestamp][hostname][log-event]

As stated above, to select data for a particular timerange, a Scan will need to be performed for each bucket. 100 buckets, for example, will provide a wide distribution in the keyspace but it will require 100 Scans to obtain data for a single timestamp, so there are trade-offs.
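A sketch of composing such a key (the single-byte bucket, the field order, and the sample values are assumptions for illustration):

import org.apache.hadoop.hbase.util.Bytes;

int numBuckets = 100;
long timestamp = System.currentTimeMillis();
String hostname = "myserver1.mycompany.com";     // placeholder
String logEvent = "com.example.SomeService";     // placeholder

byte[] bucket = new byte[] { (byte) (timestamp % numBuckets) };   // 1-byte bucket prefix
byte[] rowkey = Bytes.add(bucket,
                          Bytes.toBytes(timestamp),
                          Bytes.add(Bytes.toBytes(hostname), Bytes.toBytes(logEvent)));
// Reading a timerange then requires one Scan per bucket (numBuckets Scans in total).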
- -
- Host In The Rowkey Lead Position - The rowkey [hostname][log-event][timestamp] is a candidate if there is a - large-ish number of hosts to spread the writes and reads across the keyspace. This - approach would be useful if scanning by hostname was a priority. -
- -
Timestamp, or Reverse Timestamp?
If the most important access path is to pull the most recent events, then storing the timestamps as reverse-timestamps (e.g., timestamp = Long.MAX_VALUE - timestamp) will create the property of being able to do a Scan on [hostname][log-event] to quickly obtain the most recently captured events.
Neither approach is wrong, it just depends on what is most appropriate for the situation.

Reverse Scan API
HBASE-4811 implements an API to scan a table or a range within a table in reverse, reducing the need to optimize your schema for forward or reverse scanning. This feature is available in HBase 0.98 and later. See the Scan documentation for more information.
- -
Variable Length or Fixed Length Rowkeys?
It is critical to remember that rowkeys are stamped on every column in HBase. If the hostname is "a" and the event type is "e1" then the resulting rowkey would be quite small. However, what if the ingested hostname is "myserver1.mycompany.com" and the event type is "com.package1.subpackage2.subsubpackage3.ImportantService"?
It might make sense to use some substitution in the rowkey. There are at least two approaches: hashed and numeric. In the Hostname In The Rowkey Lead Position example, it might look like this:
Composite Rowkey With Hashes:

- [MD5 hash of hostname] = 16 bytes
- [MD5 hash of event-type] = 16 bytes
- [timestamp] = 8 bytes

Composite Rowkey With Numeric Substitution:
For this approach another lookup table would be needed in addition to LOG_DATA, called LOG_TYPES. The rowkey of LOG_TYPES would be:

- [type] (e.g., byte indicating hostname vs. event-type)
- [bytes] variable length bytes for raw hostname or event-type

A column for this rowkey could be a long with an assigned number, which could be obtained by using an HBase counter.
So the resulting composite rowkey would be:

- [substituted long for hostname] = 8 bytes
- [substituted long for event type] = 8 bytes
- [timestamp] = 8 bytes

In either the Hash or Numeric substitution approach, the raw values for hostname and event-type can be stored as columns.
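A sketch of the hashed composite rowkey (MD5 via java.security; the sample hostname and event type are taken from the discussion above, and the byte widths match the list):

import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

String hostname = "myserver1.mycompany.com";
String eventType = "com.package1.subpackage2.subsubpackage3.ImportantService";
long timestamp = System.currentTimeMillis();

MessageDigest md5 = MessageDigest.getInstance("MD5");
byte[] rowkey = Bytes.add(md5.digest(Bytes.toBytes(hostname)),    // 16 bytes
                          md5.digest(Bytes.toBytes(eventType)),   // 16 bytes
                          Bytes.toBytes(timestamp));              // 8 bytes -> 40-byte fixed-width key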
- -
- -
Case Study - Log Data and Timeseries Data on Steroids
This effectively is the OpenTSDB approach. What OpenTSDB does is re-write data and pack rows into columns for certain time-periods. For a detailed explanation, see: http://opentsdb.net/schema.html, and Lessons Learned from OpenTSDB from HBaseCon2012.
But this is how the general concept works: data is ingested, for example, in this manner...

[hostname][log-event][timestamp1]
[hostname][log-event][timestamp2]
[hostname][log-event][timestamp3]

... with separate rowkeys for each detailed event, but it is re-written like this...

[hostname][log-event][timerange]

... and each of the above events is converted into a column stored with a time-offset relative to the beginning timerange (e.g., every 5 minutes). This is obviously a very advanced processing technique, but HBase makes this possible.
- - -
Case Study - Customer/Order
Assume that HBase is used to store customer and order information. There are two core record-types being ingested: a Customer record type, and an Order record type.
The Customer record type would include all the things that you'd typically expect:

- Customer number
- Customer name
- Address (e.g., city, state, zip)
- Phone numbers, etc.

The Order record type would include things like:

- Customer number
- Order number
- Sales date
- A series of nested objects for shipping locations and line-items (see the Order Object Design section below for details)

Assuming that the combination of customer number and sales order uniquely identifies an order, these two attributes will compose the rowkey, and specifically a composite key such as:

[customer number][order number]

... for an ORDER table. However, there are more design decisions to make: are the raw values the best choices for rowkeys?
The same design questions in the Log Data use-case confront us here. What is the keyspace of the customer number, and what is the format (e.g., numeric? alphanumeric?) As it is advantageous to use fixed-length keys in HBase, as well as keys that can support a reasonable spread in the keyspace, similar options appear:
Composite Rowkey With Hashes:

- [MD5 of customer number] = 16 bytes
- [MD5 of order number] = 16 bytes

Composite Numeric/Hash Combo Rowkey:

- [substituted long for customer number] = 8 bytes
- [MD5 of order number] = 16 bytes
Single Table? Multiple Tables?
A traditional design approach would have separate tables for CUSTOMER and SALES. Another option is to pack multiple record types into a single table (e.g., CUSTOMER++).
Customer Record Type Rowkey:

- [customer-id]
- [type] = type indicating '1' for customer record type

Order Record Type Rowkey:

- [customer-id]
- [type] = type indicating '2' for order record type
- [order]

The advantage of this particular CUSTOMER++ approach is that it organizes many different record-types by customer-id (e.g., a single scan could get you everything about that customer). The disadvantage is that it's not as easy to scan for a particular record-type.
-
Order Object Design
Now we need to address how to model the Order object. Assume that the class structure is as follows:

- Order (an Order can have multiple ShippingLocations)
- LineItem (a ShippingLocation can have multiple LineItems)

... there are multiple options on storing this data.
Completely Normalized
With this approach, there would be separate tables for ORDER, SHIPPING_LOCATION, and LINE_ITEM.
The ORDER table's rowkey was described above.
The SHIPPING_LOCATION's composite rowkey would be something like this:

- [order-rowkey]
- [shipping location number] (e.g., 1st location, 2nd, etc.)

The LINE_ITEM table's composite rowkey would be something like this:

- [order-rowkey]
- [shipping location number] (e.g., 1st location, 2nd, etc.)
- [line item number] (e.g., 1st lineitem, 2nd, etc.)

Such a normalized model is likely to be the approach with an RDBMS, but that's not your only option with HBase. The cons of such an approach are that to retrieve information about any Order, you will need:

- A Get on the ORDER table for the Order
- A Scan on the SHIPPING_LOCATION table for that order to get the ShippingLocation instances
- A Scan on the LINE_ITEM table for each ShippingLocation

... granted, this is what an RDBMS would do under the covers anyway, but since there are no joins in HBase you're just more aware of this fact.
-
Single Table With Record Types
With this approach, there would exist a single table ORDER that would contain all of the record types.
The Order rowkey was described above:

- [order-rowkey]
- [ORDER record type]

The ShippingLocation composite rowkey would be something like this:

- [order-rowkey]
- [SHIPPING record type]
- [shipping location number] (e.g., 1st location, 2nd, etc.)

The LineItem composite rowkey would be something like this:

- [order-rowkey]
- [LINE record type]
- [shipping location number] (e.g., 1st location, 2nd, etc.)
- [line item number] (e.g., 1st lineitem, 2nd, etc.)
-
Denormalized
A variant of the Single Table With Record Types approach is to denormalize and flatten some of the object hierarchy, such as collapsing the ShippingLocation attributes onto each LineItem instance.
The LineItem composite rowkey would be something like this:

- [order-rowkey]
- [LINE record type]
- [line item number] (e.g., 1st lineitem, 2nd, etc. - care must be taken that these are unique across the entire order)

... and the LineItem columns would be something like this:

- itemNumber
- quantity
- price
- shipToLine1 (denormalized from ShippingLocation)
- shipToLine2 (denormalized from ShippingLocation)
- shipToCity (denormalized from ShippingLocation)
- shipToState (denormalized from ShippingLocation)
- shipToZip (denormalized from ShippingLocation)

The pros of this approach include a less complex object hierarchy, but one of the cons is that updating gets more complicated in case any of this information changes.
-
Object BLOB
With this approach, the entire Order object graph is treated, in one way or another, as a BLOB. For example, the ORDER table's rowkey was described above, and a single column called "order" would contain an object that could be deserialized into a container Order, its ShippingLocations, and its LineItems.
There are many options here: JSON, XML, Java Serialization, Avro, Hadoop Writables, etc. All of them are variants of the same approach: encode the object graph to a byte-array. Care should be taken with this approach to ensure backward compatibility in case the object model changes, such that older persisted structures can still be read back out of HBase.
Pros are being able to manage complex object graphs with minimal I/O (e.g., a single HBase Get per Order in this example), but the cons include the aforementioned warning about backward compatibility of serialization, language dependencies of serialization (e.g., Java Serialization only works with Java clients), the fact that you have to deserialize the entire object to get any piece of information inside the BLOB, and the difficulty in getting frameworks like Hive to work with custom objects like this.
-
- -
- - -
- Case Study - "Tall/Wide/Middle" Schema Design Smackdown - This section will describe additional schema design questions that appear on the - dist-list, specifically about tall and wide tables. These are general guidelines and not - laws - each application must consider its own needs. -
Rows vs. Versions
A common question is whether one should prefer rows or HBase's built-in versioning. The context is typically where there are "a lot" of versions of a row to be retained (e.g., where it is significantly above the HBase default of 1 max version). The rows-approach would require storing a timestamp in some portion of the rowkey so that they would not overwrite with each successive update.
Preference: Rows (generally speaking).
-
Rows vs. Columns
Another common question is whether one should prefer rows or columns. The context is typically in extreme cases of wide tables, such as having 1 row with 1 million attributes, or 1 million rows with 1 column apiece.
Preference: Rows (generally speaking). To be clear, this guideline is in the context of extremely wide cases, not in the standard use-case where one needs to store a few dozen or hundred columns. But there is also a middle path between these two options, and that is "Rows as Columns."
-
Rows as Columns
The middle path between Rows vs. Columns is packing data that would be a separate row into columns, for certain rows. OpenTSDB is the best example of this case, where a single row represents a defined time-range, and then discrete events are treated as columns. This approach is often more complex, and may require the additional complexity of re-writing your data, but has the advantage of being I/O efficient. For an overview of this approach, see the "Log Data and Timeseries Data on Steroids" case study above.
-
- -
Case Study - List Data
The following is an exchange from the user dist-list regarding a fairly common question: how to handle per-user list data in Apache HBase.
*** QUESTION ***
We're looking at how to store a large amount of (per-user) list data in HBase, and we were trying to figure out what kind of access pattern made the most sense. One option is to store the majority of the data in a key, so we could have something like:

<FixedWidthUserName><FixedWidthValueId1>:"" (no value)
<FixedWidthUserName><FixedWidthValueId2>:"" (no value)
<FixedWidthUserName><FixedWidthValueId3>:"" (no value)

The other option we had was to do this entirely using:

<FixedWidthUserName><...>:...
<FixedWidthUserName><...>:...

where each row would contain multiple values. So in one case reading the first thirty values would be:

scan { STARTROW => 'FixedWidthUsername' LIMIT => 30}

And in the second case it would be

get 'FixedWidthUserName\x00\x00\x00\x00'

The general usage pattern would be to read only the first 30 values of these lists, with infrequent access reading deeper into the lists. Some users would have <= 30 total values in these lists, and some users would have millions (i.e. power-law distribution).
The single-value format seems like it would take up more space on HBase, but would offer some improved retrieval / pagination flexibility. Would there be any significant performance advantages to be able to paginate via gets vs paginating with scans?
My initial understanding was that doing a scan should be faster if our paging size is unknown (and caching is set appropriately), but that gets should be faster if we'll always need the same page size. I've ended up hearing different people tell me opposite things about performance. I assume the page sizes would be relatively consistent, so for most use cases we could guarantee that we only wanted one page of data in the fixed-page-length case. I would also assume that we would have infrequent updates, but may have inserts into the middle of these lists (meaning we'd need to update all subsequent rows).
Thanks for help / suggestions / follow-up questions.
*** ANSWER ***
If I understand you correctly, you're ultimately trying to store triples in the form "user, valueid, value", right? E.g., something like:

"user123, firstname, Paul",
"user234, lastname, Smith"

(But the usernames are fixed width, and the valueids are fixed width).
And, your access pattern is along the lines of: "for user X, list the next 30 values, starting with valueid Y". Is that right? And these values should be returned sorted by valueid?
The tl;dr version is that you should probably go with one row per user+value, and not build a complicated intra-row pagination scheme on your own unless you're really sure it is needed.
Your two options mirror a common question people have when designing HBase schemas: should I go "tall" or "wide"? Your first schema is "tall": each row represents one value for one user, and so there are many rows in the table for each user; the row key is user + valueid, and there would be (presumably) a single column qualifier that means "the value". This is great if you want to scan over rows in sorted order by row key (thus my question above, about whether these ids are sorted correctly). You can start a scan at any user+valueid, read the next 30, and be done. What you're giving up is the ability to have transactional guarantees around all the rows for one user, but it doesn't sound like you need that. Doing it this way is generally recommended (see here http://hbase.apache.org/book.html#schema.smackdown).
- Your second option is "wide": you store a bunch of values in one row, using different - qualifiers (where the qualifier is the valueid). The simple way to do that would be to just - store ALL values for one user in a single row. I'm guessing you jumped to the "paginated" - version because you're assuming that storing millions of columns in a single row would be - bad for performance, which may or may not be true; as long as you're not trying to do too - much in a single request, or do things like scanning over and returning all of the cells in - the row, it shouldn't be fundamentally worse. The client has methods that allow you to get - specific slices of columns. - Note that neither case fundamentally uses more disk space than the other; you're just - "shifting" part of the identifying information for a value either to the left (into the row - key, in option one) or to the right (into the column qualifiers in option 2). Under the - covers, every key/value still stores the whole row key, and column family name. (If this is - a bit confusing, take an hour and watch Lars George's excellent video about understanding - HBase schema design: http://www.youtube.com/watch?v=_HLoH_PgrLk). - A manually paginated version has lots more complexities, as you note, like having to - keep track of how many things are in each page, re-shuffling if new values are inserted, - etc. That seems significantly more complex. It might have some slight speed advantages (or - disadvantages!) at extremely high throughput, and the only way to really know that would be - to try it out. If you don't have time to build it both ways and compare, my advice would be - to start with the simplest option (one row per user+value). Start simple and iterate! :) - -
- - -
- -
Operational and Performance Configuration Options
See the Performance section for more information on operational and performance schema design options, such as Bloom Filters, table-configured region sizes, compression, and blocksizes.
- -
-