hbase-issues mailing list archives

From "Jesse Yates (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-2600) Change how we do meta tables; from tablename+STARTROW+randomid to instead, tablename+ENDROW+randomid
Date Mon, 22 Oct 2012 20:48:14 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481772#comment-13481772
] 

Jesse Yates commented on HBASE-2600:
------------------------------------

We've been doing a lot of thinking over here at Salesforce about this issue and were thinking
about picking up work on this, if Alex is busy. The current approach is pretty good and has
a lot of merits. We also discussed the option of using the multi-row transaction stuff (which
would be another reason why we couldn't split META). I did a full write-up/analysis of the
options (see https://dl.dropbox.com/u/6147077/Proposal-HBASE-2600.docx).

What I ended up coming up with is a little bit crazy, but I think it works. (I'm not dealing
with tablenames as hashes, but that is pretty trivial.) What I'm looking to solve:

(1) replacing start keys with end keys
(2) ensuring correct sorting
(3) ensuring correct split behavior to avoid META holes
(4) moving the components of the compound key into their own family/qualifier

There seem to be a couple of pieces we can put together to meet all the above goals.
First, row keys are encoded as:

For all non-terminal regions:
{code}
<table>0x00<endkey>
{code}
For the terminal region:
{code}
<table>0x01
{code}
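
To make the ordering concrete, here is a minimal sketch of the encoding above. The function name `encode_meta_row` and the sample tables are made up for illustration, and it assumes table names never contain the bytes 0x00 or 0x01. Python compares `bytes` lexicographically as unsigned values, which is used here as a stand-in for how row keys sort in META.

```python
from typing import Optional

NON_TERMINAL = b"\x00"  # separator placed before the end key
TERMINAL = b"\x01"      # marker for the last region of a table


def encode_meta_row(table: bytes, end_key: Optional[bytes]) -> bytes:
    """Build a META row key; end_key=None stands for the terminal region.

    Hypothetical helper, not an HBase API.
    """
    if end_key is None:
        return table + TERMINAL
    return table + NON_TERMINAL + end_key


rows = [
    encode_meta_row(b"t", None),    # terminal region of table "t"
    encode_meta_row(b"t", b"m"),
    encode_meta_row(b"t2", b"a"),   # a different table sharing the prefix "t"
    encode_meta_row(b"t", b"c"),
]

# Lexicographic byte order: non-terminal rows by end key, then the terminal row,
# and only then rows of any table whose name extends "t".
sorted_rows = sorted(rows)
```

Because 0x01 is greater than 0x00 and smaller than any byte allowed in a table name (under the assumption above), the terminal row lands after every other row of its table and before the rows of the next table.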

Then we can move the encoded name into its own cell, under the "info:encodedname" column.
Next, the regionid is moved to the timestamp and used for all updates to the region in META (this
includes offlining and marking the parent as split). Since regionids are already timestamps
by convention, this doesn't stray that far afield.

META then looks something like:

{code}
<table>0x00<endkey> | info | encodedname      | <regionid>  | <md5 hash>
                    |      | regioninfo       | <regionid>  | <hri – 1>
                    |      | server           | <regionid>  | <server:port>
                    |      | server.startcode | <regionid>  | <startcode>
                    |      | splitA           | <regionid>  | <hri – 3>
                    |      | splitB           | <regionid>  | <hri – 4>
<table>0x01         | info | encodedname      | <regionid2> | <hri-4>
                    |      | ...              | <regionid2> | ...
{code}

Obviously there are some serious implications for how lookups and splits work.

Splits need to take the opposite approach with respect to putting children in META. Currently,
we write the bottom and then the top child, counting on the HTable to retry when it finds
an offlined region. Now we just flip the ordering: (1) offline the parent, (2) put the
'top' child, and then (3) insert the 'bottom' child.

The problem lies in making sure that the bottom child sorts before the parent. In the previous
scheme we ensured that sorting by putting a regionid in the row key. With the above scheme,
the 'top' child will always sort before the parent because it has a lower endkey. The 'bottom'
child actually has _exactly the same row key_ as the parent. However, the bottom child still
sorts first because it has a larger regionid, and within a row larger timestamps sort first
(which is also already baked into the code).
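
The sorting argument can be checked with a small model. This is only a sketch: `MetaEntry`, the key ranges, and the regionid values are invented for illustration, and the sort key (row ascending, then timestamp descending) is a simplification of the real cell comparator that ignores family and qualifier.

```python
from typing import NamedTuple, Tuple


class MetaEntry(NamedTuple):
    row: bytes      # <table>0x00<endkey>
    regionid: int   # stored as the cell timestamp in this scheme
    name: str       # label used only for this illustration


def sort_key(entry: MetaEntry) -> Tuple[bytes, int]:
    # Rows ascend; within one row, larger (newer) timestamps come first.
    return (entry.row, -entry.regionid)


# Parent covers keys up to end key "m" and is split at "g".
parent = MetaEntry(b"t\x00m", 1000, "parent, endkey m")
top = MetaEntry(b"t\x00g", 2000, "top child, endkey g")
bottom = MetaEntry(b"t\x00m", 2000, "bottom child, endkey m")

ordered = [e.name for e in sorted([parent, bottom, top], key=sort_key)]
```

Here `ordered` lists the top child first (lower end key), then the bottom child, then the parent: the bottom child and the parent share a row key, and the larger regionid is what breaks the tie.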

We also must do a check of the timestamp vs. the expected regionid to ensure that we can get
the correct region, but that is a minor overhead.

NOTE: this also gives us provenance of regions, at least until the catalog janitor cleans
up parent regions.

For lookups, you would query for the first region that matches (similar to the current mechanism):
{code}
	<table>0x00<desired key>999999……
{code}

which still finds the correct (bottom) child because its regionid must be greater than its
parent's, causing it to sort _before_ its parent in the same row.
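
A rough sketch of that lookup as a forward search over a sorted, in-memory stand-in for META follows. The `locate` function and the sample entries are hypothetical. One deliberate difference from the probe key shown above: since region end keys are exclusive, the sketch looks for the first row that sorts strictly after `<table>0x00<desired key>` rather than appending a suffix; that is an assumption of this example, not something the proposal spells out.

```python
import bisect
from typing import Optional

# (row key, -regionid, label), kept in the order META would return them:
# rows ascending, and within a row the largest regionid first.
meta = sorted([
    (b"t\x00g", -2000, "top child"),
    (b"t\x00m", -2000, "bottom child"),
    (b"t\x00m", -1000, "offlined parent"),
    (b"t\x01", -500, "terminal region"),
])
rows = [entry[0] for entry in meta]


def locate(table: bytes, key: bytes) -> Optional[str]:
    """Return the first entry whose row sorts after <table>0x00<key>."""
    probe = table + b"\x00" + key
    i = bisect.bisect_right(rows, probe)  # first row strictly greater than probe
    if i == len(meta):
        return None
    row = meta[i][0]
    prefix = row[: len(table) + 1]
    # Make sure the scan did not run into another table's rows.
    if prefix not in (table + b"\x00", table + b"\x01"):
        return None
    return meta[i][2]
```

With this data, `locate(b"t", b"h")` lands on the bottom child rather than the offlined parent, and keys at or beyond "m" fall through to the terminal region, with no backward walk involved.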

This gives us correct sorting, an easily readable META, and no holes. Oh, and we can remove
all the backward scanning.
                
> Change how we do meta tables; from tablename+STARTROW+randomid to instead, tablename+ENDROW+randomid
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2600
>                 URL: https://issues.apache.org/jira/browse/HBASE-2600
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: Alex Newman
>         Attachments: 0001-Changed-regioninfo-format-to-use-endKey-instead-of-s.patch,
0001-HBASE-2600.-Change-how-we-do-meta-tables-from-tablen.patch, 0001-HBASE-2600.-Change-how-we-do-meta-tables-from-tablen-v2.patch,
0001-HBASE-2600.-Change-how-we-do-meta-tables-from-tablen-v4.patch, 0001-HBASE-2600.-Change-how-we-do-meta-tables-from-tablen-v6.patch,
0001-HBASE-2600.-Change-how-we-do-meta-tables-from-tablen-v7.2.patch, 0001-HBASE-2600.-Change-how-we-do-meta-tables-from-tablen-v8,
0001-HBASE-2600.-Change-how-we-do-meta-tables-from-tablen-v8.1, 0001-HBASE-2600.-Change-how-we-do-meta-tables-from-tablen-v9.patch,
0001-HBASE-2600.v10.patch, 0001-HBASE-2600-v11.patch, 2600-trunk-01-17.txt, HBASE-2600+5217-Sun-Mar-25-2012-v3.patch,
HBASE-2600+5217-Sun-Mar-25-2012-v4.patch, hbase-2600-root.dir.tgz, jenkins.pdf
>
>
> This is an idea that Ryan and I have been kicking around on and off for a while now.
> If regionnames were made of tablename+endrow instead of tablename+startrow, then in the
metatables, doing a search for the region that contains the wanted row, we'd just have to
open a scanner using passed row and the first row found by the scan would be that of the region
we need (If offlined parent, we'd have to scan to the next row).
> If we redid the meta tables in this format, we'd be using an access that is natural to
hbase, a scan as opposed to the perverse, expensive getClosestRowBefore we currently have
that has to walk backward in meta finding a containing region.
> This issue is about changing the way we name regions.
> If we were using scans, prewarming client cache would be near costless (as opposed to
what we'll currently have to do which is first a getClosestRowBefore and then a scan from
the closestrowbefore forward).
> Converting to the new method, we'd have to run a migration on startup changing the content
in meta.
> Up to this, the randomid component of a region name has been the timestamp of region
creation.   HBASE-2531 "32-bit encoding of regionnames waaaaaaayyyyy too susceptible to hash
clashes" proposes changing the randomid so that it contains actual name of the directory in
the filesystem that hosts the region.  If we had this in place, I think it would help with
the migration to this new way of doing the meta because as is, the region name in fs is a
hash of regionname... changing the format of the regionname would mean we generate a different
hash... so we'd need hbase-2531 to be in place before we could do this change.

