Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Tue, 4 Oct 2011 19:02:33 +0000 (UTC)
From: "Nicolas Spiegelberg (Commented) (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: 
 <1776817013.8685.1317754953930.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <1774716834.337.1317074472798.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HBASE-4489) Better key splitting in
 RegionSplitter
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120382#comment-13120382 ] 

Nicolas Spiegelberg commented on HBASE-4489:
--------------------------------------------

@Dave:

Some more clarity on MD5StringSplit:

1. In the original RegionSplitter use case, the key is MD5(username) + username.  In the general case, users should probably prefix their keys with an MD5 hash to partition their data if there is no access locality between two adjacent keys.  This disperses keys pseudo-randomly across the key space and helps avoid hot row issues.
2. It was kept as ASCII for readability in the WebUI and logs.  The original application was IO bound & the data size was ~1000 bytes/entry, so key size savings were not a big concern.  Additionally, key compression (HBASE-4218) will be available shortly and should make the space savings of a raw-byte key encoding pretty negligible compared to the readability benefits.
3. The end range is not a bug.  Java doesn't natively support uint32, so 7FFFFFFF (Integer.MAX_VALUE) is the max int for the leading 8 hex digits of an MD5 hash unless you use int64 (long) in your calculations.  I'm guessing that you're using the Thrift API, but you should probably code for the native interface.
4. The uneven key space can (and probably should) be fixed, but it is not a significant issue with a large number of regions.  For N regions across a range of K, the common region size is floor(K/N) & the skewed region is K - (N-1) * floor(K/N) == floor(K/N) + K % N < floor(K/N) + N.  As long as floor(K/N) >= N, that is less than 2 * floor(K/N), so the worst case is that one region has double load.  With 10 regions/server, that is at most about 10% extra load on the server hosting the skewed region.
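
To make the arithmetic in point 4 concrete, here is a minimal Java sketch. It assumes the remainder K % N lands in a single region, as the formula above implies. The class and method names are made up for illustration and are not part of RegionSplitter.

```java
final class RegionSkewSketch {
    /** Width of each of the N-1 evenly sized regions: floor(K/N). */
    static long commonSize(long keySpace, int regions) {
        return keySpace / regions;
    }

    /** Width of the one region that absorbs the remainder: K - (N-1) * floor(K/N). */
    static long skewedSize(long keySpace, int regions) {
        return keySpace - (regions - 1) * commonSize(keySpace, regions);
    }

    public static void main(String[] args) {
        long keySpace = 0x7FFFFFFFL; // K: assumed width of the key range
        int regions = 1000;          // N: assumed region count
        long common = commonSize(keySpace, regions);
        long skewed = skewedSize(keySpace, regions);
        System.out.println("common=" + common + " skewed=" + skewed
                + " ratio=" + (double) skewed / common);
    }
}
```

For K = 0x7FFFFFFF and N = 1000 the common size is 2147483 and the skewed size is 2148130, so the skew is well under 1% in that case.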
                
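The key layout from point 1 and the integer limit from point 3 can be sketched in Java as below. This is only an illustration: the helper name rowKey, the uppercase hex encoding, the UTF-8 charset, and the use of the full 32-digit digest are assumptions, not details taken from the original application.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5KeySketch {
    /** Hypothetical helper: hex-encoded MD5(username) followed by the username. */
    static String rowKey(String username) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(username.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(32 + username.length());
            for (byte b : digest) {
                // Two uppercase hex digits per byte keeps the key readable ASCII.
                sb.append(String.format("%02X", b & 0xFF));
            }
            return sb.append(username).toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rowKey("alice"));
        // 7FFFFFFF is the largest 8-digit hex value that fits in a signed int.
        System.out.println(Integer.parseInt("7FFFFFFF", 16) == Integer.MAX_VALUE);
        // Anything above that needs a long (int64).
        System.out.println(Long.parseLong("FFFFFFFF", 16));
    }
}
```

Note that Integer.parseInt("FFFFFFFF", 16) throws NumberFormatException, which is why covering the full 8-digit range needs long arithmetic.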
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66..., \x99\x99..., \xCC\xCC..., and \xFF\xFF.
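
The even division described in the issue text above can be sketched roughly as follows. This is not the attached patch: the class name, the fixed key width parameter, and the BigInteger approach are assumptions chosen to reproduce the five-region example. The first N-1 boundaries act as split points and the last one is the top of the key space.

```java
import java.math.BigInteger;

final class UniformSplitSketch {
    /** Returns floor(i * MAX / N) for i = 1..N, each as a keyLen-byte key. */
    static byte[][] boundaries(int numRegions, int keyLen) {
        // MAX is the all-0xFF key of the given width.
        BigInteger max = BigInteger.ONE.shiftLeft(8 * keyLen).subtract(BigInteger.ONE);
        BigInteger n = BigInteger.valueOf(numRegions);
        byte[][] result = new byte[numRegions][];
        for (int i = 1; i <= numRegions; i++) {
            BigInteger point = max.multiply(BigInteger.valueOf(i)).divide(n);
            result[i - 1] = toFixedWidth(point, keyLen);
        }
        return result;
    }

    /** Left-pads with zero bytes (or drops the sign byte) to get exactly len bytes. */
    static byte[] toFixedWidth(BigInteger value, int len) {
        byte[] raw = value.toByteArray();
        byte[] out = new byte[len];
        int copy = Math.min(raw.length, len);
        System.arraycopy(raw, raw.length - copy, out, len - copy, copy);
        return out;
    }

    /** Renders a key in the \xNN notation used in the issue description. */
    static String escape(byte[] key) {
        StringBuilder sb = new StringBuilder();
        for (byte b : key) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (byte[] b : boundaries(5, 2)) {
            System.out.println(escape(b));
        }
    }
}
```

With five regions and two-byte keys the boundaries come out as \x33\x33, \x66\x66, \x99\x99, \xCC\xCC, and \xFF\xFF, matching the example in the description.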
