hbase-issues mailing list archives

From "Dave Revell (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter
Date Tue, 27 Sep 2011 00:01:24 GMT

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115051#comment-13115051 ]

Dave Revell commented on HBASE-4489:

Mingjie, I would agree with you if the existing behavior were sane, but it has some problems:

1. Using ASCII strings as keys is a poor choice, and making it the default in a built-in
tool would send the wrong message. Since HFiles repeat the key for every cell in the table,
small key size is very important.

2. The MD5StringSplit class contains a bug that makes the current behavior even less sane.
It assumes that an ASCII hex representation of an MD5 hash begins with 0, 1, 2, 3, 4, 5, 6,
or 7. This is incorrect, since an MD5 hash is just a 128-bit number and can start with any
digit. The result will be a single oversized region at the high end of the key space.
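The second point can be checked without HBase. The following is a minimal sketch, not code from RegionSplitter or MD5StringSplit: the class name, the helper names, and the "row-N" sample keys are all made up for illustration. It hashes a batch of sample strings and collects the first hex digit of each MD5 hash. The well-known MD5 of the empty string, d41d8cd98f00b204e9800998ecf8427e, already starts with a digit above 7.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.TreeSet;

public class Md5LeadingDigits {
    /** Returns the 32-character lowercase hex form of the MD5 digest of the input. */
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xFF));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Collects the distinct first hex digits seen across hypothetical sample keys. */
    static TreeSet<Character> leadingDigits(int sampleCount) {
        TreeSet<Character> seen = new TreeSet<>();
        for (int i = 0; i < sampleCount; i++) {
            seen.add(md5Hex("row-" + i).charAt(0));
        }
        return seen;
    }

    public static void main(String[] args) {
        // With enough samples, digits 8 through f show up alongside 0 through 7.
        System.out.println(leadingDigits(10000));
    }
}
```

Any key whose hash starts with 8 through f sorts after "7FFFFFFF", so under the assumption described above all such keys would land in the last region.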

So as far as I can tell, the existing behavior does the wrong thing, and furthermore does
it wrongly. We shouldn't preserve this situation. 

If I've misunderstood the situation, I definitely welcome corrections.

> Better key splitting in RegionSplitter
> --------------------------------------
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
> The RegionSplitter utility allows users to create a pre-split table from the command
> line or do a rolling split on an existing table. It supports pluggable split algorithms that
> implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes
> keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not
> a sane default, and seems useless to most users. Users are likely to be surprised by the fact
> that all the region splits occur in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes,
> which is what this patch does. Making a table with five regions would split at \x33\x33...,
> \x66\x66..., \x99\x99..., and \xCC\xCC..., with \xFF\xFF... as the upper end of the key space.
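The arithmetic behind those boundaries can be sketched as follows. This is an illustration only, not the code in the attached patches: the class name, method names, and the fixed key length parameter are assumptions made for the example. It treats a key of a given length as an unsigned integer, divides the range from all-zero bytes to all-0xFF bytes evenly, and returns only the interior boundaries (one fewer than the number of regions), treating \xFF\xFF... as the upper end of the key space.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class UniformSplitSketch {
    /** Computes numRegions - 1 split keys of keyLength bytes, evenly spaced over the byte space. */
    static List<byte[]> splitKeys(int numRegions, int keyLength) {
        // Largest key of this length: every byte is 0xFF.
        BigInteger max = BigInteger.ONE.shiftLeft(8 * keyLength).subtract(BigInteger.ONE);
        List<byte[]> keys = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            BigInteger point = max.multiply(BigInteger.valueOf(i))
                                  .divide(BigInteger.valueOf(numRegions));
            keys.add(toFixedWidth(point, keyLength));
        }
        return keys;
    }

    /** Converts a non-negative value to exactly width bytes, big-endian, zero-padded on the left. */
    static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray(); // may carry an extra leading sign byte, or be shorter
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }

    /** Renders a key in \xNN notation. */
    static String toHex(byte[] key) {
        StringBuilder sb = new StringBuilder();
        for (byte b : key) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (byte[] key : splitKeys(5, 2)) {
            System.out.println(toHex(key));
        }
    }
}
```

With five regions and two-byte keys, the boundaries come out as \x33\x33, \x66\x66, \x99\x99, and \xCC\xCC, which matches the pattern described above.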


