hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2531) 32-bit encoding of regionnames waaaaaaayyyyy too susceptible to hash clashes
Date Tue, 18 May 2010 23:36:54 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868903#action_12868903 ]

stack commented on HBASE-2531:
------------------------------

... continuing

As to what needs to be done: at a minimum, we need to change the way we name region dirs in
the filesystem.  Currently it's done as follows:

{code}
  /**
   * @param regionName
   * @return the encodedName
   */
  public static int encodeRegionName(final byte [] regionName) {
    return Math.abs(JenkinsHash.getInstance().hash(regionName, regionName.length, 0));
  }
{code}
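For a rough sense of why this clashes so readily: Math.abs of a 32-bit hash leaves only about 2^31 distinct values. The sketch below applies the standard birthday approximation, assuming the hash spreads names uniformly (an assumption, not something measured on JenkinsHash). The class and method names are made up for illustration.

```java
final class ClashEstimateSketch {
  /**
   * Approximate probability that at least two of `regions` names share a hash,
   * given `space` equally likely hash values (birthday approximation:
   * p ~= 1 - exp(-n(n-1) / (2 * space))).
   */
  static double clashProbability(long regions, double space) {
    double n = regions;
    // expm1 keeps precision when the probability is tiny.
    return -Math.expm1(-n * (n - 1) / (2.0 * space));
  }

  public static void main(String[] args) {
    double space = Math.pow(2, 31); // non-negative int values after Math.abs
    for (long regions : new long[] {1_000L, 10_000L, 100_000L}) {
      System.out.printf("%d regions: p(clash) ~= %.4f%n",
          regions, clashProbability(regions, space));
    }
  }
}
```

The estimate covers all names ever hashed in a cluster, so splits and table churn count toward the total.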

The minimally intrusive change would be to have the above function return a byte array or a
String instead of an int.  It could md5 or sha-1 the regionName, so there is still some relation
between the regionname and the hash, or it could just return a UUID, a product that cannot be
related back to the regionname.  We'd then need to go through the code base and make sure that
everywhere we deal with the encoded name of a region, we can handle BOTH the new-style
byte [] or String format and the old-style int format.
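One way the "handle BOTH formats" requirement could look, as a sketch only: tell the two apart by shape. This assumes the new style is a 32-character hex md5 String and the old style is the decimal form of a non-negative int; neither the class nor the enum below exists in HBase.

```java
final class EncodedNameFormatSketch {
  /** Hypothetical classification of an encoded region name / dir name. */
  enum Format { OLD_INT, NEW_MD5_HEX, UNKNOWN }

  static Format classify(String encodedName) {
    // Assumed new style: exactly 32 hex characters (an md5 digest in hex).
    if (encodedName.length() == 32 && allDigits(encodedName, 16)) {
      return Format.NEW_MD5_HEX;
    }
    // Old style: decimal digits that fit in a non-negative int (at most 10 digits).
    if (!encodedName.isEmpty() && encodedName.length() <= 10
        && allDigits(encodedName, 10)
        && Long.parseLong(encodedName) <= Integer.MAX_VALUE) {
      return Format.OLD_INT;
    }
    return Format.UNKNOWN;
  }

  private static boolean allDigits(String s, int radix) {
    for (int i = 0; i < s.length(); i++) {
      if (Character.digit(s.charAt(i), radix) < 0) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(classify("1028785192"));
    System.out.println(classify("d41d8cd98f00b204e9800998ecf8427e"));
  }
}
```

Because an old-style name never exceeds 10 characters and the assumed new style is always 32, the two cannot be confused.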

Since we cannot derive the regionname from the UUID, we must be careful we do not misplace
the UUID.  We'd have to save it into the region's HRegionInfo object.

md5/sha-1 would be superior because we can always go from regionname to the encoded name.
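A minimal sketch of the md5 variant, assuming the encoded name becomes a hex String. It uses only the JDK's MessageDigest; the class name is made up, and this is not a patch against the HBase code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5RegionNameSketch {
  /** Hypothetical replacement for encodeRegionName: hex md5 of the full region name. */
  static String encodeRegionName(final byte[] regionName) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5").digest(regionName);
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        hex.append(String.format("%02x", b)); // two lowercase hex chars per byte
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // MD5 is required on every Java platform, so this should not happen.
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  public static void main(String[] args) {
    // The two names from the issue description that clash under the 32-bit hash.
    byte[] a = "test1,6838000000,1273541236167".getBytes(StandardCharsets.UTF_8);
    byte[] b = "test1,0520100000,1273541610201".getBytes(StandardCharsets.UTF_8);
    System.out.println(encodeRegionName(a));
    System.out.println(encodeRegionName(b));
  }
}
```

The same regionname always yields the same 32-character string, so nothing extra has to be stored to find the dir again.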

I was thinking (and I think Kannan was thinking the same) that rather than using the timestamp
alone as the 3rd component of the regionname, we'd make the 3rd component serve two functions:
its current one as differentiator between child and parent (see previous comment), and also
as the name of the region directory in the filesystem.  Timestamp alone would not be enough.
Going by this afternoon's IRC discussions, a UUID isn't suitable either.  We'd have to tag on
something extra.  It could be an md5 of the startkey, or it could just be a jenkins hash of the
startkey, since a clash on hash-of-startkey+timestamp is unlikely.

I liked this latter option because you could read the regionname and then easily find the
region's dir in the filesystem.
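Roughly what that could look like, as a sketch: the 3rd component is the timestamp plus an md5 of the startkey, and the same string doubles as the dir name. The "timestamp.md5hex" layout and every name below are assumptions for illustration; a jenkins hash of the startkey would slot in the same way.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class RegionIdSketch {
  static String md5Hex(byte[] data) {
    try {
      StringBuilder hex = new StringBuilder();
      for (byte b : MessageDigest.getInstance("MD5").digest(data)) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  /** Hypothetical 3rd component: timestamp, a dot, then md5 of the startkey. */
  static String regionId(String startKey, long timestamp) {
    return timestamp + "." + md5Hex(startKey.getBytes(StandardCharsets.UTF_8));
  }

  static String regionName(String table, String startKey, long timestamp) {
    return table + "," + startKey + "," + regionId(startKey, timestamp);
  }

  /** The dir name is everything after the last comma; the id itself has no commas. */
  static String dirNameFromRegionName(String regionName) {
    return regionName.substring(regionName.lastIndexOf(',') + 1);
  }

  public static void main(String[] args) {
    String name = regionName("test1", "6838000000", 1273541236167L);
    System.out.println(name);
    System.out.println(dirNameFromRegionName(name));
  }
}
```

Child and parent share a startkey but not a timestamp, so the id still tells them apart.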

This would be a more intrusive change than the one above, where we just change the hash function.



> 32-bit encoding of regionnames waaaaaaayyyyy too susceptible to hash clashes
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-2531
>                 URL: https://issues.apache.org/jira/browse/HBASE-2531
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.21.0
>
>
> Kannan tripped over two regionnames that hashed the same:
> Here is code demo'ing that his two names hash the same:
> {code}
> package org;
> import org.apache.hadoop.hbase.util.Bytes;
> import org.apache.hadoop.hbase.util.JenkinsHash;
> public class Testing {
>   public static void main(final String [] args) {
>     System.out.println(encodeRegionName(Bytes.toBytes("test1,6838000000,1273541236167")));
>     System.out.println(encodeRegionName(Bytes.toBytes("test1,0520100000,1273541610201")));
>   }
>   /**
>    * @param regionName
>    * @return the encodedName
>    */
>   public static int encodeRegionName(final byte [] regionName) {
>     return Math.abs(JenkinsHash.getInstance().hash(regionName, regionName.length, 0));
>   }
> }
> {code}
> Need new encoding mechanism.  Will need to migrate old regions to new schema.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

