hbase-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HBASE-82) row keys should be array of bytes with a specified comparator
Date Sat, 03 May 2008 22:34:55 GMT

     [ https://issues.apache.org/jira/browse/HBASE-82?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-82:
-----------------------

    Attachment: Perf.java

I need to be able to use byte arrays as keys in Maps.  Byte arrays alone don't work as Map
keys since byte [] compares (equals and hashCode) using object identity rather than byte
content.  I need this functionality because rows, region names, etc., are now byte arrays
where before they were Comparable Text.  I could wrap the byte array in an
ImmutableBytesWritable (IBW) once the byte array arrives server-side and use that as the key,
since IBW is Comparable.  That'd work.
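
To make the identity problem concrete, here is a minimal sketch.  BytesKey is a hypothetical
stand-in for IBW (it is not the Hadoop/HBase class); it only assumes that a wrapper which
defines equals, hashCode and compareTo over the byte content behaves the way IBW is described
above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class ByteArrayKeyDemo {

    /** Hypothetical content-comparing wrapper, standing in for IBW. */
    static final class BytesKey implements Comparable<BytesKey> {
        private final byte[] bytes;

        BytesKey(byte[] bytes) {
            this.bytes = bytes.clone(); // defensive copy keeps the key immutable
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof BytesKey && Arrays.equals(bytes, ((BytesKey) o).bytes);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(bytes);
        }

        @Override
        public int compareTo(BytesKey other) {
            // Unsigned byte-by-byte order; a shorter prefix sorts first.
            int n = Math.min(bytes.length, other.bytes.length);
            for (int i = 0; i < n; i++) {
                int a = bytes[i] & 0xff;
                int b = other.bytes[i] & 0xff;
                if (a != b) {
                    return a - b;
                }
            }
            return bytes.length - other.bytes.length;
        }
    }

    static byte[] row(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Looks up an equal-content but distinct byte[]; identity semantics make this miss. */
    static boolean rawArrayLookupSucceeds() {
        Map<byte[], String> map = new HashMap<>();
        map.put(row("row1"), "value");
        return map.containsKey(row("row1"));
    }

    /** Same lookup through the wrapper; content semantics make this hit. */
    static boolean wrappedLookupSucceeds() {
        Map<BytesKey, String> map = new HashMap<>();
        map.put(new BytesKey(row("row1")), "value");
        return map.containsKey(new BytesKey(row("row1")));
    }

    public static void main(String[] args) {
        System.out.println("raw byte[] key found: " + rawArrayLookupSucceeds());
        System.out.println("wrapped key found:    " + wrappedLookupSucceeds());
    }
}
```

The raw lookup misses because the two arrays are different objects; the wrapped lookup hits
because the wrapper compares content.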

But I took a look at using the hash of the byte array, as an Integer, as the Map key.  For
sure, if I use a simple hash of the byte array, the same one we would be computing if we used
IBW (see WritableComparator.hashBytes, which IBW and Text use), it's faster, especially if
invocations are < 100k: there it's 3 to 4 times as fast.  At about 1M iterations the
difference is less: using the byte array hash Integer instead of IBW is only about 20% faster.
I guess HotSpot is what makes for the improvement but, for sure, it's taking its time warming
up.  Since I can make other savings, e.g. get rid of the rowsToLocks Map, I'm going to go with
using a hash code Integer as the key in the locksToRows Map.
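
A rough sketch of what keying on the hash Integer could look like.  Everything here is
illustrative: HashKeyedLocks, tryLock and release are hypothetical names, and simpleHash is
assumed to be a 31-multiplier hash of the kind WritableComparator.hashBytes computes, not a
copy of it.

```java
class HashKeyedLocks {

    /** Simple multiplicative hash over the byte content (assumed form, see above). */
    static int simpleHash(byte[] bytes) {
        int hash = 1;
        for (byte b : bytes) {
            hash = 31 * hash + b;
        }
        return hash;
    }

    // Keyed by the row's hash; the value keeps the row bytes for reference.
    private final java.util.Map<Integer, byte[]> locksToRows = new java.util.HashMap<>();

    /** Returns the lock id (the row hash), or null if that hash is already locked. */
    synchronized Integer tryLock(byte[] row) {
        Integer lockId = simpleHash(row);
        if (locksToRows.containsKey(lockId)) {
            return null;
        }
        locksToRows.put(lockId, row.clone());
        return lockId;
    }

    /** Releases a lock id; returns false if it was not held. */
    synchronized boolean release(Integer lockId) {
        return locksToRows.remove(lockId) != null;
    }
}
```

One consequence of this sketch: two different rows that happen to share a hash look like the
same lock, so a collision shows up as extra contention.  With the hash as both Map key and
lock id, a single Map can do the job.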

A Jenkins hash is more robust than the simple hash, it's better suited to the types of keys
we'll be seeing, and it's better than CRCs, etc. (see http://www.ddj.com/184410284), but it's
more expensive to compute.  In my testing, it was about the same as IBW at 100k or less but at
1M it took ~twice as long.
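
For a feel of the extra work per byte, here is a sketch of Bob Jenkins's one-at-a-time hash.
This is only the simplest of his hashes and is shown purely for illustration; which Jenkins
variant was timed above isn't stated here, and the class name is made up.

```java
import java.nio.charset.StandardCharsets;

class OneAtATimeHash {

    /** Jenkins one-at-a-time hash over the byte content, returned as a 32-bit int. */
    static int hash(byte[] key) {
        int h = 0;
        for (byte b : key) {
            h += b & 0xff;  // treat the byte as unsigned
            h += h << 10;
            h ^= h >>> 6;   // unsigned shift, matching the 32-bit unsigned original
        }
        // Final avalanche steps.
        h += h << 3;
        h ^= h >>> 11;
        h += h << 15;
        return h;
    }

    public static void main(String[] args) {
        byte[] key = "http://example.org/".getBytes(StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(hash(key)));
    }
}
```

Compared with a single multiply-and-add per byte, this does three mixing operations per byte
plus a final avalanche.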

I did various tests.  I'll attach the last code that I was using.  It reads a file of 750k
unique-ish URLs and hashes them.  The code does HRegionServer.batchUpdate-like things,
inserting into a Map in case the hashCode-making is lazy (the put will force the hash code
calculation).

I also tried wrapping the byte array in a ByteBuffer.  This was about 20% or more slower than
IBW.  I'm guessing its hashing code is more involved than that of WritableComparator.
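
For reference, a minimal sketch of the ByteBuffer variant.  It shows only that
ByteBuffer.wrap gives content-based equals and hashCode, so it works as a Map key; it says
nothing about speed, and the class and method names are made up.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class ByteBufferKeyDemo {

    /** Stores one row under a ByteBuffer key, then probes with a separately wrapped array. */
    static String lookup(byte[] stored, byte[] probe) {
        Map<ByteBuffer, String> map = new HashMap<>();
        map.put(ByteBuffer.wrap(stored), "found");
        // equals/hashCode cover the buffer's remaining bytes, so equal content matches.
        return map.get(ByteBuffer.wrap(probe));
    }

    public static void main(String[] args) {
        byte[] a = "row1".getBytes(StandardCharsets.UTF_8);
        byte[] b = "row1".getBytes(StandardCharsets.UTF_8);
        System.out.println(lookup(a, b));
    }
}
```

A ByteBuffer's hash code depends on its position and limit, so a buffer used as a key must
not be read from or repositioned while it sits in the Map.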

> row keys should be array of bytes with a specified comparator
> -------------------------------------------------------------
>
>                 Key: HBASE-82
>                 URL: https://issues.apache.org/jira/browse/HBASE-82
>             Project: Hadoop HBase
>          Issue Type: Wish
>            Reporter: Jim Kellerman
>            Assignee: stack
>             Fix For: 0.2.0
>
>         Attachments: 82-v2.patch, 82-v3.patch, 82-v4.patch, 82.patch, Perf.java
>
>
> I have heard from several people that row keys in HBase should be less restricted than
> hadoop.io.Text.
> What do you think?
> At the very least, a row key has to be a WritableComparable. This would lead to the most
> general case being either hadoop.io.BytesWritable or hbase.io.ImmutableBytesWritable. The
> primary difference between these two classes is that hadoop.io.BytesWritable by default
> allocates 100 bytes, and if you do not pay attention to the length
> (BytesWritable.getSize()), converting a String to a BytesWritable and vice versa can become
> problematic.
> hbase.io.ImmutableBytesWritable, in contrast, only allocates as many bytes as you pass in
> and then does not allow the size to be changed.
> If we were to change from Text to a non-text key, my preference would be for
> ImmutableBytesWritable, because it has a fixed size once set, and operations like get, etc.
> do not have to do something like System.arraycopy where you specify the number of bytes to
> copy.
> Your comments and questions are welcome on this issue. If we receive enough feedback that
> Text is too restrictive, we are willing to change it, but we also need to hear what would
> be the most useful thing to change it to.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.