hbase-user mailing list archives

From Joe Pallas <pal...@cs.stanford.edu>
Subject Re: How to scan rows starting with a particular string?
Date Wed, 27 Apr 2011 17:00:00 GMT

On Apr 26, 2011, at 11:54 PM, Himanshu Vashishtha wrote:

> HBase uses utf-8 encoding to store the row keys, so it can store non-ascii
> characters too (yes they will be larger than 1 byte).

That statement may be misleading.  HBase doesn't use any encoding at all, because row keys
are simply arrays of bytes.  HBase cares only about the sorting order of those byte arrays,
and neither knows nor cares what interpretation the client may attach to them.

The UTF-8 standard mentions that the byte-value lexicographic sorting order of UTF-8 strings
matches the sorting order of the Unicode character numbers, so a client can turn 16- or 32-bit
Unicode strings into UTF-8 in order to use them as keys and they will sort the same way. 
(Although the standard warns that "a sort order based on character numbers is almost never
culturally valid.")
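As a rough illustration of that ordering property, here is a minimal Python sketch. The sample words and the helper names `by_code_points` and `by_utf8_bytes` are made up for this example and have nothing to do with HBase itself. It sorts the same strings once by Unicode character number and once by their UTF-8 bytes, and the two orders should agree:

```python
# Sample strings mixing 1-, 2-, 3- and 4-byte UTF-8 characters (illustrative only).
words = ["fad", "façade", "zebra", "éclair", "日本", "😀", "a", "facade"]

def by_code_points(s):
    # Sort key: the sequence of Unicode character numbers.
    return [ord(ch) for ch in s]

def by_utf8_bytes(s):
    # Sort key: the raw UTF-8 bytes; Python compares bytes as unsigned values,
    # lexicographically, which is the kind of ordering described for row keys.
    return s.encode("utf-8")

code_point_order = sorted(words, key=by_code_points)
byte_order = sorted(words, key=by_utf8_bytes)

print(code_point_order == byte_order)
print(byte_order)
```

Note that "fad" lands ahead of "façade" in both orders, because the first byte of ç (0xC3) is larger than the byte for "d" (0x64).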

On the plus side, that means you never have to worry about "What's the next character after
ç?"  Just add 1 to the last byte of the key.  But don't be surprised when "fad" comes before
"façade" in your sort.
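For the original question (scanning rows that start with a given string), the "add 1" idea can be sketched in Python as below. The function name `stop_key_for_prefix` is hypothetical, not an HBase API. Dropping trailing 0xFF bytes before incrementing, and returning an empty key to mean "no upper bound", are assumptions of this sketch. The result is treated purely as bytes and need not be valid UTF-8.

```python
def stop_key_for_prefix(prefix: bytes) -> bytes:
    """Smallest byte string that sorts after every key starting with prefix.

    An empty result means "no upper bound" (assumption of this sketch).
    """
    # A trailing 0xFF cannot be incremented, so drop it and carry leftward.
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""
    # Add 1 to the last remaining byte.
    return trimmed[:-1] + bytes([trimmed[-1] + 1])

start = "faç".encode("utf-8")        # bytes 66 61 C3 A7
stop = stop_key_for_prefix(start)    # bytes 66 61 C3 A8

def in_range(key: str) -> bool:
    # Half-open range [start, stop), compared byte by byte.
    return start <= key.encode("utf-8") < stop

print(stop)
print(in_range("façade"), in_range("fad"))
```

A scan over the half-open range from `start` to `stop` would then cover exactly the keys whose bytes begin with the prefix.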

joe

