incubator-cassandra-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jonathan Ellis <jbel...@gmail.com>
Subject Re: OPHF
Date Wed, 01 Apr 2009 21:31:58 GMT
Avinash already commited his new order-preserving hash function and I
missed it.  It's in OrderPreservingHashPartitioner.  It takes the
approach that Todd and I discussed back in January: turn the string
into a base-Char.MAX_VALUE number.
(http://groups.google.com/group/cassandra-dev/browse_thread/thread/6bda8518466210e7/f53b79c19010a9ed).
 I chatted with Avinash a little on IM but we didn't finish, so I'm
picking it up here.

There are two problems with this approach.  One is that the hashes
will only be order-preserving for a subset of unicode (all of UCS-2,
but not all of UTF-16; see http://en.wikipedia.org/wiki/UTF-16/UCS-2).
 The other is that this only gives you a naive ordering by code point
value, which for unicode is not what you want and even for ascii
sometimes you want another collation.  (see
http://www.unicode.org/reports/tr10/ and
http://java.sun.com/javase/6/docs/api/java/text/Collator.html).

Say for instance you have inserted keys ['a', 'b', '--a', '--b'] and
you are going to do range queries on them.  The correct
collation-aware sort is ['a', '--a', 'b', '--b'].  But ordering by
char value gives ['--a', '--b', 'a', 'b'].

Switching to a more flexibile system like the one I wrote for
CASSANDRA-3 will let use use Token<BigInteger> for random distribution
or Token<String> for order-preserving, with user-defined collation.  I
don't see a way to get this kind of flexibility from an approach that
insists on turning everything into BigInteger.

-Jonathan

On Mon, Mar 30, 2009 at 2:16 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
> Avinash,
>
> You mentioned that you have a new order-preserving hash function that
> you think will be more generally useful.  Can you post it?
>
> thanks,
>
> -Jonathan
>

Mime
View raw message