incubator-cassandra-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Avinash Lakshman <avinash.laksh...@gmail.com>
Subject Re: OPHF
Date Wed, 01 Apr 2009 22:27:53 GMT
I will look into this. Another point about P2P rings is something about the
ability to reason about the system. In any hash fn we hve a notion of range
and how the ring wraps around. For eg. in random hash such as MD5 we say
0-2^128 -1 and then the wrap around. Similarly for the hash fn used here one
could define the same. Today we perform usual sort of strings and the hash
should respect that sort order. I guess you are saying the usual sort of
string does not respect collation. I will look into this and I insist this
should be very doable. Changes to the implementation of the core classes
albeit how simple they are very very scary unless they are completely tested
too in a distributed setting. Over here we do not have detailed test code,
but we test by directing a % of the site traffic to a test cluster before we
sign off on anything.

Avinash

On Wed, Apr 1, 2009 at 3:00 PM, Jonathan Ellis <jbellis@gmail.com> wrote:

> Say for instance you have inserted keys ['a', 'b', '--a', '--b'] and
> you are going to do range queries on them.  The correct
> collation-aware sort is ['a', '--a', 'b', '--b'].  But ordering by
> char value gives ['--a', '--b', 'a', 'b'].
>
> Attached is a test program illustrating this.  The assert will fail
> because the hash ordering is not the same as the collator's.
>
> -Jonathan
>
> On Wed, Apr 1, 2009 at 3:47 PM, Avinash Lakshman
> <avinash.lakshman@gmail.com> wrote:
> > In fact what would you want is hash to respect oorder based on how the
> > strings are sorted. For the example you have it does. So am I missing
> > something?
> >
> > Avinash
> >
> > On Wed, Apr 1, 2009 at 2:43 PM, Jonathan Ellis <jbellis@gmail.com>
> wrote:
> >
> >> For that aspect no difference between a String ring based on compareTo
> >> and a BigInteger one.  The only difference (and it is an important one
> >> for the reasons I gave!) is how the compare works.  But for the p2p
> >> aspect it does not matter.
> >>
> >> On Wed, Apr 1, 2009 at 3:40 PM, Avinash Lakshman
> >> <avinash.lakshman@gmail.com> wrote:
> >> > Doing what you are suggesting scares the hell out of me for a couple
> of
> >> > reasons -  All work in P2P be it random/OPHF does the token handling
> the
> >> way
> >> > it is setup. I cannot try something that has not been well explored in
> >> > academia. I insist this must be doable. I am going to think about this
> >> more.
> >> >
> >> > Avinash
> >> >
> >> > On Wed, Apr 1, 2009 at 2:31 PM, Jonathan Ellis <jbellis@gmail.com>
> >> wrote:
> >> >
> >> >> Avinash already commited his new order-preserving hash function and
I
> >> >> missed it.  It's in OrderPreservingHashPartitioner.  It takes the
> >> >> approach that Todd and I discussed back in January: turn the string
> >> >> into a base-Char.MAX_VALUE number.
> >> >> (
> >> >>
> >>
> http://groups.google.com/group/cassandra-dev/browse_thread/thread/6bda8518466210e7/f53b79c19010a9ed
> >> >> ).
> >> >>  I chatted with Avinash a little on IM but we didn't finish, so I'm
> >> >> picking it up here.
> >> >>
> >> >> There are two problems with this approach.  One is that the hashes
> >> >> will only be order-preserving for a subset of unicode (all of UCS-2,
> >> >> but not all of UTF-16; see http://en.wikipedia.org/wiki/UTF-16/UCS-2
> ).
> >> >>  The other is that this only gives you a naive ordering by code point
> >> >> value, which for unicode is not what you want and even for ascii
> >> >> sometimes you want another collation.  (see
> >> >> http://www.unicode.org/reports/tr10/ and
> >> >> http://java.sun.com/javase/6/docs/api/java/text/Collator.html).
> >> >>
> >> >> Say for instance you have inserted keys ['a', 'b', '--a', '--b'] and
> >> >> you are going to do range queries on them.  The correct
> >> >> collation-aware sort is ['a', '--a', 'b', '--b'].  But ordering by
> >> >> char value gives ['--a', '--b', 'a', 'b'].
> >> >>
> >> >> Switching to a more flexibile system like the one I wrote for
> >> >> CASSANDRA-3 will let use use Token<BigInteger> for random
> distribution
> >> >> or Token<String> for order-preserving, with user-defined collation.
>  I
> >> >> don't see a way to get this kind of flexibility from an approach that
> >> >> insists on turning everything into BigInteger.
> >> >>
> >> >> -Jonathan
> >> >>
> >> >> On Mon, Mar 30, 2009 at 2:16 PM, Jonathan Ellis <jbellis@gmail.com>
> >> wrote:
> >> >> > Avinash,
> >> >> >
> >> >> > You mentioned that you have a new order-preserving hash function
> that
> >> >> > you think will be more generally useful.  Can you post it?
> >> >> >
> >> >> > thanks,
> >> >> >
> >> >> > -Jonathan
> >> >> >
> >> >>
> >> >
> >>
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message