cassandra-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mohamed Ibrahim <>
Subject Re: Order preserving partitioning strategy
Date Thu, 26 Aug 2010 15:47:06 GMT
Hi All,

There might be a simpler way to make the OPP achieve even, or close to even

The small change here is that the OPP has to use thresholds to distribute
keys instead of centers. Every node should have a MIN and a MAX threshold. A
key gets inserted in a node x if MIN_x<k<=MAX_x . Nodes share the thresholds
between them, so MAX_x = MIN_(x+1) for all x=0 to n-1.

If ever a key k is attempted to be inserted and k < MIN_0, then we set MIN_0
= k -1 . Similarly, if ever a key k is attempted to be inserted and k is >
MAX_(n-1), then set MAX_(n-1)=k

Those thresholds with such setup can be recalculated very easily to
redistribute the data evenly. Actually, after doing some thinking I came up
with two algorithms, one I call minor redistribution, and the other I called
major redistribution, and the goal of doing a redistribution is to achieve
an equal number of keys per node.

The minor redistribution algorithm does not require full scan and can
recalculate the thresholds very fast, but is approximate. The major
redistribution may require a full key scan (or partial depending on the
implementation) and will be able to exactly calculate the node thresholds to
achieve equal loads. Due to the full (or partial) key scan requirement, the
major redistribution will require longer time to process.

Minor redistribution
Step1: Calculate the desired load per node
    L= Total number of keys in the cluster / n

Step2: Update the max thresholds of nodes 0 to n-2 to achieve the average
load in every node
    n_x is a snapshot of the number of keys in node x
    Node average density D_x=Number of keys in node x / (MAX_x - MIN_x)
    If n_x > L then // We're moving the max threshold back into the node,
since it is overloaded
       New Max= MIN_x + L / D_x
       if (x<n-1) n_(x+1)+=n_x-L;
    else  // We're moving the max threshold into the next node, as the node
is under fulled. Use the next node's density for better approx.
       New Max= MAX_x + (L-n_x)/D_(x+1)
       if (x<n-1) n_(x+1)-=n_x-L;

After the new thresholds are calculated, then nodes should move the data.
The approximate here is the assumption that keys are evenly distributed over
the range of every node, and I chose that because it is the simplest in my
point of view. Since the data we have is already and incomplete data set (as
more keys are expected in the future), any assumption of any distribution
will have errors, so we rather use the simplest.

Major redistribution
This is actually much simpler to do. We know that we need every node to have
L keys (as calculated in the minor distribution). Starting from the smallest
key, move up L keys and set the max threshold, and by repeating we can
actually figure out the max threshold of every node. That where actually we
might need a full key scan, to implement this hopping of L keys to calculate
the max. threshold.

Hopefully this helps, or may be tickles some one else's brain to produce a
nicer idea.

Mohamed Ibrahim

On Thu, Aug 26, 2010 at 12:25 AM, J. Andrew Rogers <>wrote:

> Hi Jonathan,
> I've never seen a paper that discusses it as a primary topic, it is
> always in some other context. IIRC, the most recent discussions of it
> I have seen have been in join algorithm literature from somewhere in
> Asia. MPP analytical databases often implement some form of skew
> adaptivity but there is no standard way because the design tradeoffs
> are context dependent. DB2 also has a non-adaptive technique for
> dealing with skew that should be simple to implement on Cassandra and
> might provide an 80/20 option (more on that a little further on).
> Skew adaptivity is generally implemented with a mix of data structures
> along the lines of an adaptive quad-tree. The reason you only see this
> in analytical databases is that the data skew is unlikely to change
> much and/or have too much concurrent updating. If the distribution
> radically changes all of the time under high concurrency, it will
> create some mix of resource contention, lost selectivity, or runaway
> space consumption depending on implementation detail. The optimal mix
> of pain tends to be a compile-time option, so it isn't very flexible.
> Definitely not optimal for concurrent OLTP-ish workloads.
> Alternatively:
> IBM's DB2 has a couple different data organization options that
> essentially define partitionable skew invariants. The closer the real
> data distribution is to the skew invariant, the more access
> performance is like a distributed hash table. For many data models,
> the data distribution skew can be roughly predicted ahead of time and
> for those applications it is relatively efficient. You can take this
> pretty far; DB2 has libraries of skew invariants based on irregular
> Voronoi tesselation and I believe the most recent version of SQL
> Server has something similar. I think they even have tools for
> sampling representative data for the purposes of finding a good skew
> invariant.
> It is not much but I hope that helps.
> Data distribution skew handling only partly mitigates data access skew
> issues (e.g. temporal data) but it is better than nothing.
> --
> J. Andrew Rogers
> On Tue, Aug 24, 2010 at 10:30 AM, Jonathan Ellis <>
> wrote:
> > What are some good papers to read for background?
> >
> > On Tue, Aug 24, 2010 at 12:26 PM, J. Andrew Rogers
> > <> wrote:
> >> On Mon, Aug 23, 2010 at 8:36 PM, Hien. To Trong <>
> wrote:
> >>> OrderPreservingPartitioner is efficient range queries but can cause
> >>> unevently distributed data. Does anyone has an idea of a
> >>> HybridPartitioner which takes advantages of both RandomPartitioner
> >>> and OPP, or at least a partitioner trade off between them.
> >>
> >>
> >> What you are looking for is skew adaptive partitioning i.e. like a
> >> B+Tree except distributable.
> >>
> >> A couple different methods for doing something like this exist, but
> >> you rarely see them and they have their own (different) tradeoffs. To
> >> the best of my knowledge, implementation requires a fairly deep
> >> architectural commitment; it is more involved than simply defining a
> >> partitioning function and the "adaptive" aspect must be distribution
> >> friendly. It is an active area of research in the literature with no
> >> obvious and simple solutions that can be lashed onto a database engine
> >> "as is".
> >>
> >> --
> >> J. Andrew Rogers
> >>
> >
> >
> >
> > --
> > Jonathan Ellis
> > Project Chair, Apache Cassandra
> > co-founder of Riptano, the source for professional Cassandra support
> >
> >

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message