Subject: Re: Order preserving partitioning strategy
From: Mohamed Ibrahim
To: dev@cassandra.apache.org
Date: Thu, 26 Aug 2010 14:48:13 -0400

Hi Nick,

My understanding of tokens in Cassandra is that a key is inserted into the
node with the closest token [1]. This is very similar to clustering with the
k-means approach, where every vector gets assigned to the cluster with the
closest center. That is not equivalent to the min/max range approach, because
one node can hold a tight min/max range with dense keys (node a) while the
following node holds a longer range with less density (node b), and in the
end a and b have equal numbers of keys. Now consider the keys lying just
inside b's range, near the border with a: if keys are distributed using
tokens, they might be closer to the token of node a - the tight region. That
is why the two approaches are not equivalent and will actually yield
different results, and for that reason, if you use tokens, node a might get
overloaded.

I will take a look at the cited paper (Ruhl's paper) and will also look at
the other ticket.

[1] http://wiki.apache.org/cassandra/StorageConfiguration
( under config overview >> partitioner )

Thanks for the reply,
Mohamed Ibrahim

On Thu, Aug 26, 2010 at 1:49 PM, Nick Bailey wrote:

> Tokens are really no different than thresholds. Your token is your min and
> your neighbor's token is your max. To change your min, you move your token.
> To change your max, you move your neighbor's token.
>
> Your idea of calculating the optimal number of keys is similar to the load
> balancing idea described on
> https://issues.apache.org/jira/browse/CASSANDRA-1418
>
> On Thursday, August 26, 2010 10:47am, "Mohamed Ibrahim"
> <mibrahim@clker.com> said:
>
> > Hi All,
> >
> > There might be a simpler way to make the OPP achieve even, or close to
> > even, loads.
> >
> > The small change here is that the OPP has to use thresholds to distribute
> > keys instead of centers. Every node should have a MIN and a MAX
> > threshold. A key k gets inserted in node x if MIN_x <= k < MAX_x.
> > Neighboring nodes share thresholds between them, so MAX_x = MIN_(x+1)
> > for all x=0 to n-2.
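A minimal Java sketch of the threshold lookup described above, assuming keys
are plain strings compared lexicographically and that the MIN thresholds are
held in one sorted in-memory list; the class and method names are invented
for the example and are not Cassandra APIs.

import java.util.ArrayList;
import java.util.List;

// Sketch of threshold-based placement: node x owns keys with
// MIN_x <= k < MAX_x, and MAX_x == MIN_(x+1), so the sorted MIN thresholds
// plus the last MAX fully describe the ranges.
public class ThresholdPartitioner {
    private final List<String> mins;  // mins.get(x) == MIN_x, ascending
    private String maxLast;           // MAX_(n-1), upper bound of the last node

    public ThresholdPartitioner(List<String> mins, String maxLast) {
        this.mins = new ArrayList<>(mins);
        this.maxLast = maxLast;
    }

    // Returns the index of the node that should hold key k, widening the
    // outermost thresholds when k falls outside the current overall range,
    // as described in the quoted proposal.
    public int nodeFor(String k) {
        if (k.compareTo(mins.get(0)) < 0) {
            mins.set(0, k);           // grow MIN_0 downward so node 0 covers k
            return 0;
        }
        if (k.compareTo(maxLast) >= 0) {
            maxLast = k;              // grow MAX_(n-1); treated as inclusive here
            return mins.size() - 1;
        }
        // Binary search for the last x with MIN_x <= k; because
        // MAX_x == MIN_(x+1), that node's range [MIN_x, MAX_x) contains k.
        int lo = 0, hi = mins.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (mins.get(mid).compareTo(k) <= 0) lo = mid; else hi = mid - 1;
        }
        return lo;
    }
}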
> >
> > If ever a key k is attempted to be inserted and k < MIN_0, then we set
> > MIN_0 = k - 1. Similarly, if ever a key k is attempted to be inserted and
> > k > MAX_(n-1), then we set MAX_(n-1) = k.
> >
> > With such a setup, those thresholds can be recalculated very easily to
> > redistribute the data evenly. After doing some thinking I came up with
> > two algorithms, one I call minor redistribution and the other major
> > redistribution; the goal of a redistribution is to achieve an equal
> > number of keys per node.
> >
> > The minor redistribution algorithm does not require a full scan and can
> > recalculate the thresholds very fast, but it is approximate. The major
> > redistribution may require a full key scan (or a partial one, depending
> > on the implementation) and can calculate the node thresholds exactly to
> > achieve equal loads. Due to the full (or partial) key scan requirement,
> > the major redistribution will take longer to process.
> >
> > Minor redistribution
> > ----------------------------
> > Step 1: Calculate the desired load per node
> >     L = total number of keys in the cluster / n
> >
> > Step 2: Update the max thresholds of nodes 0 to n-2 to achieve the
> > average load in every node, where n_x is a snapshot of the number of keys
> > in node x and the node's average density is D_x = n_x / (MAX_x - MIN_x):
> >
> >     if n_x > L then
> >         // The node is overloaded, so move its max threshold back into it.
> >         New Max = MIN_x + L / D_x
> >         if (x < n-1) then MAX_x = MIN_(x+1) = New Max
> >     else
> >         // The node is under-filled, so move its max threshold into the
> >         // next node; use the next node's density for a better approximation.
> >         New Max = MAX_x + (L - n_x) / D_(x+1)
> >         if (x < n-1) then MAX_x = MIN_(x+1) = New Max
> >
> > After the new thresholds are calculated, the nodes should move the data.
> > The approximation here is the assumption that keys are evenly distributed
> > over the range of every node; I chose that because it is the simplest, in
> > my point of view. Since the data we have is already an incomplete data
> > set (as more keys are expected in the future), any assumed distribution
> > will have errors, so we might as well use the simplest one.
> >
> > Major redistribution
> > ----------------------------
> > This is actually much simpler to do. We know that we need every node to
> > have L keys (as calculated in the minor redistribution). Starting from
> > the smallest key, move up L keys and set the max threshold; by repeating
> > this we can figure out the max threshold of every node. That is where we
> > might actually need a full key scan, to implement this hopping of L keys
> > to calculate the max thresholds.
> >
> > Hopefully this helps, or maybe it tickles someone else's brain into
> > producing a nicer idea.
> >
> > Best,
> > Mohamed Ibrahim
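A rough Java sketch of the minor and major redistribution steps quoted above,
under the simplifying assumption that keys can be mapped to numbers so that
ranges and densities are well defined; all names here are invented for
illustration, and none of this is Cassandra code.

import java.util.Arrays;

// Sketch of the two redistribution strategies from the quoted proposal.
// Thresholds are returned as an array t of length n+1 where t[x] = MIN_x
// and t[x+1] = MAX_x (shared with MIN_(x+1)).
public class ThresholdRedistribution {

    // Minor redistribution: approximate, no key scan. counts[x] is n_x, a
    // snapshot of the number of keys on node x; mins/maxLast are the
    // current thresholds.
    public static double[] minor(double[] mins, double maxLast, long[] counts) {
        int n = mins.length;
        double targetLoad = (double) Arrays.stream(counts).sum() / n;   // L

        double[] old = Arrays.copyOf(mins, n + 1);   // old[x+1] == old MAX_x
        old[n] = maxLast;
        double[] result = old.clone();

        for (int x = 0; x < n - 1; x++) {            // nodes 0..n-2 get a new MAX
            double densityX = counts[x] / (old[x + 1] - old[x]);        // D_x
            double newMax;
            if (counts[x] > targetLoad) {
                // Overloaded: pull MAX_x back so roughly L keys remain on node x.
                newMax = old[x] + targetLoad / densityX;
            } else {
                // Under-filled: push MAX_x into node x+1, using that node's
                // density to estimate how far to move for the missing keys.
                double densityNext = counts[x + 1] / (old[x + 2] - old[x + 1]); // D_(x+1)
                newMax = old[x + 1] + (targetLoad - counts[x]) / densityNext;
            }
            result[x + 1] = newMax;                  // MAX_x == MIN_(x+1)
        }
        return result;
    }

    // Major redistribution: exact, but needs the keys in sorted order
    // (i.e. a full or partial key scan). Every node gets L consecutive keys.
    public static double[] major(double[] sortedKeys, int n,
                                 double min0, double maxLast) {
        double[] result = new double[n + 1];
        result[0] = min0;
        result[n] = maxLast;
        long perNode = sortedKeys.length / n;        // L
        for (int x = 1; x < n; x++) {
            // MIN_x (== MAX_(x-1)) is the first key of the x-th block of L keys.
            result[x] = sortedKeys[(int) (x * perNode)];
        }
        return result;
    }
}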
> > On Thu, Aug 26, 2010 at 12:25 AM, J. Andrew Rogers
> > <jar.mailbox@gmail.com> wrote:
> >
> >> Hi Jonathan,
> >>
> >> I've never seen a paper that discusses it as a primary topic; it is
> >> always in some other context. IIRC, the most recent discussions of it
> >> I have seen have been in the join-algorithm literature from somewhere
> >> in Asia. MPP analytical databases often implement some form of skew
> >> adaptivity, but there is no standard way because the design tradeoffs
> >> are context dependent. DB2 also has a non-adaptive technique for
> >> dealing with skew that should be simple to implement on Cassandra and
> >> might provide an 80/20 option (more on that a little further on).
> >>
> >> Skew adaptivity is generally implemented with a mix of data structures
> >> along the lines of an adaptive quad-tree. The reason you only see this
> >> in analytical databases is that the data skew is unlikely to change
> >> much or see too much concurrent updating. If the distribution
> >> radically changes all of the time under high concurrency, it will
> >> create some mix of resource contention, lost selectivity, or runaway
> >> space consumption, depending on implementation details. The optimal mix
> >> of pain tends to be a compile-time option, so it isn't very flexible.
> >> Definitely not optimal for concurrent OLTP-ish workloads.
> >>
> >> Alternatively:
> >>
> >> IBM's DB2 has a couple of different data organization options that
> >> essentially define partitionable skew invariants. The closer the real
> >> data distribution is to the skew invariant, the more access
> >> performance is like a distributed hash table. For many data models,
> >> the data distribution skew can be roughly predicted ahead of time, and
> >> for those applications it is relatively efficient. You can take this
> >> pretty far; DB2 has libraries of skew invariants based on irregular
> >> Voronoi tessellation, and I believe the most recent version of SQL
> >> Server has something similar. I think they even have tools for
> >> sampling representative data for the purpose of finding a good skew
> >> invariant.
> >>
> >> It is not much, but I hope that helps.
> >>
> >> Data distribution skew handling only partly mitigates data access skew
> >> issues (e.g. temporal data), but it is better than nothing.
> >>
> >> --
> >> J. Andrew Rogers
> >>
> >>
> >> On Tue, Aug 24, 2010 at 10:30 AM, Jonathan Ellis wrote:
> >> > What are some good papers to read for background?
> >> >
> >> > On Tue, Aug 24, 2010 at 12:26 PM, J. Andrew Rogers wrote:
> >> >> On Mon, Aug 23, 2010 at 8:36 PM, Hien. To Trong wrote:
> >> >>> OrderPreservingPartitioner is efficient for range queries but can
> >> >>> cause unevenly distributed data. Does anyone have an idea for a
> >> >>> HybridPartitioner which takes advantage of both RandomPartitioner
> >> >>> and OPP, or at least a partitioner that trades off between them?
> >> >>
> >> >> What you are looking for is skew-adaptive partitioning, i.e. like a
> >> >> B+Tree except distributable.
> >> >>
> >> >> A couple of different methods for doing something like this exist,
> >> >> but you rarely see them, and they have their own (different)
> >> >> tradeoffs. To the best of my knowledge, implementation requires a
> >> >> fairly deep architectural commitment; it is more involved than simply
> >> >> defining a partitioning function, and the "adaptive" aspect must be
> >> >> distribution friendly. It is an active area of research in the
> >> >> literature, with no obvious and simple solutions that can be lashed
> >> >> onto a database engine "as is".
> >> >>
> >> >> --
> >> >> J. Andrew Rogers
> >> >
> >> >
> >> > --
> >> > Jonathan Ellis
> >> > Project Chair, Apache Cassandra
> >> > co-founder of Riptano, the source for professional Cassandra support
> >> > http://riptano.com
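A toy Java illustration of the non-adaptive approach described above: sample
representative keys up front, derive fixed split points from the sample's
quantiles, and use those boundaries as the partitioning "skew invariant".
This is not DB2's or Cassandra's actual mechanism, and the names are invented
for the example.

import java.util.Arrays;

// Fit fixed split points to sampled data so a skewed but stable key
// distribution maps to roughly equal-sized buckets.
public class SampledSplitPoints {

    // Returns n-1 split points that divide the sampled key space into n
    // buckets of approximately equal sample mass.
    public static String[] splitPoints(String[] sampleKeys, int n) {
        String[] sorted = sampleKeys.clone();
        Arrays.sort(sorted);
        String[] splits = new String[n - 1];
        for (int i = 1; i < n; i++) {
            splits[i - 1] = sorted[(int) ((long) i * sorted.length / n)];
        }
        return splits;
    }

    // Key-to-bucket lookup against the fixed split points: binary search,
    // then map the insertion point back to a bucket index.
    public static int bucketFor(String[] splits, String key) {
        int pos = Arrays.binarySearch(splits, key);
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }
}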