Subject: loadbalance and different strategies
From: Jaakko
To: cassandra-dev@incubator.apache.org
Date: Tue, 9 Feb 2010 15:51:07 +0900

Hi,

The current implementation of loadbalance seems to work only for
RackUnaware. The problem is that only RackUnaware creates a "flat"
replica space where all nodes are equal. This is not true for the other
strategies: due to rack and DC considerations, replicas are not evenly
distributed among the nodes.

To illustrate the problem, let us consider the following scenario:

- a cluster with nodes A through H. Nodes B and G are in DC1, the rest
  of the nodes in DC2.
- DC shard strategy, replication factor 3 (without loss of generality we
  can omit rack considerations from this discussion).

In this situation the ranges would be (node: primary, replicas):

A: H-A, F-G
B: A-B, H-A
C: B-C, H-A, A-B
D: C-D, B-C
E: D-E, C-D
F: E-F, D-E
G: F-G, A-B, B-C, C-D, D-E, E-F
H: G-H, E-F, F-G

In this situation node G is most likely by far the most loaded one, so
if a node bootstraps (either a new node, or a loadbalance operation), it
will take half of G's range. The problem is that it will take half of
G's _primary_ range, while most of G's load comes from _replicas_.
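
To make the replica-sink effect concrete, here is a rough sketch (not
Cassandra code; it assumes a simplified placement rule of "first extra
replica on the next clockwise node in the other DC, second on the next
clockwise node in the primary's own DC", which differs from the real
DC shard strategy in details but produces the same skew):

# Counts primary vs replica ranges per node under a simplified
# DC-aware placement rule, showing how node G becomes a replica sink.
RING = ["A", "B", "C", "D", "E", "F", "G", "H"]   # clockwise token order
DC = {"B": "DC1", "G": "DC1"}                      # everyone else is in DC2

def dc_of(node):
    return DC.get(node, "DC2")

def replicas_for(primary):
    """Two extra replicas for the range owned by `primary`."""
    start = RING.index(primary)
    clockwise = [RING[(start + i) % len(RING)] for i in range(1, len(RING))]
    other_dc = next(n for n in clockwise if dc_of(n) != dc_of(primary))
    same_dc = next(n for n in clockwise
                   if dc_of(n) == dc_of(primary) and n != other_dc)
    return [other_dc, same_dc]

primary_count = {n: 1 for n in RING}   # each node owns one primary range
replica_count = {n: 0 for n in RING}
for p in RING:
    for r in replicas_for(p):
        replica_count[r] += 1

for n in RING:
    print(f"{n}: primary={primary_count[n]} replicas={replica_count[n]} "
          f"total={primary_count[n] + replica_count[n]}")
# Under this rule G ends up holding 6 of the 8 ranges; splitting only its
# primary range (as loadbalance does today) barely reduces its load.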
After this operation the ring would be (X denotes the new node):

A: H-A, F-G
B: A-B, H-A
C: B-C, H-A, A-B
D: C-D, B-C
E: D-E, C-D
F: E-F, D-E
X: F-X, A-B, B-C, C-D, D-E, E-F
G: X-G, F-X
H: G-H, E-F, F-X, X-G

It is clear from this that the situation has not really improved. The
only difference is that X is now the most loaded node and G probably has
a very light load. If another new node arrives, it will again go in
front of X, and the situation will remain largely the same.

In order to get rid of such replica sinks, nodes would also need to
consider replica "categories". When we are looking for replica
destinations, we essentially consider the categories "in other DC", "in
other rack" and "anywhere". When node X boots into the ring above, it
should consider not just G's primary range, but G's effective range
(primary plus replicas). The amount of replicas is determined largely by
the nodes that belong to the same replica category. If X belongs to DC1
(same as B and G), the best balance would be gained if X booted in the
middle of B and G, as that would divide the replicas evenly. This might
not always be the best place, though, because individual ranges might
differ considerably.

The ideal solution would be to modify the load string so that, instead
of the total load, it reports both the load from the primary range and
the load from replicas. This would allow a bootstrapping node to decide
whether it should take half of the replicas or half of the primary range
to get the optimal result. However, there is currently no way to get
these two numbers separately, so all we have is the total load.

It would not be perfect, but perhaps for now it would be best to only
consider nodes from the same category when making load balancing
decisions. That is, for RackUnaware we consider all nodes as always, but
for the other strategies we would determine the bootstrap token based on
which nodes are in the same category (see the sketch below). I don't
know if this would work, but it should at least be better than what we
have now.

Another related issue: now that we have a strategy per table, how should
we approach load balancing? The optimal decision for one strategy might
be bad for another. If we have just one strategy in use, that is clear,
but with multiple strategies we would need to decide which one to favor.

Or am I thinking about this in a completely wrong way?

-Jaakko
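
A rough sketch of the same-category bootstrap idea above (hypothetical
code, not the existing loadbalance implementation; `pick_bootstrap_token`
and the integer token arithmetic are simplifications, and the 2**127
token space is only assumed here):

# Pick a bootstrap token by looking only at nodes in the new node's own
# replica category (here, the same DC) and splitting the heaviest of
# them, measured by its whole effective load rather than just its
# primary range.
def pick_bootstrap_token(tokens, dc_of, loads, new_node_dc):
    """tokens: {node: int token}; loads: {node: total load};
    returns a token halfway between the heaviest same-DC node and its
    predecessor within that DC (simplified integer token space)."""
    same_dc = [n for n in tokens if dc_of(n) == new_node_dc]
    target = max(same_dc, key=lambda n: loads[n])        # heaviest same-DC node
    ordered = sorted(same_dc, key=lambda n: tokens[n])   # ring order within the DC
    prev = ordered[ordered.index(target) - 1]            # index -1 wraps to the last node
    lo, hi = tokens[prev], tokens[target]
    if lo >= hi:                                         # range wraps around the ring
        hi += 2**127
    return (lo + hi) // 2 % 2**127

# Example with the ring above: B and G are in DC1, G carries most of the
# load, so a new DC1 node X lands between B and G and takes over roughly
# half of G's replicas.
tokens = {n: i * 16 for i, n in enumerate("ABCDEFGH")}   # toy token values
loads = {n: 1 for n in "ABCDEFGH"}
loads["G"] = 6
print(pick_bootstrap_token(tokens, lambda n: "DC1" if n in "BG" else "DC2",
                           loads, "DC1"))                # -> 56, between B (16) and G (96)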