Subject: Re: Unbalanced cluster with RandomPartitioner
From: Marcel Steinbach
Date: Fri, 20 Jan 2012 10:32:24 +0100
To: user@cassandra.apache.org

On 19.01.2012, at 20:15, Narendra Sharma wrote:

> I believe you need to move the nodes on the ring. What was the load on the nodes before you added 5 new nodes? It's just that you are getting more data in certain token ranges than in others.

With three nodes, it was also imbalanced.

What I don't understand is why the md5 sums would generate such massive hot spots.

Most of our keys look like this:

00013270494972450001234567

with the first 16 digits being a timestamp of one of our application servers' startup times, and the last 10 digits being sequentially generated per user.

There may be a lot of keys that start with e.g. "0001327049497245" (or some other timestamp). But I was under the impression that md5 doesn't care about that and generates a uniform distribution anyway? Then again, I know next to nothing about md5. Maybe someone else has better insight into the algorithm? (A quick sanity check is sketched below.)

However, we also use cfs with a date ("yyyymmdd") as key, as well as cfs with uuids as keys.
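The following is only a rough sketch I used to convince myself: it mimics what I understand RandomPartitioner to do (abs of the md5 digest read as a big integer) rather than using Cassandra's actual code, it assumes the keys are hashed as their ASCII string form, and it takes the node tokens from the nodetool ring output quoted further down. It buckets synthetic keys of our three shapes (timestamp+sequence, yyyymmdd, uuid) into the eight ranges to see whether the partitioner alone could produce an imbalance like ours.

# Sanity check: do keys shaped like ours spread evenly over our 8 token ranges?
import hashlib
import uuid
from bisect import bisect_left
from collections import Counter

# Node tokens from the nodetool ring output quoted below (nodes 1..8).
TOKENS = [
    56775407874461455114148055497453867724,
    78043055807020109080608968461939380940,
    99310703739578763047069881426424894156,
    120578351672137417013530794390910407372,
    141845999604696070979991707355395920588,
    163113647537254724946452620319881433804,
    184381295469813378912913533284366947020,
    205648943402372032879374446248852460236,
]

def token(key):
    # md5 digest read as a signed 128-bit integer, then abs() --
    # my approximation of what RandomPartitioner does with a key.
    digest = hashlib.md5(key.encode("ascii")).digest()
    return abs(int.from_bytes(digest, "big", signed=True))

def owner(key):
    # Node i owns (TOKENS[i-1], TOKENS[i]]; tokens above the last entry
    # wrap around to node 1 (index 0).
    return bisect_left(TOKENS, token(key)) % len(TOKENS)

def spread(keys):
    counts = Counter(owner(k) for k in keys)
    return [counts[i] for i in range(len(TOKENS))]  # index 0 = node 1

# 1) timestamp prefix plus sequential 10-digit suffix, e.g. 00013270494972450001234567
seq_keys = ["0001327049497245%010d" % i for i in range(100000)]
# 2) "yyyymmdd" date keys (one year's worth, 28 days per month is enough here)
date_keys = ["2011%02d%02d" % (m, d) for m in range(1, 13) for d in range(1, 29)]
# 3) uuid keys, hashed as their string form (an assumption about how we store them)
uuid_keys = [str(uuid.uuid4()) for _ in range(100000)]

for name, keys in (("sequential", seq_keys), ("dates", date_keys), ("uuids", uuid_keys)):
    print(name, spread(keys))

One thing this makes me wonder about: if I have RandomPartitioner's maximum token right (2**127), then the two highest tokens in the ring below are already larger than any md5-derived token can be, so the sketch should show node 8 getting essentially no keys and node 1 getting everything that hashes below the first token, regardless of the key shape. That might fit the 19.79 GB on node 8.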
Those date- and uuid-keyed cfs aren't balanced either in themselves: e.g. node 5 has 12 GB of live space used in the cf with the uuid as key, and node 8 only 428 MB.

Cheers,
Marcel

> On Thu, Jan 19, 2012 at 3:22 AM, Marcel Steinbach wrote:
> On 18.01.2012, at 02:19, Maki Watanabe wrote:
>> Is there any significant difference in the number of sstables on each node?
> No, no significant difference there. Actually, node 8 is among those with more sstables but with the least load (20 GB).
>
> On 17.01.2012, at 20:14, Jeremiah Jordan wrote:
>> Are you deleting data or using TTLs? Expired/deleted data won't go away until the sstable holding it is compacted. So if compaction has happened on some nodes, but not on others, you will see this. The disparity is pretty big, 400 GB to 20 GB, so this probably isn't the issue, but with our data using TTLs, if I run major compactions a couple of times on a column family it can shrink ~30-40%.
> Yes, we do delete data. But I agree, the disparity is too big to blame only the deletions.
>
> Also, initially, we started out with 3 nodes and upgraded to 8 a few weeks ago. After adding the nodes, we did compactions and cleanups and still didn't have a balanced cluster. So that should have removed outdated data, right?
>
>> 2012/1/18 Marcel Steinbach:
>>> We are running regular repairs, so I don't think that's the problem.
>>> And the data dir sizes match approx. the load from nodetool.
>>> Thanks for the advice, though.
>>>
>>> Our keys are digits only, and all contain a few zeros at the same
>>> offsets. I'm not that familiar with the md5 algorithm, but I doubt that it
>>> would generate 'hotspots' for those kinds of keys, right?
>>>
>>> On 17.01.2012, at 17:34, Mohit Anchlia wrote:
>>>
>>> Have you tried running repair first on each node? Also, verify using
>>> df -h on the data dirs.
>>>
>>> On Tue, Jan 17, 2012 at 7:34 AM, Marcel Steinbach wrote:
>>>
>>> Hi,
>>>
>>> we're using RP and have each node assigned the same amount of the token
>>> space. The cluster looks like this:
>>>
>>> Address  Status  State   Load       Owns    Token
>>>                                             205648943402372032879374446248852460236
>>> 1        Up      Normal  310.83 GB  12.50%  56775407874461455114148055497453867724
>>> 2        Up      Normal  470.24 GB  12.50%  78043055807020109080608968461939380940
>>> 3        Up      Normal  271.57 GB  12.50%  99310703739578763047069881426424894156
>>> 4        Up      Normal  282.61 GB  12.50%  120578351672137417013530794390910407372
>>> 5        Up      Normal  248.76 GB  12.50%  141845999604696070979991707355395920588
>>> 6        Up      Normal  164.12 GB  12.50%  163113647537254724946452620319881433804
>>> 7        Up      Normal  76.23 GB   12.50%  184381295469813378912913533284366947020
>>> 8        Up      Normal  19.79 GB   12.50%  205648943402372032879374446248852460236
>>>
>>> I was under the impression the RP would distribute the load more evenly.
>>>
>>> Our row sizes are 0.5-1 KB, so we don't store huge rows on a single
>>> node. Should we just move the nodes so that the load is more evenly
>>> distributed, or is there something off that needs to be fixed first?
>>>
>>> Thanks
>>> Marcel
>>
>> --
>> w3m
>
> --
> Narendra Sharma
> Software Engineer
> http://www.aeris.com
> http://narendrasharma.blogspot.com/
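PS: If we do end up moving nodes onto evenly spaced tokens, as Narendra suggests, my understanding is that the usual formula for RandomPartitioner is simply i * 2**127 / N for node i of N nodes (which also keeps every token at or below the partitioner's maximum). A minimal sketch, with N = 8 being just our current node count:

# Print evenly spaced RandomPartitioner tokens for an 8-node ring.
# Each node would then get a "nodetool move <token>" followed by
# a "nodetool cleanup".
N = 8
for i in range(N):
    print("node %d: %d" % (i + 1, i * 2**127 // N))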