Subject: Re: Unbalanced cluster with RandomPartitioner
From: Marcel Steinbach
Date: Fri, 20 Jan 2012 10:32:24 +0100
To: user@cassandra.apache.org

On 19.01.2012, at 20:15, Narendra Sharma wrote:

> I believe you need to move the nodes on the ring. What was the load on the nodes before you added 5 new nodes? It's just that you are getting more data in certain token ranges than in others.

With three nodes, it was also imbalanced.

What I don't understand is why the md5 sums would generate such massive hot spots.

Most of our keys look like this:

00013270494972450001234567

with the first 16 digits being a timestamp of one of our application servers' startup times, and the last 10 digits being sequentially generated per user.

There may be a lot of keys that start with e.g. "0001327049497245" (or some other timestamp). But I was under the impression that md5 doesn't care about that and generates a uniform distribution anyway? Then again, I know next to nothing about md5. Maybe someone else has better insight into the algorithm? (A quick sanity check is sketched below.)

However, we also use cfs with a date ("yyyymmdd") as key, as well as cfs with uuids as keys.
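The following is only a rough sketch I used to convince myself: it mimics what I understand RandomPartitioner to do (abs of the md5 digest read as a big integer) rather than using Cassandra's actual code, it assumes the keys are hashed as their ASCII string form, and it takes the node tokens from the nodetool ring output quoted further down. It buckets synthetic keys of our three shapes (timestamp+sequence, yyyymmdd, uuid) into the eight ranges to see whether the partitioner alone could produce an imbalance like ours.

# Sanity check: do keys shaped like ours spread evenly over our 8 token ranges?
import hashlib
import uuid
from bisect import bisect_left
from collections import Counter

# Node tokens from the nodetool ring output quoted below (nodes 1..8).
TOKENS = [
    56775407874461455114148055497453867724,
    78043055807020109080608968461939380940,
    99310703739578763047069881426424894156,
    120578351672137417013530794390910407372,
    141845999604696070979991707355395920588,
    163113647537254724946452620319881433804,
    184381295469813378912913533284366947020,
    205648943402372032879374446248852460236,
]

def token(key):
    # md5 digest read as a signed 128-bit integer, then abs() --
    # my approximation of what RandomPartitioner does with a key.
    digest = hashlib.md5(key.encode("ascii")).digest()
    return abs(int.from_bytes(digest, "big", signed=True))

def owner(key):
    # Node i owns (TOKENS[i-1], TOKENS[i]]; tokens above the last entry
    # wrap around to node 1 (index 0).
    return bisect_left(TOKENS, token(key)) % len(TOKENS)

def spread(keys):
    counts = Counter(owner(k) for k in keys)
    return [counts[i] for i in range(len(TOKENS))]  # index 0 = node 1

# 1) timestamp prefix plus sequential 10-digit suffix, e.g. 00013270494972450001234567
seq_keys = ["0001327049497245%010d" % i for i in range(100000)]
# 2) "yyyymmdd" date keys (one year's worth, 28 days per month is enough here)
date_keys = ["2011%02d%02d" % (m, d) for m in range(1, 13) for d in range(1, 29)]
# 3) uuid keys, hashed as their string form (an assumption about how we store them)
uuid_keys = [str(uuid.uuid4()) for _ in range(100000)]

for name, keys in (("sequential", seq_keys), ("dates", date_keys), ("uuids", uuid_keys)):
    print(name, spread(keys))

One thing this makes me wonder about: if I have RandomPartitioner's maximum token right (2**127), then the two highest tokens in the ring below are already larger than any md5-derived token can be, so the sketch should show node 8 getting essentially no keys and node 1 getting everything that hashes below the first token, regardless of the key shape. That might fit the 19.79 GB on node 8.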
Those date- and uuid-keyed cfs aren't balanced either in themselves: e.g. node 5 has 12 GB of live space used in the cf with the uuid as key, and node 8 only 428 MB.

Cheers,
Marcel

> On Thu, Jan 19, 2012 at 3:22 AM, Marcel Steinbach wrote:
> On 18.01.2012, at 02:19, Maki Watanabe wrote:
>> Is there any significant difference in the number of sstables on each node?
> No, no significant difference there. Actually, node 8 is among those with more sstables but with the least load (20 GB).
>
> On 17.01.2012, at 20:14, Jeremiah Jordan wrote:
>> Are you deleting data or using TTLs? Expired/deleted data won't go away until the sstable holding it is compacted. So if compaction has happened on some nodes, but not on others, you will see this. The disparity is pretty big, 400 GB to 20 GB, so this probably isn't the issue, but with our data using TTLs, if I run major compactions a couple of times on a column family it can shrink ~30-40%.
> Yes, we do delete data. But I agree, the disparity is too big to blame only the deletions.
>
> Also, initially, we started out with 3 nodes and upgraded to 8 a few weeks ago. After adding the nodes, we did compactions and cleanups and still didn't have a balanced cluster. So that should have removed outdated data, right?
>
>> 2012/1/18 Marcel Steinbach:
>>> We are running regular repairs, so I don't think that's the problem.
>>> And the data dir sizes match approx. the load from nodetool.
>>> Thanks for the advice, though.
>>>
>>> Our keys are digits only, and all contain a few zeros at the same
>>> offsets. I'm not that familiar with the md5 algorithm, but I doubt that it
>>> would generate 'hotspots' for those kinds of keys, right?
>>>
>>> On 17.01.2012, at 17:34, Mohit Anchlia wrote:
>>>
>>> Have you tried running repair first on each node? Also, verify using
>>> df -h on the data dirs.
>>>
>>> On Tue, Jan 17, 2012 at 7:34 AM, Marcel Steinbach wrote:
>>>
>>> Hi,
>>>
>>> we're using RP and have each node assigned the same amount of the token
>>> space. The cluster looks like this:
>>>
>>> Address  Status  State   Load       Owns    Token
>>>                                             205648943402372032879374446248852460236
>>> 1        Up      Normal  310.83 GB  12.50%  56775407874461455114148055497453867724
>>> 2        Up      Normal  470.24 GB  12.50%  78043055807020109080608968461939380940
>>> 3        Up      Normal  271.57 GB  12.50%  99310703739578763047069881426424894156
>>> 4        Up      Normal  282.61 GB  12.50%  120578351672137417013530794390910407372
>>> 5        Up      Normal  248.76 GB  12.50%  141845999604696070979991707355395920588
>>> 6        Up      Normal  164.12 GB  12.50%  163113647537254724946452620319881433804
>>> 7        Up      Normal  76.23 GB   12.50%  184381295469813378912913533284366947020
>>> 8        Up      Normal  19.79 GB   12.50%  205648943402372032879374446248852460236
>>>
>>> I was under the impression the RP would distribute the load more evenly.
>>>
>>> Our row sizes are 0.5-1 KB, so we don't store huge rows on a single
>>> node. Should we just move the nodes so that the load is more evenly
>>> distributed, or is there something off that needs to be fixed first?
>>>
>>> Thanks
>>> Marcel
>>
>> --
>> w3m
>
> --
> Narendra Sharma
> Software Engineer
> http://www.aeris.com
> http://narendrasharma.blogspot.com/
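PS: If we do end up moving nodes onto evenly spaced tokens, as Narendra suggests, my understanding is that the usual formula for RandomPartitioner is simply i * 2**127 / N for node i of N nodes (which also keeps every token at or below the partitioner's maximum). A minimal sketch, with N = 8 being just our current node count:

# Print evenly spaced RandomPartitioner tokens for an 8-node ring.
# Each node would then get a "nodetool move <token>" followed by
# a "nodetool cleanup".
N = 8
for i in range(N):
    print("node %d: %d" % (i + 1, i * 2**127 // N))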