Date: Fri, 25 Mar 2016 12:20:34 -0700
Subject: Re: Balancing out skews in FULL_AUTO mode with built-in rebalancer
From: kishore g
To: user@helix.apache.org

So computeOrphaned is the one that's causing the behavior. In the beginning, when nothing is assigned, all replicas are considered orphans. Once they are considered orphans, they get assigned to an arbitrary node, and this overrides everything that was computed by the placement scheme.

I think the logic in computeOrphaned is broken: a replica should be treated as an orphan only if its preferred node is not part of the live-node list.

Try this in computeOrphaned. Note that the test case might fail because of this change, and you might have to update it to match the new behavior. I think it would be good to put this behavior behind a cluster config parameter.

  private Set<Replica> computeOrphaned() {
    Set<Replica> orphanedPartitions = new TreeSet<Replica>();
    // A replica is orphaned only when its preferred node is not in the live-node list
    for (Entry<Replica, Node> entry : _preferredAssignment.entrySet()) {
      if (!_liveNodesList.contains(entry.getValue())) {
        orphanedPartitions.add(entry.getKey());
      }
    }
    // Replicas that already have an assignment, preferred or not, are never orphans
    for (Replica r : _existingPreferredAssignment.keySet()) {
      if (orphanedPartitions.contains(r)) {
        orphanedPartitions.remove(r);
      }
    }
    for (Replica r : _existingNonPreferredAssignment.keySet()) {
      if (orphanedPartitions.contains(r)) {
        orphanedPartitions.remove(r);
      }
    }
    return orphanedPartitions;
  }
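To make the difference between the two orphan rules concrete, here is a small standalone sketch on toy data. It is not Helix code; the local maps and lists below only mirror the fields used in the method above, and the node names come from the countLog-2a example in the thread below.

import java.util.*;

// Toy comparison of the two orphan rules on a fresh cluster with nothing assigned yet.
public class OrphanRuleSketch {
  public static void main(String[] args) {
    Map<String, String> preferredAssignment = new LinkedHashMap<>();
    preferredAssignment.put("countLog-2a_0", "localhost-server-6");
    preferredAssignment.put("countLog-2a_1", "localhost-server-7");
    List<String> liveNodes = Arrays.asList("localhost-server-1", "localhost-server-2",
        "localhost-server-6", "localhost-server-7");
    Set<String> existingAssignment = new HashSet<>(); // fresh cluster: nothing assigned yet

    // Current rule as described above: anything without an existing assignment is an
    // orphan, so both replicas fall into the orphan path and can land on any node.
    Set<String> orphansOld = new TreeSet<>(preferredAssignment.keySet());
    orphansOld.removeAll(existingAssignment);

    // Proposed rule: a replica is an orphan only if its preferred node is not live.
    Set<String> orphansNew = new TreeSet<>();
    for (Map.Entry<String, String> e : preferredAssignment.entrySet()) {
      if (!liveNodes.contains(e.getValue())) {
        orphansNew.add(e.getKey());
      }
    }
    orphansNew.removeAll(existingAssignment);

    System.out.println("Orphans under current rule:  " + orphansOld); // both partitions
    System.out.println("Orphans under proposed rule: " + orphansNew); // empty
  }
}

With an empty orphan set on a fresh cluster, the assignment computed by the placement scheme is left intact instead of being overridden by the orphan-placement step.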
On Fri, Mar 25, 2016 at 8:41 AM, Vinoth Chandar wrote:

> Here you go
>
> https://gist.github.com/vinothchandar/18feedfa84650e3efdc0
>
> On Fri, Mar 25, 2016 at 8:32 AM, kishore g wrote:
>
>> Can you point me to your code? Fork/patch?
>>
>> On Fri, Mar 25, 2016 at 5:26 AM, Vinoth Chandar wrote:
>>
>>> Hi Kishore,
>>>
>>> Printed out more information and trimmed the test down to 1 resource
>>> with 2 partitions, and I bring up 8 servers in parallel.
>>>
>>> Below is the paste of my logging output + annotations.
>>>
>>> Computing partition assignment
>>>> NodeShift for countLog-2a 0 is 5, index 5
>>>> NodeShift for countLog-2a 1 is 5, index 6
>>>
>>> VC: So this part seems fine. We pick nodes at index 5 & 6 instead of 0, 1.
>>>
>>>> Preferred Assignment: {countLog-2a_0|0=##########
>>> name=localhost-server-6
>>> preferred:0
>>> nonpreferred:0, countLog-2a_1|0=##########
>>> name=localhost-server-7
>>> preferred:0
>>> nonpreferred:0}
>>>
>>> VC: This translates to server-6/server-7 (since I named them starting at 1).
>>>
>>>> Existing Preferred Assignment: {}
>>>> Existing Non Preferred Assignment: {}
>>>> Orphaned: [countLog-2a_0|0, countLog-2a_1|0]
>>> Final State Map: {0=ONLINE}
>>>> Final ZK record: countLog-2a,
>>> {}{countLog-2a_0={localhost-server-1=ONLINE},
>>> countLog-2a_1={localhost-server-1=ONLINE}}{countLog-2a_0=[localhost-server-1],
>>> countLog-2a_1=[localhost-server-1]}
>>>
>>> VC: But the final effect still seems to be assigning the partitions to
>>> servers 1 & 2 (the first two).
>>>
>>> Any ideas on where to start poking?
>>>
>>> Thanks
>>> Vinoth
>>>
>>> On Tue, Mar 15, 2016 at 5:52 PM, Vinoth Chandar wrote:
>>>
>>>> Hi Kishore,
>>>>
>>>> I think the changes I made are exercised when computing the preferred
>>>> assignment; later, when the reconciliation happens with the existing
>>>> assignment/orphaned partitions etc., I think it does not take effect.
>>>>
>>>> The effective assignment I saw was that all partitions (2 per resource)
>>>> were assigned to the first 2 servers. I started to dig into the
>>>> above-mentioned parts of the code; will report back tomorrow when I pick
>>>> this back up.
>>>>
>>>> Thanks,
>>>> Vinoth
>>>>
>>>> _____________________________
>>>> From: kishore g
>>>> Sent: Tuesday, March 15, 2016 2:01 PM
>>>> Subject: Re: Balancing out skews in FULL_AUTO mode with built-in rebalancer
>>>> To: user@helix.apache.org
>>>>
>>>> 1) I am guessing it gets overridden by other logic in
>>>> computePartitionAssignment(..), the end assignment is still skewed.
>>>>
>>>> What is the logic you are referring to?
>>>>
>>>> Can you print the assignment count for your use case?
>>>>
>>>> thanks,
>>>> Kishore G
>>>>
>>>> On Tue, Mar 15, 2016 at 1:45 PM, Vinoth Chandar wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> We are hitting a fairly well-known issue where we have 100s of resources
>>>>> with < 8 partitions each, spread across 10 servers, and the built-in
>>>>> assignment always assigns partitions from first to last, resulting in
>>>>> heavy skew for a few nodes.
>>>>>
>>>>> Chatted with Kishore offline and made a patch as here. Tested with
>>>>> 5 resources with 2 partitions each across 8 servers; logging out the
>>>>> nodeShift & ultimate index picked does indicate that we choose servers
>>>>> other than the first two, which is good.
>>>>>
>>>>> But:
>>>>> 1) I am guessing it gets overridden by other logic in
>>>>> computePartitionAssignment(..); the end assignment is still skewed.
>>>>> 2) Even with murmur hash, there is some skew on the nodeShift, which
>>>>> needs to be ironed out.
>>>>>
>>>>> I will keep chipping at this. Any feedback appreciated.
>>>>>
>>>>> Thanks
>>>>> Vinoth
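To illustrate the skew discussed in the quoted thread: a self-contained sketch of why starting every resource's placement at node index 0 overloads the first servers, and how a per-resource shift spreads the load. This is neither the built-in rebalancer nor the patch from the gist; String.hashCode() below is only a stand-in for the murmur hash mentioned above.

import java.util.*;

// Toy model: many small resources, 10 nodes, 2 partitions per resource.
public class NodeShiftSketch {
  public static void main(String[] args) {
    int numNodes = 10;
    int partitionsPerResource = 2;
    int numResources = 100;

    Map<Integer, Integer> withoutShift = new TreeMap<>();
    Map<Integer, Integer> withShift = new TreeMap<>();

    for (int r = 0; r < numResources; r++) {
      String resource = "resource-" + r;
      int shift = Math.floorMod(resource.hashCode(), numNodes); // per-resource offset
      for (int p = 0; p < partitionsPerResource; p++) {
        // Without a shift, every resource starts at node 0, so nodes 0 and 1
        // receive a partition from every single resource.
        withoutShift.merge(p % numNodes, 1, Integer::sum);
        // With a shift, each resource starts at its own offset.
        withShift.merge((shift + p) % numNodes, 1, Integer::sum);
      }
    }

    System.out.println("Partitions per node without shift: " + withoutShift);
    System.out.println("Partitions per node with shift:    " + withShift);
  }
}

Running it puts all 200 partitions on nodes 0 and 1 without a shift, and a roughly even but not perfect spread with one, which matches the residual skew noted in point 2 of the thread.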