Date: Fri, 25 Mar 2016 12:20:34 -0700
Subject: Re: Balancing out skews in FULL_AUTO mode with built-in rebalancer
From: kishore g
To: user@helix.apache.org

So computeOrphaned is the one that's causing the behavior. In the beginning, when nothing is assigned, all replicas are considered orphans. Once they are considered orphans, they get assigned to an arbitrary node, and this overrides everything that was computed by the placement scheme.

I think the logic in computeOrphaned is broken: a replica should be treated as an orphan only if its preferred node is not part of the live-node list.

Try this in computeOrphaned. Note that the test case might fail because of this change, and you might have to update it to match the new behavior. I think it would be good to put this behavior behind a cluster config parameter.

  private Set<Replica> computeOrphaned() {
    Set<Replica> orphanedPartitions = new TreeSet<Replica>();
    // A replica is orphaned only when its preferred node is not in the live-node list
    for (Entry<Replica, Node> entry : _preferredAssignment.entrySet()) {
      if (!_liveNodesList.contains(entry.getValue())) {
        orphanedPartitions.add(entry.getKey());
      }
    }
    // Replicas that already have an assignment, preferred or not, are never orphans
    for (Replica r : _existingPreferredAssignment.keySet()) {
      if (orphanedPartitions.contains(r)) {
        orphanedPartitions.remove(r);
      }
    }
    for (Replica r : _existingNonPreferredAssignment.keySet()) {
      if (orphanedPartitions.contains(r)) {
        orphanedPartitions.remove(r);
      }
    }
    return orphanedPartitions;
  }
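To make the difference between the two orphan rules concrete, here is a small standalone sketch on toy data. It is not Helix code; the local maps and lists below only mirror the fields used in the method above, and the node names come from the countLog-2a example in the thread below.

import java.util.*;

// Toy comparison of the two orphan rules on a fresh cluster with nothing assigned yet.
public class OrphanRuleSketch {
  public static void main(String[] args) {
    Map<String, String> preferredAssignment = new LinkedHashMap<>();
    preferredAssignment.put("countLog-2a_0", "localhost-server-6");
    preferredAssignment.put("countLog-2a_1", "localhost-server-7");
    List<String> liveNodes = Arrays.asList("localhost-server-1", "localhost-server-2",
        "localhost-server-6", "localhost-server-7");
    Set<String> existingAssignment = new HashSet<>(); // fresh cluster: nothing assigned yet

    // Current rule as described above: anything without an existing assignment is an
    // orphan, so both replicas fall into the orphan path and can land on any node.
    Set<String> orphansOld = new TreeSet<>(preferredAssignment.keySet());
    orphansOld.removeAll(existingAssignment);

    // Proposed rule: a replica is an orphan only if its preferred node is not live.
    Set<String> orphansNew = new TreeSet<>();
    for (Map.Entry<String, String> e : preferredAssignment.entrySet()) {
      if (!liveNodes.contains(e.getValue())) {
        orphansNew.add(e.getKey());
      }
    }
    orphansNew.removeAll(existingAssignment);

    System.out.println("Orphans under current rule:  " + orphansOld); // both partitions
    System.out.println("Orphans under proposed rule: " + orphansNew); // empty
  }
}

With an empty orphan set on a fresh cluster, the assignment computed by the placement scheme is left intact instead of being overridden by the orphan-placement step.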
On Fri, Mar 25, 2016 at 8:41 AM, Vinoth Chandar wrote:

> Here you go
>
> https://gist.github.com/vinothchandar/18feedfa84650e3efdc0
>
> On Fri, Mar 25, 2016 at 8:32 AM, kishore g wrote:
>
>> Can you point me to your code? Fork/patch?
>>
>> On Fri, Mar 25, 2016 at 5:26 AM, Vinoth Chandar wrote:
>>
>>> Hi Kishore,
>>>
>>> Printed out more information and trimmed the test down to 1 resource
>>> with 2 partitions, and I bring up 8 servers in parallel.
>>>
>>> Below is the paste of my logging output + annotations.
>>>
>>> Computing partition assignment
>>>> NodeShift for countLog-2a 0 is 5, index 5
>>>> NodeShift for countLog-2a 1 is 5, index 6
>>>
>>> VC: So this part seems fine. We pick nodes at index 5 & 6 instead of 0, 1.
>>>
>>>> Preferred Assignment: {countLog-2a_0|0=##########
>>> name=localhost-server-6
>>> preferred:0
>>> nonpreferred:0, countLog-2a_1|0=##########
>>> name=localhost-server-7
>>> preferred:0
>>> nonpreferred:0}
>>>
>>> VC: This translates to server-6/server-7 (since I named them starting at 1).
>>>
>>>> Existing Preferred Assignment: {}
>>>> Existing Non Preferred Assignment: {}
>>>> Orphaned: [countLog-2a_0|0, countLog-2a_1|0]
>>> Final State Map: {0=ONLINE}
>>>> Final ZK record: countLog-2a,
>>> {}{countLog-2a_0={localhost-server-1=ONLINE},
>>> countLog-2a_1={localhost-server-1=ONLINE}}{countLog-2a_0=[localhost-server-1],
>>> countLog-2a_1=[localhost-server-1]}
>>>
>>> VC: But the final effect still seems to be assigning the partitions to
>>> servers 1 & 2 (the first two).
>>>
>>> Any ideas on where to start poking?
>>>
>>> Thanks
>>> Vinoth
>>>
>>> On Tue, Mar 15, 2016 at 5:52 PM, Vinoth Chandar wrote:
>>>
>>>> Hi Kishore,
>>>>
>>>> I think the changes I made are exercised when computing the preferred
>>>> assignment; later, when the reconciliation happens with the existing
>>>> assignment/orphaned partitions etc., I think it does not take effect.
>>>>
>>>> The effective assignment I saw was that all partitions (2 per resource)
>>>> were assigned to the first 2 servers. I started to dig into the
>>>> above-mentioned parts of the code; will report back tomorrow when I pick
>>>> this back up.
>>>>
>>>> Thanks,
>>>> Vinoth
>>>>
>>>> _____________________________
>>>> From: kishore g
>>>> Sent: Tuesday, March 15, 2016 2:01 PM
>>>> Subject: Re: Balancing out skews in FULL_AUTO mode with built-in rebalancer
>>>> To: user@helix.apache.org
>>>>
>>>> 1) I am guessing it gets overridden by other logic in
>>>> computePartitionAssignment(..), the end assignment is still skewed.
>>>>
>>>> What is the logic you are referring to?
>>>>
>>>> Can you print the assignment count for your use case?
>>>>
>>>> thanks,
>>>> Kishore G
>>>>
>>>> On Tue, Mar 15, 2016 at 1:45 PM, Vinoth Chandar wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> We are hitting a fairly well-known issue where we have 100s of resources
>>>>> with < 8 partitions each, spread across 10 servers, and the built-in
>>>>> assignment always assigns partitions from first to last, resulting in
>>>>> heavy skew for a few nodes.
>>>>>
>>>>> Chatted with Kishore offline and made a patch as here. Tested with
>>>>> 5 resources with 2 partitions each across 8 servers; logging out the
>>>>> nodeShift & ultimate index picked does indicate that we choose servers
>>>>> other than the first two, which is good.
>>>>>
>>>>> But:
>>>>> 1) I am guessing it gets overridden by other logic in
>>>>> computePartitionAssignment(..); the end assignment is still skewed.
>>>>> 2) Even with murmur hash, there is some skew on the nodeShift, which
>>>>> needs to be ironed out.
>>>>>
>>>>> I will keep chipping at this. Any feedback appreciated.
>>>>>
>>>>> Thanks
>>>>> Vinoth
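To illustrate the skew discussed in the quoted thread: a self-contained sketch of why starting every resource's placement at node index 0 overloads the first servers, and how a per-resource shift spreads the load. This is neither the built-in rebalancer nor the patch from the gist; String.hashCode() below is only a stand-in for the murmur hash mentioned above.

import java.util.*;

// Toy model: many small resources, 10 nodes, 2 partitions per resource.
public class NodeShiftSketch {
  public static void main(String[] args) {
    int numNodes = 10;
    int partitionsPerResource = 2;
    int numResources = 100;

    Map<Integer, Integer> withoutShift = new TreeMap<>();
    Map<Integer, Integer> withShift = new TreeMap<>();

    for (int r = 0; r < numResources; r++) {
      String resource = "resource-" + r;
      int shift = Math.floorMod(resource.hashCode(), numNodes); // per-resource offset
      for (int p = 0; p < partitionsPerResource; p++) {
        // Without a shift, every resource starts at node 0, so nodes 0 and 1
        // receive a partition from every single resource.
        withoutShift.merge(p % numNodes, 1, Integer::sum);
        // With a shift, each resource starts at its own offset.
        withShift.merge((shift + p) % numNodes, 1, Integer::sum);
      }
    }

    System.out.println("Partitions per node without shift: " + withoutShift);
    System.out.println("Partitions per node with shift:    " + withShift);
  }
}

Running it puts all 200 partitions on nodes 0 and 1 without a shift, and a roughly even but not perfect spread with one, which matches the residual skew noted in point 2 of the thread.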