From: Josh Wills
Date: Wed, 9 Dec 2015 20:56:11 -0800
Subject: Re: Secondary sort and partitioning in Spark
To: user@crunch.apache.org

Hrm-- so you're saying records for the same GroupByKey are ending up in
different partitions when you're doing a secondary sort? Sounds like a bug in
the SparkPartitioner we're using-- I wonder if it was the same bug that was
fixed here: https://issues.apache.org/jira/browse/CRUNCH-556

On Wed, Dec 9, 2015 at 6:05 PM, Andrey Gusev wrote:

> Hello crunch!
>
> I am running into problems with the partitioning of groups under secondary
> sort running on a SparkPipeline.
>
> What I am observing is that records belonging to a single group may be
> split across two or more calls to the applied DoFn. This could be a gap in
> my understanding of the Spark execution model with respect to locality -
> and if so, can *all* the records belonging to a groupBy key be forced into
> a single call?
>
> Roughly speaking, the code looks like this:
>
> PTableType<GroupByKey, Pair<SortKey, Info>> pType =
>     tableOf(Writables.writables(GroupByKey.class),
>         Writables.pairs(Writables.writables(SortKey.class),
>             Writables.writables(Info.class)));
>
> // note that dataset has been explicitly sharded by numPartitions
> PTable<GroupByKey, Pair<SortKey, Info>> infos = dataset.parallelDo(..., pType);
>
> PTable<SortKey, Info> mergedInfos =
>     SecondarySort.sortAndApply(infos, mergeInfos(...), mergeType, numPartitions);
>
> static class GroupByKey implements Writable {
>
>   public int treeId;
>   public int nodeId;
>   ...
> }
>
> I can confirm that records come in sorted and grouped, but I am also
> observing that a single group may be processed on different nodes. More
> concretely, let's say the group belonging to treeId=0, nodeId=0 has 100
> records; the first 30 may show up on node1 and the remaining on node2 (in
> both cases sorted). Informally, it looks like each node is scheduled to
> process roughly the same number of records. It's especially evident with 2
> partitions, where exactly one group is split.
>
> The semantics of the code (at least for now) require all the values for a
> group to come in within a single call. Can that be forced?
>
> env: spark 1.5 and crunch 0.11.0
>
> Any thoughts would be appreciated!
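For reference, below is a rough, untested sketch of the shape this usually
takes with Crunch's SecondarySort. A few assumptions: SortKey and Info stand
in for the Writable classes from the original message and are not defined
here; MergeInfos is a hypothetical stand-in for the poster's mergeInfos(...)
DoFn; and the value-based hashCode()/equals() on GroupByKey is just something
worth double-checking -- it is not claimed to be the fix for the partitioner
bug referenced above.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.lib.SecondarySort;
import org.apache.crunch.types.PTableType;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.io.Writable;

public class SecondarySortSketch {

  // Key with an explicit, value-based hashCode()/equals(). Partitioning of
  // grouped data is typically driven by a hash of the key, so a missing or
  // inconsistent hashCode() is one possible way a single group can end up
  // spread across partitions (independent of the bug discussed above).
  public static class GroupByKey implements Writable {
    public int treeId;
    public int nodeId;

    @Override
    public void write(DataOutput out) throws IOException {
      out.writeInt(treeId);
      out.writeInt(nodeId);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      treeId = in.readInt();
      nodeId = in.readInt();
    }

    @Override
    public int hashCode() {
      return 31 * treeId + nodeId;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof GroupByKey)) {
        return false;
      }
      GroupByKey that = (GroupByKey) o;
      return treeId == that.treeId && nodeId == that.nodeId;
    }
  }

  // The DoFn handed to sortAndApply receives one
  // Pair<key, Iterable<sorted values>> per group, so when partitioning
  // behaves, all values for a key are visible in a single process() call.
  public static class MergeInfos
      extends DoFn<Pair<GroupByKey, Iterable<Pair<SortKey, Info>>>, Pair<SortKey, Info>> {
    @Override
    public void process(Pair<GroupByKey, Iterable<Pair<SortKey, Info>>> group,
                        Emitter<Pair<SortKey, Info>> emitter) {
      for (Pair<SortKey, Info> value : group.second()) {
        // Values arrive ordered by SortKey; merge and emit as needed.
        emitter.emit(value);
      }
    }
  }

  public static PTable<SortKey, Info> merge(PTable<GroupByKey, Pair<SortKey, Info>> infos,
                                            int numPartitions) {
    PTableType<SortKey, Info> mergeType =
        Writables.tableOf(Writables.writables(SortKey.class),
            Writables.writables(Info.class));
    return SecondarySort.sortAndApply(infos, new MergeInfos(), mergeType, numPartitions);
  }
}

If the SparkPartitioner bug Josh points at is the actual cause, picking up a
release that includes the CRUNCH-556 fix would be the simpler path.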