From: Mohit Jaggi
Date: Thu, 2 Jul 2015 10:27:43 -0700
Subject: Re: Grouping runs of elements in a RDD
To: RJ Nowling
Cc: "Abhishek R. Singh", Reynold Xin, dev@spark.apache.org, user@spark.apache.org

If you are joining successive lines together based on a predicate, then you are doing a "flatMap", not an "aggregate". You are on the right track with a multi-pass solution. I had the same challenge when I needed a sliding window over an RDD (see below). [I had suggested that the sliding window API be moved to spark-core; not sure if that happened.]
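For reference, a minimal sketch of that sliding helper (assuming Spark 1.4 and the MLlib RDDFunctions implicit linked in the previous posts below; the sample data and window size are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.rdd.RDDFunctions._ // adds sliding() to RDDs

object SlidingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sliding-sketch"))
    val lines = sc.parallelize(Seq("a", "b", "c", "d", "e"), numSlices = 3)

    // sliding(2) yields every pair of consecutive elements, including pairs
    // that straddle partition boundaries, which is what the boundary check needs.
    lines.sliding(2).map { case Array(prev, cur) => (prev, cur) }
      .collect().foreach(println)

    sc.stop()
  }
}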
----- previous posts ---

http://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.rdd.RDDFunctions

> On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi wrote:
>
> http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3CCALRVTpKN65rOLzbETC+Ddk4O+YJm+TfAF5DZ8EuCpL-2YHYPZA@mail.gmail.com%3E
>
> You can use the MLlib function or do the following (which is what I had done):
>
> - In the first pass over the data, using mapPartitionsWithIndex, gather the first item in each partition. You can use collect (or an aggregator) for this. "Key" them by the partition index. At the end, you will have a map (partition index) --> first item.
> - In the second pass over the data, using mapPartitionsWithIndex again, look at two items at a time (or, in the general case, N items at a time; you can use Scala's sliding iterator) and check the time difference (or do any other sliding-window computation). Pass the map created in the previous step to this mapPartitions call; you will need it to check the last item in each partition.
>
> If you can tolerate a few inaccuracies, then you can do just the second step. You will miss the "boundaries" of the partitions, but that might be acceptable for your use case.
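A rough sketch of the two-pass approach quoted above, specialized to a pairwise computation over consecutive elements (the Long timestamps, the delta computation, and all names are illustrative assumptions, not from the original thread):

import org.apache.spark.rdd.RDD

// Pass 1 collects the first item of each partition; pass 2 slides a window
// of 2 over each partition, borrowing the next partition's head so the
// pair at the partition boundary is not lost.
def pairwiseDeltas(rdd: RDD[Long]): RDD[Long] = {
  val numPartitions = rdd.partitions.length

  // Pass 1: first item of each partition, keyed by partition index.
  val heads: Map[Int, Long] = rdd.mapPartitionsWithIndex { (idx, it) =>
    if (it.hasNext) Iterator((idx, it.next())) else Iterator.empty
  }.collect().toMap
  val headsBc = rdd.sparkContext.broadcast(heads)

  // Pass 2: extend each partition with the head of the next non-empty
  // partition, then compute deltas over consecutive pairs.
  rdd.mapPartitionsWithIndex { (idx, it) =>
    val nextHead = (idx + 1 until numPartitions)
      .flatMap(i => headsBc.value.get(i)).headOption
    (it ++ nextHead.iterator).sliding(2).withPartial(false)
      .map { case Seq(a, b) => b - a }
  }
}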
On Tue, Jun 30, 2015 at 12:21 PM, RJ Nowling wrote:

> That's an interesting idea! I hadn't considered that. However, looking at the Partitioner interface, I would need to decide from looking at a single key, which doesn't fit my case, unfortunately. For my case, I need to compare successive pairs of keys. (I'm trying to re-join lines that were split prematurely.)
>
> On Tue, Jun 30, 2015 at 2:07 PM, Abhishek R. Singh wrote:
>
>> Could you use a custom partitioner to preserve boundaries such that all related tuples end up on the same partition?
>>
>> On Jun 30, 2015, at 12:00 PM, RJ Nowling wrote:
>>
>> Thanks, Reynold. I still need to handle incomplete groups that fall between partition boundaries, so I need a two-pass approach. I came up with a somewhat hacky way to handle those using the partition indices and key-value pairs as a second pass after the first.
>>
>> OCaml's standard library provides a function called group() that takes a break function that operates on pairs of successive elements. It seems a similar approach could be used in Spark and would be more efficient than my approach with key-value pairs, since you know the ordering of the partitions.
>>
>> Has this need been expressed by others?
>>
>> On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin wrote:
>>
>>> Try mapPartitions, which gives you an iterator, and you can produce an iterator back.
>>>
>>> On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have a problem where I have an RDD of elements:
>>>>
>>>> Item1 Item2 Item3 Item4 Item5 Item6 ...
>>>>
>>>> and I want to run a function over them to decide which runs of elements to group together:
>>>>
>>>> [Item1 Item2] [Item3] [Item4 Item5 Item6] ...
>>>>
>>>> Technically, I could use aggregate to do this, but I would have to use a List of List of T, which would produce a very large collection in memory.
>>>>
>>>> Is there an easy way to accomplish this? E.g., it would be nice to have a version of aggregate where the combination function can return a complete group that is added to the new RDD and an incomplete group which is passed to the next call of the reduce function.
>>>>
>>>> Thanks,
>>>> RJ
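A minimal sketch of the mapPartitions idea Reynold suggests above, with an OCaml-style break predicate as RJ describes. It groups runs lazily within each partition only, so runs that straddle partition boundaries still need the second pass discussed earlier; the groupRuns name and signature are illustrative, not an existing API:

import org.apache.spark.rdd.RDD

// Group runs of elements using a "break" predicate on successive pairs.
// Groups are produced lazily from an iterator, so no List-of-List for the
// whole RDD is ever materialized.
def groupRuns[T](rdd: RDD[T])(break: (T, T) => Boolean): RDD[List[T]] =
  rdd.mapPartitions { it =>
    val buffered = it.buffered
    new Iterator[List[T]] {
      def hasNext: Boolean = buffered.hasNext
      def next(): List[T] = {
        var prev = buffered.next()
        val run = scala.collection.mutable.ListBuffer(prev)
        // Extend the current run until the break predicate fires.
        while (buffered.hasNext && !break(prev, buffered.head)) {
          prev = buffered.next()
          run += prev
        }
        run.toList
      }
    }
  }

For the line-joining case, the break predicate might be something like (a, b) => !isContinuation(b), where isContinuation is whatever (hypothetical) test marks a prematurely split line.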