Subject: Re: Possible to combine all RDDs from a DStream batch into one?
From: Ted Yu
To: N B
Cc: Jon Chase, user@spark.apache.org
Date: Wed, 15 Jul 2015 20:30:55 -0700

Looks like this method should serve Jon's needs:

    def reduceByWindow(
        reduceFunc: (T, T) => T,
        windowDuration: Duration,
        slideDuration: Duration
      ): DStream[T]
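
For Jon's case, a minimal (untested) sketch of how this could be wired up through the Java API. It assumes a JavaDStream<String> named dStream and a 10-second batch interval; neither comes from this thread:

    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;

    // Fold every element of a batch into one value, so the downstream
    // logic runs exactly once per batch interval.
    JavaDStream<String> onePerBatch = dStream.reduceByWindow(
        (a, b) -> a + "\n" + b,    // reduceFunc: combine two elements into one
        Durations.seconds(10),     // windowDuration: exactly one batch wide
        Durations.seconds(10));    // slideDuration: advance one batch at a time

Note that windowDuration and slideDuration must be multiples of the batch interval, so the 10-second values above only make sense under that assumption.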
On Wed, Jul 15, 2015 at 8:23 PM, N B wrote:

> Hi Jon,
>
> In Spark Streaming, 1 batch = 1 RDD. Essentially, the terms are used
> interchangeably. If you are trying to collect multiple batches across a
> DStream into a single RDD, look at the window() operations.
>
> Hope this helps,
> Nikunj
>
>
> On Wed, Jul 15, 2015 at 7:00 PM, Jon Chase wrote:
>
>> I should note that the amount of data in each batch is very small, so I'm
>> not concerned with performance implications of grouping into a single RDD.
>>
>> On Wed, Jul 15, 2015 at 9:58 PM, Jon Chase wrote:
>>
>>> I'm currently doing something like this in my Spark Streaming program
>>> (Java):
>>>
>>>     dStream.foreachRDD((rdd, batchTime) -> {
>>>         log.info("processing RDD from batch {}", batchTime);
>>>         ....
>>>         // my rdd processing code
>>>         ....
>>>     });
>>>
>>> Instead of having my rdd processing code called once for each RDD in the
>>> batch, is it possible to essentially group all of the RDDs from the batch
>>> into a single RDD with a single partition, and therefore operate on all
>>> of the elements in the batch at once?
>>>
>>> My goal here is to do an operation exactly once for every batch. As I
>>> understand it, foreachRDD is going to do the operation once for each RDD
>>> in the batch, which is not what I want.
>>>
>>> I've looked at DStream.repartition(int), but the docs make it sound like
>>> it only changes the number of partitions in the batch's existing RDDs,
>>> not the number of RDDs.
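
Coming back to the window() operations Nikunj mentions above: a hypothetical sketch under the same assumptions (a JavaDStream<String> named dStream, 10-second batch interval), combined with the repartition(int) that Jon already found:

    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;

    // window() merges the RDDs of all batches inside the window into one
    // RDD per window; with window == slide == batch interval, that is one
    // RDD per batch. repartition(1) then moves all of that RDD's elements
    // into a single partition.
    JavaDStream<String> singlePartition = dStream
        .window(Durations.seconds(10), Durations.seconds(10))
        .repartition(1);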