Mailing-List: contact user-help@spark.apache.org; run by ezmlm
Precedence: bulk
MIME-Version: 1.0
In-Reply-To: 
 <CA+rMtKQLzNVUFGpzYL0z3vKf1WRsejPrH1iifdPDEAz7Hw2s9g@mail.gmail.com>
References: 
 <CA+rMtKQLzNVUFGpzYL0z3vKf1WRsejPrH1iifdPDEAz7Hw2s9g@mail.gmail.com>
From: Tathagata Das <tdas@databricks.com>
Date: Tue, 27 Oct 2015 16:44:38 -0700
Message-ID: 
 <CA+AHuKmAbGMdfEOYEmzC+2ajMDLg8tP-oTUDREoSsQAUO5Oyig@mail.gmail.com>
Subject: Re: expected Kinesis checkpoint behavior when driver restarts
To: Hster Geguri <hster.investigates@gmail.com>
Cc: user <user@spark.apache.org>
Content-Type: multipart/alternative; boundary=94eb2c05411efa0c5005231eaa64

--94eb2c05411efa0c5005231eaa64
Content-Type: text/plain; charset=UTF-8

Your observation is correct! The current implementation of checkpointing to
DynamoDB is tied to the presence of new data from Kinesis (I think that
emulates the KCL behavior). If there is no data for a while, the
checkpointing does not occur, so the position stored in DynamoDB can lag
behind the last records processed. That explains your observation.
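
To make the effect concrete, here is a toy simulation in plain Python. It is
not Spark or KCL code: the Checkpointer class, its fields, and simulate() are
hypothetical names made up for this sketch, and the one-record-per-second
stream is an assumption. It only models a checkpoint that is skipped whenever
no new records arrive.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Checkpointer:
    """Toy model of a receiver that checkpoints a stream position."""
    interval: int                      # seconds between checkpoints
    only_on_new_data: bool             # True models the behavior described above
    last_checkpoint: int = 0           # time of the last successful checkpoint
    stored: Optional[int] = None       # sequence number saved in the simulated store
    latest_seen: Optional[int] = None  # highest sequence number received so far

    def tick(self, now: int, records: List[int]) -> None:
        """Advance to time `now`, having received `records` (sequence numbers)."""
        if records:
            self.latest_seen = records[-1]
        if now - self.last_checkpoint < self.interval:
            return  # checkpoint not due yet
        if self.only_on_new_data and not records:
            return  # due, but no data arrived, so the checkpoint is skipped
        self.stored = self.latest_seen
        self.last_checkpoint = now


def simulate(only_on_new_data: bool,
             data_until: int = 25,
             end: int = 120) -> Tuple[Optional[int], Optional[int]]:
    """One record per second until `data_until`, then an idle stream until `end`."""
    cp = Checkpointer(interval=10, only_on_new_data=only_on_new_data)
    for now in range(1, end + 1):
        records = [now] if now <= data_until else []
        cp.tick(now, records)
    return cp.stored, cp.latest_seen


if __name__ == "__main__":
    # Checkpoint tied to new data: stuck at 20 although 25 was processed.
    print(simulate(only_on_new_data=True))   # (20, 25)
    # Checkpoint on a timer regardless of data: catches up to 25.
    print(simulate(only_on_new_data=False))  # (25, 25)
```

In the first case a restart would resume from sequence number 20 and
re-read 21 through 25, no matter how long the stream had been idle before
the driver was killed.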

I have filed a JIRA to fix this:
https://issues.apache.org/jira/browse/SPARK-11359
The fix should be available in 1.6.


On Tue, Oct 27, 2015 at 4:09 PM, Hster Geguri <hster.investigates@gmail.com>
wrote:

> We are using Kinesis with Spark Streaming 1.5 on a YARN cluster. When we
> enable checkpointing in Spark, where in the Kinesis stream should a
> restarted driver continue? I ran a simple experiment as follows:
>
> 1. In the first driver run, the Spark driver processes 1 million records
> starting from InitialPositionInStream.TRIM_HORIZON in 5 second batch
> intervals, with 10 seconds set as the Kinesis receiver checkpoint interval.
> (This interval has been purposely set low to see its impact on where a
> restarted driver would pick up.)
>
> 2. We stop pushing events to the Kinesis stream and wait until the driver
> has been pulling zero events for a few minutes. The first driver is then
> killed manually through "yarn application -kill".
>
> 3. The driver is relaunched a second time and the logs show it
> successfully restored from the DFS checkpoint directory. Because the first
> driver had completely processed all the entries in the stream, I would
> expect the second driver to pick up at the end of the stream or, at
> minimum, at the last 10 second interval window. However, the second driver
> launch (and subsequent driver launches) re-processes about 30 seconds'
> worth of events (100,000), which appears to be unrelated to the Kinesis
> checkpoint interval.
>
> Also, with a Kinesis driver, does it make sense to use Write Ahead Logs
> and incur the cost of writing to DFS when you could remember the
> previous-to-last checkpoint and just reprocess/refetch directly from the
> stream?
>
> Any input is highly appreciated.
>
> Thanks,
> Heji
>


--94eb2c05411efa0c5005231eaa64--