Subject: Re: spark streaming 1.3 kafka topic error
From: Shushant Arora
To: Cody Koeninger
Cc: user
Date: Mon, 31 Aug 2015 16:30:03 +0530

Say my cluster intermittently takes a long time to rebalance for some reason. To handle that, can I have infinite retries instead of killing the app? What should the value of retries be? Will -1 work, or something else?
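For concreteness, the knobs I mean are the ones the direct stream reads when it retries fetching leader offsets. A minimal sketch follows, assuming the Spark 1.3 direct API (KafkaUtils.createDirectStream); the topic name, broker list, batch interval, and the chosen values are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object DirectStreamRetrySketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("direct-kafka-retry-sketch")
          // How many times the direct stream re-fetches leader offsets before
          // failing batch generation (default is 1); raise it so a slow leader
          // election has time to finish.
          .set("spark.streaming.kafka.maxRetries", "10")

        val ssc = new StreamingContext(conf, Seconds(10)) // placeholder batch interval

        val kafkaParams = Map(
          "metadata.broker.list" -> "broker1:9092,broker2:9092,broker3:9092", // placeholder brokers
          // Pause between leader-offset retries, so the batch tolerates roughly
          // maxRetries * this many milliseconds without a leader.
          "refresh.leader.backoff.ms" -> "2000"
        )

        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("testtopic")) // placeholder topic

        stream.foreachRDD { rdd => println(s"records in this batch: ${rdd.count()}") }

        ssc.start()
        ssc.awaitTermination()
      }
    }

Whether a negative value such as -1 actually gives unbounded retries in 1.3 isn't clear to me from the docs, which is why I'm asking; a generously large finite value may be the safer assumption.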
On Thu, Aug 27, 2015 at 6:46 PM, Cody Koeninger wrote:

> Your kafka broker died or you otherwise had a rebalance.
>
> Normally spark retries take care of that.
>
> Is there something going on with your kafka installation, that rebalance
> is taking especially long?
>
> Yes, increasing backoff / max number of retries will "help", but it's
> better to figure out what's going on with kafka.
>
> On Wed, Aug 26, 2015 at 9:07 PM, Shushant Arora wrote:
>
>> Hi
>>
>> My streaming application gets killed with the error below:
>>
>> 15/08/26 21:55:20 ERROR kafka.DirectKafkaInputDStream:
>> ArrayBuffer(kafka.common.NotLeaderForPartitionException,
>> kafka.common.NotLeaderForPartitionException,
>> kafka.common.NotLeaderForPartitionException,
>> kafka.common.NotLeaderForPartitionException,
>> kafka.common.NotLeaderForPartitionException,
>> org.apache.spark.SparkException: Couldn't find leader offsets for
>> Set([testtopic,223], [testtopic,205], [testtopic,64], [testtopic,100],
>> [testtopic,193]))
>> 15/08/26 21:55:20 ERROR scheduler.JobScheduler: Error generating jobs for
>> time 1440626120000 ms
>> org.apache.spark.SparkException:
>> ArrayBuffer(kafka.common.NotLeaderForPartitionException,
>> org.apache.spark.SparkException: Couldn't find leader offsets for
>> Set([testtopic,115]))
>>   at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:94)
>>   at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:116)
>>   at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
>>   at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
>>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>>   at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
>>   at ...
>>
>> Kafka params printed in the job logs are:
>>   value.serializer = class org.apache.kafka.common.serialization.StringSerializer
>>   key.serializer = class org.apache.kafka.common.serialization.StringSerializer
>>   block.on.buffer.full = true
>>   retry.backoff.ms = 100
>>   buffer.memory = 1048576
>>   batch.size = 16384
>>   metrics.sample.window.ms = 30000
>>   metadata.max.age.ms = 300000
>>   receive.buffer.bytes = 32768
>>   timeout.ms = 30000
>>   max.in.flight.requests.per.connection = 5
>>   bootstrap.servers = [broker1:9092, broker2:9092, broker3:9092]
>>   metric.reporters = []
>>   client.id =
>>   compression.type = none
>>   retries = 0
>>   max.request.size = 1048576
>>   send.buffer.bytes = 131072
>>   acks = all
>>   reconnect.backoff.ms = 10
>>   linger.ms = 0
>>   metrics.num.samples = 2
>>   metadata.fetch.timeout.ms = 60000
>>
>> Is the kafka broker going down and killing the job? What's the best way
>> to handle it?
>> Will increasing retries and backoff time help, and what values should
>> they be set to so that the streaming application never fails, but instead
>> keeps retrying every few seconds and sends an event so that my custom code
>> can send a notification that the kafka broker is down, if that's the cause?
>>
>> Thanks
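On the last part of the quoted question (sending an event instead of dying silently): since ssc.awaitTermination() rethrows an exception that stopped the streaming context, the driver can catch it and fire an alert before exiting. A minimal sketch, continuing the code above and replacing its last two lines; notifyOps is a hypothetical helper standing in for whatever alerting integration is actually used:

    // Hypothetical alerting hook; replace with your own pager/email integration.
    def notifyOps(message: String): Unit = println(s"ALERT: $message")

    try {
      ssc.start()
      // awaitTermination() rethrows an exception that stopped the streaming
      // context, e.g. the SparkException wrapping NotLeaderForPartitionException
      // once the configured retries are exhausted.
      ssc.awaitTermination()
    } catch {
      case e: Exception =>
        notifyOps(s"Streaming job stopped: ${e.getMessage}")
        throw e // let the driver exit so an external supervisor can restart it
    }

This doesn't by itself make the app retry forever; an external supervisor or the cluster manager's restart policy is the usual way to approximate that once the in-process retries are exhausted.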