Subject: Re: [HDFSEventSink] Endless loop when HDFSEventSink.process() throws exception
From: Tao Li <litao.buptsse@gmail.com>
To: user@flume.apache.org
Date: Sat, 18 Apr 2015 02:51:30 +0800

@Gwen @Hari

My use case is as follows:
ScribeClient => [Agent1: ScribeSource => KafkaChannel1] => Kafka Cluster => [Agent2: KafkaChannel2 => HDFSEventSink] => HDFS

The bad case is as follows:
My HDFSEventSink needs a "*timestamp*" header, but some dirty data (written by mistake) in Kafka doesn't have the "timestamp" header, which causes the following BucketPath.escapeString call to throw a *NullPointerException*:

String realPath = BucketPath.escapeString(filePath, event.getHeaders(), timeZone, needRounding, roundUnit, roundValue, useLocalTime);

*I think Gwen's second point is OK: we can add an interceptor to do the filtering job* (a rough sketch is below). But my Flume agents are kind of special:
For Agent1, there is no sink; it sends messages to the Kafka cluster directly through KafkaChannel1.
For Agent2, there is no source; it polls events from the Kafka cluster directly through KafkaChannel2.
Agent1 and Agent2 run in different JVMs and are deployed on different nodes.
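Just to make the interceptor idea concrete, here is a rough, untested sketch of the kind of filtering interceptor I have in mind (the class name is made up; it simply drops events that lack the "timestamp" header):

import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Discards any event missing the "timestamp" header so it never reaches HDFSEventSink.
public class RequireTimestampInterceptor implements Interceptor {

  @Override
  public void initialize() {
    // nothing to set up
  }

  @Override
  public Event intercept(Event event) {
    // Returning null tells Flume to drop the event.
    return event.getHeaders().containsKey("timestamp") ? event : null;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    List<Event> kept = new ArrayList<Event>(events.size());
    for (Event e : events) {
      if (intercept(e) != null) {
        kept.add(e);
      }
    }
    return kept;
  }

  @Override
  public void close() {
    // nothing to clean up
  }

  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new RequireTimestampInterceptor();
    }

    @Override
    public void configure(Context context) {
      // no parameters to read
    }
  }
}

The catch is that interceptors are attached to a source, which leads to my problem below.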
*I don't know whether it's reasonable for an agent to have no sink or no source.* But I have already built the whole workflow, and it works well for me in regular cases.

*For Agent2, because it has no source, I can't use Gwen's interceptor suggestion.*

2015-04-18 2:30 GMT+08:00 Hari Shreedharan <hshreedharan@cloudera.com>:

> What I think he means is that a message in the channel cannot be serialized by the serializer because it is malformed, causing the serializer to fail and perhaps throw (think malformed Avro). Such a message would basically be stuck in an infinite loop. So the workaround in (2) would work if using a Kafka Source.
>
> Thanks,
> Hari
>
> On Fri, Apr 17, 2015 at 10:08 AM, Tao Li <litao.buptsse@gmail.com> wrote:
>
>> OK, I got it. Thanks.
>>
>> 2015-04-18 0:59 GMT+08:00 Hari Shreedharan <hshreedharan@cloudera.com>:
>>
>>> Are you using the Kafka channel? The fix I mentioned was for the file channel. Unfortunately, we don't plan to introduce something that drops data in real time. That makes it too easy for a misconfiguration to cause data loss. You'd have to ensure the data in the Kafka channel is valid.
>>>
>>> Thanks,
>>> Hari
>>>
>>> On Fri, Apr 17, 2015 at 9:41 AM, Tao Li <litao.buptsse@gmail.com> wrote:
>>>
>>>> @Hari, you mean I need to ensure the data in Kafka is OK by myself, right?
>>>>
>>>> How about a config option that lets the user decide how to handle BACKOFF? For example, we could configure a max retry count for process(), and also configure whether to commit when the max retry count is exceeded. (In my Kafka case, when dirty data shows up, committing the consumer offset would be much nicer for me than an endless loop.)
>>>>
>>>> 2015-04-18 0:23 GMT+08:00 Hari Shreedharan <hshreedharan@cloudera.com>:
>>>>
>>>>> We recently added functionality to the file channel integrity tool that can be used to remove bad events from the channel - though you would need to write some code to validate events. It will be in the soon-to-be-released 1.6.0.
>>>>>
>>>>> Thanks,
>>>>> Hari
>>>>>
>>>>> On Fri, Apr 17, 2015 at 9:05 AM, Tao Li <litao.buptsse@gmail.com> wrote:
>>>>>
>>>>>> Hi all:
>>>>>>
>>>>>> My use case is KafkaChannel + HDFSEventSink.
>>>>>>
>>>>>> I found that SinkRunner.PollingRunner calls HDFSEventSink.process() in a while loop. For example, a message in Kafka contains dirty data, so HDFSEventSink.process() consumes the message from Kafka, throws an exception because of the *dirty data*, and the *Kafka offset doesn't commit*. The outer loop then calls HDFSEventSink.process() again. Because the Kafka offset hasn't changed, HDFSEventSink consumes the dirty data *again*. The bad loop *never stops*.
>>>>>>
>>>>>> *I want to know whether we have a mechanism to cover this case.* For example, a max retry count for a single HDFSEventSink.process() call, giving up when the limit is exceeded.
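To make the max-retry idea quoted above concrete, the knobs I'm imagining would look something like this in the agent properties file (these keys are purely hypothetical; nothing like them exists in Flume today):

# Hypothetical settings only - not real Flume configuration keys.
agent2.sinks.hdfsSink.process.maxRetries = 5
# When the retry limit is hit, commit the channel transaction (for a Kafka
# channel, the consumer offset) and drop the bad batch instead of looping forever.
agent2.sinks.hdfsSink.process.commitOnMaxRetries = true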