spark-user mailing list archives

From pushkar priyadarshi <priyadarshi.push...@gmail.com>
Subject Re: Kafka streaming "at least once" semantics
Date Fri, 09 Oct 2015 12:31:20 GMT
I am referring to the back pressure implementation here.
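Back pressure is enabled through configuration in Spark 1.5. A minimal sketch (the app name is a placeholder):

```scala
// Sketch: enabling Spark Streaming back pressure, available since Spark 1.5.
// With this set, Spark adapts the ingestion rate to current processing speed.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-at-least-once") // placeholder app name
  .set("spark.streaming.backpressure.enabled", "true")
  // Optional: cap the per-partition fetch rate for the Kafka direct stream
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
```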

On Fri, Oct 9, 2015 at 8:30 AM, pushkar priyadarshi <
priyadarshi.pushkar@gmail.com> wrote:

> Spark 1.5's Kafka direct stream, I believe, does not store messages; rather,
> it fetches messages on demand as they are consumed in the pipeline. That
> should prevent data loss.
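A minimal sketch of the Spark 1.5 direct stream setup, with offsets read off each RDD so they can be committed only after a successful write (broker address and topic name are placeholders):

```scala
// Sketch: Kafka direct stream in Spark 1.5. Offsets travel with each RDD,
// so no receiver or WAL is involved; lost batches can be replayed from Kafka.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val conf = new SparkConf().setAppName("kafka-direct-sketch") // placeholder
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // placeholder
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events")) // placeholder topic

stream.foreachRDD { rdd =>
  // Read the offset ranges this batch covers; persist them atomically
  // with the results so a restart resumes from the right place.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd, then store offsetRanges alongside the output ...
}
```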
>
>
>
> On Fri, Oct 9, 2015 at 7:34 AM, bitborn <andrew.clarkson@ave81.com> wrote:
>
>> Hi all,
>>
>> My company is using Spark Streaming and the Kafka APIs to process an event
>> stream. We've got most of our application written, but are stuck on "at
>> least once" processing.
>>
>> I created a demo to show roughly what we're doing here:
>> https://github.com/bitborn/resilient-kafka-streaming-in-spark
>>
>> The problem we're having is that when the application experiences an
>> exception (network issue, out of memory, etc.) it drops the batch it's
>> processing. The ideal behavior is that it processes each event "at least
>> once", even if that means processing it more than once. Whether this
>> happens via checkpointing, the WAL, or Kafka offsets is irrelevant, as
>> long as we don't drop data. :)
>>
>> A couple of things we've tried:
>> - Using the Kafka direct stream API (via Cody Koeninger's example:
>> https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/IdempotentExample.scala)
>> - Using checkpointing with both the low-level and high-level APIs
>> - Enabling the write ahead log
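The checkpointing and WAL setup referred to in the list above can be sketched roughly as follows (the checkpoint path is a placeholder; note the WAL applies to receiver-based streams, while the direct stream gets replay from Kafka itself):

```scala
// Sketch: checkpoint-based recovery with StreamingContext.getOrCreate,
// plus the write ahead log for receiver-based streams.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/checkpoints" // placeholder; must survive restarts

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("kafka-at-least-once") // placeholder app name
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir)
  // ... define the input stream and processing here ...
  ssc
}

// Recover from the checkpoint if one exists; otherwise build a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```

One common pitfall: the stream and all transformations must be defined inside the create function, or recovery from the checkpoint will fail.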
>>
>> I've included a log here:
>> https://github.com/bitborn/resilient-kafka-streaming-in-spark/blob/master/spark.log
>> but I'm afraid it doesn't reveal much.
>>
>> The fact that others seem to be able to get this working properly suggests
>> we're missing some magic configuration or are possibly executing it in a
>> way
>> that won't support the desired behavior.
>>
>> I'd really appreciate some pointers!
>>
>> Thanks much,
>> Andrew Clarkson
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-streaming-at-least-once-semantics-tp24995.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>>
>
