spark-user mailing list archives

From bitborn <>
Subject Kafka streaming "at least once" semantics
Date Fri, 09 Oct 2015 11:34:04 GMT
Hi all,

My company is using Spark Streaming and the Kafka APIs to process an event
stream. We've got most of our application written, but we're stuck on "at
least once" processing.

I created a demo to show roughly what we're doing here:

The problem we're having is that when the application hits an exception
(network issue, out of memory, etc.), it drops the batch it's processing.
The ideal behavior is that every event gets processed "at least once," even
if that means processing some of them more than once. Whether that happens
via checkpointing, the WAL, or Kafka offsets is irrelevant, as long as we
don't drop data. :)

A couple of things we've tried:
- Using the Kafka direct stream API (per Cody Koeninger)
- Using checkpointing with both the low-level and high-level APIs
- Enabling the write-ahead log
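
To be concrete about the semantics we're after, here's a minimal,
dependency-free sketch of the pattern as we understand it: commit the offset
only after a batch has been fully processed, so a crash mid-batch causes a
replay (possible duplicates) rather than data loss. All the names below are
purely illustrative; this is not the Spark or Kafka API.

```python
# "At least once" sketch: commit the offset only AFTER a batch succeeds.
# A crash before the commit means the next run replays from the last
# committed offset, so events may be seen twice but are never dropped.

log = ["a", "b", "c", "d", "e"]   # stand-in for a Kafka partition
committed = 0                      # last committed offset (durable in real life)
processed = []                     # downstream sink

def run_batch(batch_size, fail_at=None):
    """Process one batch; advance the committed offset only on success."""
    global committed
    batch = log[committed:committed + batch_size]
    for i, event in enumerate(batch):
        if fail_at is not None and committed + i == fail_at:
            raise RuntimeError("simulated crash before commit")
        processed.append(event)
    committed += len(batch)  # commit only after the whole batch succeeded

run_batch(2)                 # processes a, b; commits offset 2

# Second batch crashes mid-way: "c" was processed but never committed,
# so a retry replays it -- a duplicate, not a drop.
try:
    run_batch(2, fail_at=3)  # processes c, crashes on d
except RuntimeError:
    pass

run_batch(2)                 # retries from offset 2: c (again), d
run_batch(1)                 # e

print(processed)             # "c" appears twice: at least once, nothing lost
```

This is the behavior we want from the real pipeline; the question is which
combination of Spark/Kafka configuration actually delivers it.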

I've included a log here (spark.log), but I'm afraid it doesn't reveal much.

The fact that others seem to be able to get this working suggests we're
missing some magic configuration, or possibly running the job in a way that
can't support the desired behavior.

I'd really appreciate some pointers!

Thanks much,
Andrew Clarkson
