spark-issues mailing list archives

From "Cody Koeninger (JIRA)" <>
Subject [jira] [Commented] (SPARK-18258) Sinks need access to offset representation
Date Fri, 04 Nov 2016 20:27:58 GMT


Cody Koeninger commented on SPARK-18258:

The sink doesn't have to reason about equality of the representations.

It just has to be able to store those representations, in addition to the batch id if necessary,
so that the job can be recovered if Spark fails in a way that renders the batch id meaningless,
or if the user wants to switch to a different streaming system.
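
A minimal sketch of what "store those representations" could look like, in Python with
sqlite3 rather than Spark's actual Sink API. The function names, the table layout and the
offset format ({"topic-0": 42}) are all assumptions made up for illustration. The point is
only that the results and the offset representation are written in one transaction, so
either both are recoverable or neither is.

```python
import json
import sqlite3


def create_tables(conn):
    # Hypothetical schema: one table for results, one for per-batch offsets.
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS results ("
            "batch_id INTEGER NOT NULL, value TEXT NOT NULL)"
        )
        conn.execute(
            "CREATE TABLE IF NOT EXISTS committed_offsets ("
            "batch_id INTEGER PRIMARY KEY, offsets TEXT NOT NULL)"
        )


def add_batch(conn, batch_id, rows, end_offsets):
    """Hypothetical sink write: results and offsets commit atomically.

    Returns False if the batch was already committed (e.g. a replay after
    recovery), True if it was written now.
    """
    # The sink only stores the representation; it never compares offsets.
    offsets_json = json.dumps(end_offsets, sort_keys=True)
    # "with conn" commits on success and rolls back if an exception is raised.
    with conn:
        already = conn.execute(
            "SELECT 1 FROM committed_offsets WHERE batch_id = ?", (batch_id,)
        ).fetchone()
        if already:
            return False
        conn.executemany(
            "INSERT INTO results (batch_id, value) VALUES (?, ?)",
            [(batch_id, value) for value in rows],
        )
        conn.execute(
            "INSERT INTO committed_offsets (batch_id, offsets) VALUES (?, ?)",
            (batch_id, offsets_json),
        )
    return True


def latest_offsets(conn):
    """Return the offsets stored with the highest batch id, or None."""
    row = conn.execute(
        "SELECT offsets FROM committed_offsets ORDER BY batch_id DESC LIMIT 1"
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Because the offsets live next to the results, a different streaming system (or a restarted
job whose batch ids no longer mean anything) could call something like latest_offsets and
resume from there without knowing how the batch ids were assigned.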

> Sinks need access to offset representation
> ------------------------------------------
>                 Key: SPARK-18258
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>            Reporter: Cody Koeninger
> Transactional "exactly-once" semantics for output require storing an offset identifier
> in the same transaction as results.
> The Sink.addBatch method currently only has access to batchId and data, not the actual
> offset representation.
> I want to store the actual offsets, so that they are recoverable as long as the results
> are and I'm not locked in to a particular streaming engine.
> I could see this being accomplished by adding parameters to Sink.addBatch for the starting
> and ending offsets (either the offsets themselves, or the SPARK-17829 string/json representation).
> That would be an API change, but if there's another way to map batch ids to offset representations
> without changing the Sink API, that would work as well.
> I'm assuming we don't need the same level of access to offsets throughout a job as e.g.
> the Kafka dstream gives, because Sinks are the main place that should need them.
