From Fabian Hueske <fhue...@gmail.com>
Subject Re: External DB as sink - with processing guarantees
Date Sat, 12 Mar 2016 09:14:05 GMT
Hi Josh,

Flink can guarantee exactly-once processing within its data flow, provided
that the data sources can replay data from a specific position in the
stream. For example, Flink's Kafka consumer can do this and therefore
supports exactly-once processing.

Flink achieves exactly-once processing by resetting operator state to a
consistent state and replaying data. This means that data might actually be
processed more than once, but the operator state will reflect exactly-once
semantics because it was reset. Ensuring exactly-once end-to-end is
difficult, because Flink does not control (and cannot reset) the state of
the sinks. By default, data can be sent to a sink more than once, resulting
in at-least-once semantics at the sink.
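
To make the reset-and-replay behaviour concrete, here is a minimal sketch in
plain Python. It is a toy simulation, not Flink code: the function run, its
parameters, and the single simulated failure are all assumptions made for
illustration. The operator state is a running sum, the checkpoint stores the
source offset together with that state, and the sink is a list that is never
reset.

```python
def run(records, checkpoint_every, fail_at):
    """Sum the records with one simulated failure; return (state, sink)."""
    state = 0            # operator state: a running sum
    sink = []            # stand-in for an external sink that cannot be reset
    checkpoint = (0, 0)  # (source offset, operator state) at the last checkpoint
    offset = 0
    failed = False
    while offset < len(records):
        if offset == fail_at and not failed:
            # Simulated failure: restore operator state and rewind the source.
            # Note that the sink keeps everything it already received.
            failed = True
            offset, state = checkpoint
            continue
        state += records[offset]
        sink.append(records[offset])
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, state)
    return state, sink


state, sink = run([1, 2, 3, 4, 5], checkpoint_every=2, fail_at=3)
print(state)  # 15
print(sink)   # [1, 2, 3, 3, 4, 5]
```

The record 3 is processed twice, but the sum is still 15 because the state was
rolled back to the checkpoint before the replay. The sink sees 3 twice, which
is the at-least-once behaviour described above.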

This issue can be addressed if the sink provides transactional writes
(previous writes can be undone) or if the writes are idempotent (applying
them several times does not change the result). Transactional support would
need to be integrated with Flink's SinkFunction, and Hadoop OutputFormats
do not provide such an integration. I am not familiar with the details of
DynamoDB, but you would need to implement a SinkFunction with transactional
support or use idempotent writes if you want to achieve exactly-once results.
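
As a rough illustration of the idempotent option, the following plain Python
sketch uses a dict as a stand-in for the external table. It does not use the
Flink or DynamoDB APIs, and the helper names and sample keys are hypothetical.
It replays one write, as could happen after a recovery, and compares a keyed
overwrite with an additive update.

```python
def apply_upserts(table, writes):
    """Idempotent: each write overwrites the value stored under its key."""
    for key, value in writes:
        table[key] = value
    return table


def apply_increments(table, writes):
    """Not idempotent: each write adds to the value stored under its key."""
    for key, delta in writes:
        table[key] = table.get(key, 0) + delta
    return table


writes = [("user-1", 10), ("user-2", 7)]
replayed = writes + writes[1:]  # the last write is delivered a second time

upserted = apply_upserts({}, replayed)
incremented = apply_increments({}, replayed)
print(upserted)     # {'user-1': 10, 'user-2': 7}
print(incremented)  # {'user-1': 10, 'user-2': 14}
```

With overwrites, the duplicate delivery leaves the table unchanged, so
at-least-once delivery still yields an exactly-once result. With increments,
the duplicate is counted twice.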

Best, Fabian

2016-03-12 9:57 GMT+01:00 Josh <jofo90@gmail.com>:

> Thanks Nick, that sounds good. I would still like to have an understanding
> of what determines the processing guarantee though. Say I use a DynamoDB
> Hadoop OutputFormat with Flink, how do I know what guarantee I have? And if
> it's at-least-once, is there a way to adapt it to achieve exactly-once?
> Thanks,
> Josh
> On 12 Mar 2016, at 02:46, Nick Dimiduk <ndimiduk@gmail.com> wrote:
> Pretty much anything you can write to from a Hadoop MapReduce program can
> be a Flink destination. Just plug in the OutputFormat and go.
> Re: output semantics, your mileage may vary. Flink should do you fine for
> at least once.
> On Friday, March 11, 2016, Josh <jofo90@gmail.com> wrote:
>> Hi all,
>> I want to use an external data store (DynamoDB) as a sink with Flink. It
>> looks like there's no connector for Dynamo at the moment, so I have two
>> questions:
>> 1. Is it easy to write my own sink for Flink and are there any docs
>> around how to do this?
>> 2. If I do this, will I still be able to have Flink's processing
>> guarantees? I.e. Can I be sure that every tuple has contributed to the
>> DynamoDB state either at-least-once or exactly-once?
>> Thanks for any advice,
>> Josh

