kafka-dev mailing list archives

From ijuma <...@git.apache.org>
Subject [GitHub] kafka-site pull request #60: Update delivery semantics section for KIP-98
Date Thu, 15 Jun 2017 00:32:08 GMT
Github user ijuma commented on a diff in the pull request:

    https://github.com/apache/kafka-site/pull/60#discussion_r122096607
  
    --- Diff: 0110/design.html ---
    @@ -264,21 +264,22 @@
         messages have a primary key and so the updates are idempotent (receiving the same message twice just overwrites a record with another copy of itself).
         </ol>
         <p>
    -    So what about exactly once semantics (i.e. the thing you actually want)? The limitation here is not actually a feature of the messaging system but rather the need to coordinate the consumer's position with
    -    what is actually stored as output. The classic way of achieving this would be to introduce a two-phase commit between the storage for the consumer position and the storage of the consumers output. But this can be
    -    handled more simply and generally by simply letting the consumer store its offset in the same place as its output. This is better because many of the output systems a consumer might want to write to will not
    -    support a two-phase commit. As an example of this, our Hadoop ETL that populates data in HDFS stores its offsets in HDFS with the data it reads so that it is guaranteed that either data and offsets are both updated
    -    or neither is. We follow similar patterns for many other data systems which require these stronger semantics and for which the messages do not have a primary key to allow for deduplication.
    -    <p>
    -    A special case is when the output system is just another Kafka topic (e.g. in a Kafka Streams application). Here we can leverage the new transactional producer capabilities in 0.11.0.0 that were mentioned above.
    -    Since the consumer's position is stored as a message in a topic, we can ensure that that topic is included in the same transaction as the output topics receiving the processed data. If the transaction is aborted,
    -    the consumer's position will revert to its old value and none of the output data will be visible to consumers. To enable this, consumers support an "isolation level" to achieve this. In the default
    -    "read_uncommitted" mode, all messages are visible to consumers even if they were part of an aborted transaction, but in "read_committed" mode, the consumer will only return data from transactions which were committed
    -    (and any messages which were not part of any transaction).
    -    <p>
    -    So effectively Kafka guarantees at-least-once delivery by default, and allows the user to implement at-most-once delivery by disabling retries on the producer and committing its offset prior to processing a batch of
    -    messages. Exactly-once delivery is supported when processing messages between Kafka topics, such as in Kafka Streams applications. Exactly-once delivery for other destination storage system generally requires
    -    cooperation with that system, but Kafka provides the offset which makes implementing this straight-forward.
    +    So what about exactly once semantics (i.e. the thing you actually want)? When consuming from a Kafka topic and producing to another topic (as in a <a href="https://kafka.apache.org/documentation/streams">Kafka Streams</a>
    +    application), we can leverage the new transactional producer capabilities in 0.11.0.0 that were mentioned above. The consumer's position is stored as a message in a topic, so we can write the offset to Kafka in the
    +    same transaction as the output topics receiving the processed data. If the transaction is aborted, the consumer's position will revert to its old value and none of the output data will be visible to consumers. Consumers support an "isolation level" configuration
    +    to achieve this. In the default "read_uncommitted" mode, all messages are visible to consumers even if they were part of an aborted transaction, but in "read_committed" mode, the consumer will only return data from
    +    transactions which were committed (and any messages which were not part of any transaction).
    +    <p>
    +    When writing to an external system, the limitation is in the need to coordinate the consumer's position with what is actually stored as output. The classic way of achieving this would be to introduce a two-phase
    +    commit between the storage for the consumer position and the storage of the consumer's output. But this can be handled more simply and generally by simply letting the consumer store its offset in the same place as
    +    its output. This is better because many of the output systems a consumer might want to write to will not support a two-phase commit. As an example of this, our Hadoop ETL that populates data in HDFS stores its
    +    offsets in HDFS with the data it reads so that it is guaranteed that either data and offsets are both updated or neither is. We follow similar patterns for many other data systems which require these stronger
    +    semantics and for which the messages do not have a primary key to allow for deduplication.
    +    <p>
    +    So effectively Kafka supports exactly-once delivery in <a href="https://kafka.apache.org/documentation/streams">Kafka Streams</a>, and the transactional producer/consumer can be used generally to provide
    +    exactly-once delivery when transferring and processing data between Kafka topics. Exactly-once delivery for other destination storage systems generally requires cooperation with that system, but Kafka provides the
    +    offset which makes implementing this straight-forward. Otherwise, Kafka guarantees at-least-once delivery by default, and allows the user to implement at-most-once delivery by disabling retries on the producer and
    --- End diff --
    
    A bit optimistic: "straight-forward" :)
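    For context, the read-process-write loop the new text describes can be sketched with the 0.11.0.0 Java client APIs roughly as follows. This is a minimal sketch, not the documented wording: the broker address, topic names, group id, transactional id, and the `process` helper are all illustrative, and it needs the `kafka-clients` dependency plus a running broker:

    ```java
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.TopicPartition;

    public class ExactlyOnceSketch {

        public static void main(String[] args) {
            Properties pp = new Properties();
            pp.put("bootstrap.servers", "localhost:9092");     // illustrative
            pp.put("transactional.id", "my-transactional-id"); // illustrative; enables transactions
            pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            KafkaProducer<String, String> producer = new KafkaProducer<>(pp);
            producer.initTransactions();

            Properties cp = new Properties();
            cp.put("bootstrap.servers", "localhost:9092");
            cp.put("group.id", "my-group");              // illustrative
            cp.put("enable.auto.commit", "false");       // offsets are committed via the transaction instead
            cp.put("isolation.level", "read_committed"); // hide data from aborted transactions
            cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
            consumer.subscribe(Collections.singletonList("input-topic"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                if (records.isEmpty()) continue;
                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> r : records) {
                        producer.send(new ProducerRecord<>("output-topic", r.key(), process(r.value())));
                        // Record the next offset to consume for this partition.
                        offsets.put(new TopicPartition(r.topic(), r.partition()),
                                    new OffsetAndMetadata(r.offset() + 1));
                    }
                    // The consumer's position is written in the same transaction as the
                    // output, so both become visible atomically, or neither does.
                    producer.sendOffsetsToTransaction(offsets, "my-group");
                    producer.commitTransaction();
                } catch (KafkaException e) {
                    // Position reverts; output stays invisible to read_committed readers.
                    producer.abortTransaction();
                }
            }
        }

        private static String process(String value) {
            return value.toUpperCase(); // placeholder transformation
        }
    }
    ```

    Kafka Streams wires up essentially this pattern internally when `processing.guarantee` is set to `exactly_once`.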


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
