kafka-commits mailing list archives

From ij...@apache.org
Subject [05/14] kafka-site git commit: Update site for 1.0.0 release
Date Wed, 01 Nov 2017 12:58:45 GMT
http://git-wip-us.apache.org/repos/asf/kafka-site/blob/cb024c13/100/streams/core-concepts.html
----------------------------------------------------------------------
diff --git a/100/streams/core-concepts.html b/100/streams/core-concepts.html
new file mode 100644
index 0000000..baf8d9b
--- /dev/null
+++ b/100/streams/core-concepts.html
@@ -0,0 +1,187 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+<script><!--#include virtual="../js/templateData.js" --></script>
+
+<script id="content-template" type="text/x-handlebars-template">
+    <h1>Core Concepts</h1>
+
+    <p>
+        Kafka Streams is a client library for processing and analyzing data stored in Kafka.
+        It builds upon important stream processing concepts such as properly distinguishing
between event time and processing time, windowing support, and simple yet efficient management
and real-time querying of application state.
+    </p>
+    <p>
+        Kafka Streams has a <b>low barrier to entry</b>: You can quickly write
and run a small-scale proof-of-concept on a single machine; and you only need to run additional
instances of your application on multiple machines to scale up to high-volume production workloads.
+        Kafka Streams transparently handles the load balancing of multiple instances of the
same application by leveraging Kafka's parallelism model.
+    </p>
+    <p>
+        Some highlights of Kafka Streams:
+    </p>
+
+    <ul>
+        <li>Designed as a <b>simple and lightweight client library</b>,
which can be easily embedded in any Java application and integrated with any existing packaging,
deployment and operational tools that users have for their streaming applications.</li>
+        <li>Has <b>no external dependencies on systems other than Apache Kafka
itself</b> as the internal messaging layer; notably, it uses Kafka's partitioning model
to horizontally scale processing while maintaining strong ordering guarantees.</li>
+        <li>Supports <b>fault-tolerant local state</b>, which enables very
fast and efficient stateful operations like windowed joins and aggregations.</li>
+        <li>Supports <b>exactly-once</b> processing semantics to guarantee
that each record will be processed once and only once even when there is a failure on either
Streams clients or Kafka brokers in the middle of processing.</li>
+        <li>Employs <b>one-record-at-a-time processing</b> to achieve millisecond
processing latency, and supports <b>event-time based windowing operations</b>
with late arrival of records.</li>
+        <li>Offers necessary stream processing primitives, along with a <b>high-level
Streams DSL</b> and a <b>low-level Processor API</b>.</li>
+
+    </ul>
+
+    <p>
+        We first summarize the key concepts of Kafka Streams.
+    </p>
+
+    <h3><a id="streams_topology" href="#streams_topology">Stream Processing Topology</a></h3>
+
+    <ul>
+        <li>A <b>stream</b> is the most important abstraction provided
by Kafka Streams: it represents an unbounded, continuously updating data set. A stream is
an ordered, replayable, and fault-tolerant sequence of immutable data records, where a <b>data
record</b> is defined as a key-value pair.</li>
+        <li>A <b>stream processing application</b> is any program that
makes use of the Kafka Streams library. It defines its computational logic through one or
more <b>processor topologies</b>, where a processor topology is a graph of stream
processors (nodes) that are connected by streams (edges).</li>
+        <li>A <b>stream processor</b> is a node in the processor topology; it represents a processing step to transform data in streams by receiving one input record at a time from its upstream processors in the topology, applying its operation to it, and possibly producing one or more output records for its downstream processors.</li>
+    </ul>
+
+    There are two special processors in the topology:
+
+    <ul>
+        <li><b>Source Processor</b>: A source processor is a special type
of stream processor that does not have any upstream processors. It produces an input stream
to its topology from one or multiple Kafka topics by consuming records from these topics and
forwarding them to its down-stream processors.</li>
+        <li><b>Sink Processor</b>: A sink processor is a special type of
stream processor that does not have down-stream processors. It sends any received records
from its up-stream processors to a specified Kafka topic.</li>
+    </ul>
+
+    Note that normal processor nodes can also access other remote systems while processing the current record. The processed results can therefore either be streamed back into Kafka or written to an external system.
+
+    <img class="centered" src="/{{version}}/images/streams-architecture-topology.jpg"
style="width:400px">
+
+    <p>
+        Kafka Streams offers two ways to define the stream processing topology: the <a
href="/{{version}}/documentation/streams/developer-guide#streams_dsl"><b>Kafka Streams
DSL</b></a> provides
+        the most common data transformation operations such as <code>map</code>,
<code>filter</code>, <code>join</code> and <code>aggregations</code>
out of the box; the lower-level <a href="/{{version}}/documentation/streams/developer-guide#streams_processor"><b>Processor API</b></a> allows
+        developers to define and connect custom processors as well as to interact with <a href="#streams_state">state stores</a>.
+    </p>
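+
+    <p>
+        As a minimal sketch of what such a topology can look like (the application id, broker address, topic names, and word-splitting logic below are illustrative assumptions, not part of this documentation), the Streams DSL version of a simple word-count application might be written as follows:
+    </p>
+
+    <pre class="brush: java;">
+        import org.apache.kafka.common.serialization.Serdes;
+        import org.apache.kafka.streams.KafkaStreams;
+        import org.apache.kafka.streams.StreamsBuilder;
+        import org.apache.kafka.streams.StreamsConfig;
+        import org.apache.kafka.streams.kstream.KStream;
+        import org.apache.kafka.streams.kstream.KTable;
+        import org.apache.kafka.streams.kstream.Produced;
+
+        import java.util.Arrays;
+        import java.util.Properties;
+
+        public class WordCountSketch {
+            public static void main(final String[] args) {
+                final Properties props = new Properties();
+                props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-sketch");      // hypothetical application id
+                props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // assumed broker address
+                props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
+                props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
+
+                final StreamsBuilder builder = new StreamsBuilder();
+                // Source processor: consume the (hypothetical) input topic as a record stream.
+                final KStream&lt;String, String&gt; textLines = builder.stream("plaintext-input");
+                // Intermediate processors: split each line into words and count occurrences per word.
+                final KTable&lt;String, Long&gt; wordCounts = textLines
+                    .flatMapValues(value -&gt; Arrays.asList(value.toLowerCase().split("\\W+")))
+                    .groupBy((key, word) -&gt; word)
+                    .count();
+                // Sink processor: write the continuously updated counts to a (hypothetical) output topic.
+                wordCounts.toStream().to("wordcount-output", Produced.with(Serdes.String(), Serdes.Long()));
+
+                final KafkaStreams streams = new KafkaStreams(builder.build(), props);
+                streams.start();
+            }
+        }
+    </pre>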
+
+    <p>
+        A processor topology is merely a logical abstraction for your stream processing code.
+        At runtime, the logical topology is instantiated and replicated inside the application
for parallel processing (see <a href="#streams_architecture_tasks"><b>Stream Partitions
and Tasks</b></a> for details).
+    </p>
+
+    <h3><a id="streams_time" href="#streams_time">Time</a></h3>
+
+    <p>
+        A critical aspect in stream processing is the notion of <b>time</b>,
and how it is modeled and integrated.
+        For example, some operations such as <b>windowing</b> are defined based
on time boundaries.
+    </p>
+    <p>
+        Common notions of time in streams are:
+    </p>
+
+    <ul>
+        <li><b>Event time</b> - The point in time when an event or data
record occurred, i.e. was originally created "at the source". <b>Example:</b>
If the event is a geo-location change reported by a GPS sensor in a car, then the associated
event-time would be the time when the GPS sensor captured the location change.</li>
+        <li><b>Processing time</b> - The point in time when the event or
data record happens to be processed by the stream processing application, i.e. when the record
is being consumed. The processing time may be milliseconds, hours, or even days later than
the original event time. <b>Example:</b> Imagine an analytics application that
reads and processes the geo-location data reported from car sensors to present it to a fleet
management dashboard. Here, processing-time in the analytics application might be milliseconds
or seconds (e.g. for real-time pipelines based on Apache Kafka and Kafka Streams) or hours
(e.g. for batch pipelines based on Apache Hadoop or Apache Spark) after event-time.</li>
+        <li><b>Ingestion time</b> - The point in time when an event or
data record is stored in a topic partition by a Kafka broker. The difference to event time
is that this ingestion timestamp is generated when the record is appended to the target topic
by the Kafka broker, not when the record is created "at the source". The difference to processing
time is that processing time is when the stream processing application processes the record.
<b>For example,</b> if a record is never processed, there is no notion of processing
time for it, but it still has an ingestion time.</li>
+    </ul>
+    <p>
+        The choice between event-time and ingestion-time is actually made through the configuration
of Kafka (not Kafka Streams): From Kafka 0.10.x onwards, timestamps are automatically embedded
into Kafka messages. Depending on Kafka's configuration these timestamps represent event-time
or ingestion-time. The respective Kafka configuration setting can be specified on the broker
level or per topic. The default timestamp extractor in Kafka Streams will retrieve these embedded
timestamps as-is. Hence, the effective time semantics of your application depend on the effective
Kafka configuration for these embedded timestamps.
+    </p>
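+    <p>
+        As an illustration (the concrete topic this would be applied to is an assumption, not something prescribed here), this choice maps to the standard timestamp-type settings, which can be made broker-wide or per topic:
+    </p>
+    <pre class="brush: text;">
+        # Broker-wide default (server.properties): CreateTime keeps the producer-supplied event-time,
+        # LogAppendTime overwrites it with the broker's ingestion-time.
+        log.message.timestamp.type=CreateTime
+
+        # Per-topic override for a single topic:
+        message.timestamp.type=LogAppendTime
+    </pre>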
+    <p>
+        Kafka Streams assigns a <b>timestamp</b> to every data record via the
<code>TimestampExtractor</code> interface.
+        These per-record timestamps describe the progress of a stream with regard to time and are leveraged by time-dependent operations such as window operations.
+        As a result, this time will only advance when a new record arrives at the processor.
+        We call this data-driven time the <b>stream time</b> of the application to differentiate it from the <b>wall-clock time</b> at which the application is actually executing.
+        Concrete implementations of the <code>TimestampExtractor</code> interface then provide different semantics for stream time.
+        For example, retrieving or computing timestamps based on the actual contents of data records, such as an embedded timestamp field, provides event-time semantics, while returning the current wall-clock time yields processing-time semantics.
+        Developers can thus enforce different notions of time depending on their business
needs.
+    </p>
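+
+    <p>
+        As a hedged sketch of such an implementation (the <code>SensorReading</code> domain type and its field are hypothetical, introduced only for illustration), an extractor that provides event-time semantics from an embedded field might look roughly like this:
+    </p>
+
+    <pre class="brush: java;">
+        import org.apache.kafka.clients.consumer.ConsumerRecord;
+        import org.apache.kafka.streams.processor.TimestampExtractor;
+
+        // Hypothetical domain type that carries its own event-time, captured "at the source".
+        class SensorReading {
+            private final long capturedAtMs;
+            SensorReading(final long capturedAtMs) { this.capturedAtMs = capturedAtMs; }
+            long capturedAtMs() { return capturedAtMs; }
+        }
+
+        public class SensorReadingTimestampExtractor implements TimestampExtractor {
+            @Override
+            public long extract(final ConsumerRecord&lt;Object, Object&gt; record, final long previousTimestamp) {
+                final Object value = record.value();
+                if (value instanceof SensorReading) {
+                    // Event-time semantics: use the timestamp embedded in the record's payload.
+                    return ((SensorReading) value).capturedAtMs();
+                }
+                // Otherwise fall back to the timestamp that the Kafka message itself carries.
+                return record.timestamp();
+            }
+        }
+    </pre>
+
+    <p>
+        Returning the current wall-clock time instead (e.g. <code>System.currentTimeMillis()</code>) would give the processing-time semantics described above.
+    </p>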
+
+    <p>
+        Finally, whenever a Kafka Streams application writes records to Kafka, it will also assign timestamps to these new records. The way the timestamps are assigned depends on the context:
+    </p>
+
+    <ul>
+        <li> When new output records are generated via processing some input record,
for example, <code>context.forward()</code> triggered in the <code>process()</code>
function call, output record timestamps are inherited from input record timestamps directly.</li>
+        <li> When new output records are generated via periodic functions such as <code>Punctuator#punctuate()</code>, the output record timestamp is defined as the current internal time (obtained through <code>context.timestamp()</code>) of the stream task (see the sketch after this list).</li>
+        <li> For aggregations, the timestamp of a resulting aggregate update record
will be that of the latest arrived input record that triggered the update.</li>
+    </ul>
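+
+    <p>
+        The following sketch illustrates the first two cases using the low-level Processor API (topic and processor names are hypothetical): records forwarded from <code>process()</code> inherit the input record's timestamp, while records forwarded from the scheduled <code>Punctuator</code> take the task's current stream time:
+    </p>
+
+    <pre class="brush: java;">
+        import org.apache.kafka.streams.Topology;
+        import org.apache.kafka.streams.processor.Processor;
+        import org.apache.kafka.streams.processor.ProcessorContext;
+        import org.apache.kafka.streams.processor.PunctuationType;
+
+        public class TimestampAssignmentSketch {
+
+            public static class UppercaseProcessor implements Processor&lt;String, String&gt; {
+                private ProcessorContext context;
+
+                @Override
+                public void init(final ProcessorContext context) {
+                    this.context = context;
+                    // Case 2: records emitted from the punctuation callback are stamped with
+                    // the task's current internal (stream) time.
+                    context.schedule(10_000L, PunctuationType.STREAM_TIME,
+                        timestamp -&gt; context.forward("heartbeat", "stream time is now " + timestamp));
+                }
+
+                @Override
+                public void process(final String key, final String value) {
+                    // Case 1: this output record inherits the timestamp of the input record.
+                    context.forward(key, value.toUpperCase());
+                }
+
+                @Override
+                public void punctuate(final long timestamp) { }  // superseded by Punctuator, left as a no-op
+
+                @Override
+                public void close() { }
+            }
+
+            public static Topology buildTopology() {
+                final Topology topology = new Topology();
+                topology.addSource("Source", "input-topic");                            // hypothetical topic
+                topology.addProcessor("Uppercase", UppercaseProcessor::new, "Source");
+                topology.addSink("Sink", "output-topic", "Uppercase");                  // hypothetical topic
+                return topology;
+            }
+        }
+    </pre>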
+
+    <h3><a id="streams_state" href="#streams_state">States</a></h3>
+
+    <p>
+        Some stream processing applications don't require state, which means the processing of a message is independent of
+        the processing of all other messages.
+        However, being able to maintain state opens up many possibilities for sophisticated
stream processing applications: you
+        can join input streams, or group and aggregate data records. Many such stateful operators
are provided by the <a href="/{{version}}/documentation/streams/developer-guide#streams_dsl"><b>Kafka
Streams DSL</b></a>.
+    </p>
+    <p>
+        Kafka Streams provides so-called <b>state stores</b>, which can be used
by stream processing applications to store and query data.
+        This is an important capability when implementing stateful operations.
+        Every task in Kafka Streams embeds one or more state stores that can be accessed
via APIs to store and query data required for processing.
+        These state stores can either be a persistent key-value store, an in-memory hashmap,
or another convenient data structure.
+        Kafka Streams offers fault-tolerance and automatic recovery for local state stores.
+    </p>
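+    <p>
+        For example (a sketch only; the topic name, store name, and the assumption of default String serdes are illustrative), a count aggregation can be materialized into a named state store:
+    </p>
+    <pre class="brush: java;">
+        import org.apache.kafka.common.utils.Bytes;
+        import org.apache.kafka.streams.StreamsBuilder;
+        import org.apache.kafka.streams.kstream.KTable;
+        import org.apache.kafka.streams.kstream.Materialized;
+        import org.apache.kafka.streams.state.KeyValueStore;
+
+        public class PageViewCountsSketch {
+            public static StreamsBuilder buildCounts() {
+                final StreamsBuilder builder = new StreamsBuilder();
+                // Count records per key and materialize the result into a named,
+                // fault-tolerant local state store called "page-view-counts".
+                final KTable&lt;String, Long&gt; counts = builder
+                    .&lt;String, String&gt;stream("page-views")                   // hypothetical input topic
+                    .groupByKey()
+                    .count(Materialized.&lt;String, Long, KeyValueStore&lt;Bytes, byte[]&gt;&gt;as("page-view-counts"));
+                return builder;
+            }
+        }
+    </pre>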
+    <p>
+        Kafka Streams allows direct read-only queries of the state stores by methods, threads,
processes or applications external to the stream processing application that created the state
stores. This is provided through a feature called <b>Interactive Queries</b>.
All stores are named and Interactive Queries exposes only the read operations of the underlying
implementation.
+    </p>
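+    <p>
+        As a sketch (reusing the hypothetical "page-view-counts" store name from the previous example, and assuming a reference to the already-running <code>KafkaStreams</code> instance), an Interactive Query is a plain read-only lookup:
+    </p>
+    <pre class="brush: java;">
+        import org.apache.kafka.streams.KafkaStreams;
+        import org.apache.kafka.streams.state.QueryableStoreTypes;
+        import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;
+
+        public class InteractiveQuerySketch {
+            // 'streams' is the started KafkaStreams instance that owns the local store.
+            public static Long lookupCount(final KafkaStreams streams, final String key) {
+                final ReadOnlyKeyValueStore&lt;String, Long&gt; store =
+                    streams.store("page-view-counts", QueryableStoreTypes.&lt;String, Long&gt;keyValueStore());
+                return store.get(key);  // read-only point lookup; null if the key has not been seen
+            }
+        }
+    </pre>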
+    <br>
+
+    <h2><a id="streams_processing_guarantee" href="#streams_processing_guarantee">Processing
Guarantees</a></h2>
+
+    <p>
+        In stream processing, one of the most frequently asked questions is "does my stream processing system guarantee that each record is processed once and only once, even if some failures are encountered in the middle of processing?"
+        Failing to guarantee exactly-once stream processing is a deal-breaker for many applications that cannot tolerate any data loss or data duplicates, and in that case a batch-oriented framework is usually used in addition
+        to the stream processing pipeline, an approach known as the <a href="http://lambda-architecture.net/">Lambda Architecture</a>.
+        Prior to 0.11.0.0, Kafka only provided at-least-once delivery guarantees, and hence any stream processing system that leveraged it as the backend storage could not guarantee end-to-end exactly-once semantics.
+        In fact, even for those stream processing systems that claim to support exactly-once
processing, as long as they are reading from / writing to Kafka as the source / sink, their
applications cannot actually guarantee that
+        no duplicates will be generated throughout the pipeline.
+
+        Since the 0.11.0.0 release, Kafka has added support to allow its producers to send
messages to different topic partitions in a <a href="https://kafka.apache.org/documentation/#semantics">transactional
and idempotent manner</a>,
+        and Kafka Streams has hence added the end-to-end exactly-once processing semantics
by leveraging these features.
+        More specifically, it guarantees that for any record read from the source Kafka topics,
its processing results will be reflected exactly once in the output Kafka topic as well as
in the state stores for stateful operations.
+        Note that the key difference between Kafka Streams' end-to-end exactly-once guarantee and the guarantees claimed by other stream processing frameworks is that Kafka Streams tightly integrates with the underlying Kafka storage system and ensures that
+        commits on the input topic offsets, updates on the state stores, and writes to the output topics are completed atomically, instead of treating Kafka as an external system that may have side-effects.
+        For more details on how this is done inside Kafka Streams, see <a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-129%3A+Streams+Exactly-Once+Semantics">KIP-129</a>.
+
+        In order to achieve exactly-once semantics when running Kafka Streams applications,
users can simply set the <code>processing.guarantee</code> config value to <b>exactly_once</b>
(default value is <b>at_least_once</b>).
+        More details can be found in the <a href="/{{version}}/documentation#streamsconfigs"><b>Kafka
Streams Configs</b></a> section.
+    </p>
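+
+    <p>
+        A minimal sketch of that configuration (the application id and broker address are placeholder assumptions):
+    </p>
+
+    <pre class="brush: java;">
+        import java.util.Properties;
+        import org.apache.kafka.streams.StreamsConfig;
+
+        public class ExactlyOnceConfigSketch {
+            public static Properties buildConfig() {
+                final Properties props = new Properties();
+                props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-exactly-once-app");  // hypothetical application id
+                props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumed broker address
+                // Enable end-to-end exactly-once processing; the default is at_least_once.
+                props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
+                return props;
+            }
+        }
+    </pre>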
+
+    <div class="pagination">
+        <a href="/{{version}}/documentation/streams/developer-guide" class="pagination__btn
pagination__btn__prev">Previous</a>
+        <a href="/{{version}}/documentation/streams/architecture" class="pagination__btn
pagination__btn__next">Next</a>
+    </div>
+</script>
+
+<!--#include virtual="../../includes/_header.htm" -->
+<!--#include virtual="../../includes/_top.htm" -->
+<div class="content documentation documentation--current">
+    <!--#include virtual="../../includes/_nav.htm" -->
+    <div class="right">
+        <!--#include virtual="../../includes/_docs_banner.htm" -->
+        <ul class="breadcrumbs">
+            <li><a href="/documentation">Documentation</a></li>
+            <li><a href="/documentation/streams">Kafka Streams API</a></li>
+        </ul>
+        <div class="p-content"></div>
+    </div>
+</div>
+<!--#include virtual="../../includes/_footer.htm" -->
+<script>
+$(function() {
+  // Show selected style on nav item
+  $('.b-nav__streams').addClass('selected');
+
+  // Display docs subnav items
+  $('.b-nav__docs').parent().toggleClass('nav__item__with__subs--expanded');
+});
+</script>

