kafka-commits mailing list archives

From jkr...@apache.org
Subject svn commit: r1575098 - /kafka/site/081/design.html
Date Thu, 06 Mar 2014 23:35:34 GMT
Author: jkreps
Date: Thu Mar  6 23:35:34 2014
New Revision: 1575098

URL: http://svn.apache.org/r1575098
Log:
Give an example of a stream and compaction.

Modified:
    kafka/site/081/design.html

Modified: kafka/site/081/design.html
URL: http://svn.apache.org/viewvc/kafka/site/081/design.html?rev=1575098&r1=1575097&r2=1575098&view=diff
==============================================================================
--- kafka/site/081/design.html (original)
+++ kafka/site/081/design.html Thu Mar  6 23:35:34 2014
@@ -235,9 +235,23 @@ It is also important to optimize the lea
 
 One of the ways Kafka is different from traditional messaging systems is that it maintains historical log data beyond just the currently in-flight messages. This enables a set of usage patterns and system architectures that use the log as a kind of external commit log. Log compaction is a feature that helps support this kind of use case.
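 <p>
 Compaction is enabled on a per-topic basis. As a minimal sketch (the tool invocation and configuration names here are assumptions based on the 0.8.1 tooling, shown only for illustration), a compacted topic might be created like this:
 <pre>
 # create a topic whose log is compacted by key rather than discarded by age
 bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic my-topic \
     --partitions 1 --replication-factor 1 --config cleanup.policy=compact
 </pre>
 The broker must also have the log cleaner enabled (<code>log.cleaner.enable=true</code>) for compaction to run.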
 <p>
-So far we have described only the simpler approach to data retention where old log data is discarded after a fixed period of time or when the log reaches some predetermined size. This works well for temporal event data such as logging where each record stands alone. However an important class of data streams are the log of changes to keyed, mutable data (for example, the changes to a database). For these data streams maintaining only the most recent changes means that the log will have only a subset of the full data set, which limits its usefulness. However maintaining the full log of all updates for all time would require an unbounded amount of space. Log compaction addresses this issue by adding a data retention mechanism that allows pruning obsolete values by primary key rather than requiring that full segments be discarded all together.
+So far we have described only the simpler approach to data retention, where old log data is discarded after a fixed period of time or when the log reaches some predetermined size. This works well for temporal event data such as logging, where each record stands alone. However, an important class of data streams is the log of changes to keyed, mutable data (for example, the changes to a database table).
 <p>
-Let's start with a few examples of use cases that log updates, then we'll talk about how Kafka's log compaction supports these use cases.
+Let's discuss a concrete example of such a stream. Say we have a topic containing user email addresses; every time a user updates their email address we send a message to this topic using their user id as the primary key. Now say we send the following messages over some time period for a user with id 123, each message corresponding to a change in email address:
+<pre>
+	123 => bill@microsoft.com
+	        .
+	        .
+	        .
+	123 => bill@gatesfoundation.org
+	        .
+	        .
+	        .
+	123 => bill@gmail.com
+</pre>
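+<p>
+As an illustration, these keyed updates could be published with the Java producer API along the following lines (a minimal sketch; the topic name <code>user-emails</code> and the broker address are made up for the example):
+<pre>
+import java.util.Properties;
+import kafka.javaapi.producer.Producer;
+import kafka.producer.KeyedMessage;
+import kafka.producer.ProducerConfig;
+
+Properties props = new Properties();
+props.put("metadata.broker.list", "localhost:9092"); // assumed broker address
+props.put("serializer.class", "kafka.serializer.StringEncoder");
+props.put("key.serializer.class", "kafka.serializer.StringEncoder");
+
+Producer&lt;String, String&gt; producer = new Producer&lt;String, String&gt;(new ProducerConfig(props));
+// each update is keyed by the user id, so compaction can later prune obsolete values
+producer.send(new KeyedMessage&lt;String, String&gt;("user-emails", "123", "bill@gmail.com"));
+producer.close();
+</pre>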
+Log compaction gives us a more granular retention mechanism so that we are guaranteed to retain at least the last update for each primary key (e.g. <code>bill@gmail.com</code>). By doing this we guarantee that the log contains a full snapshot of the final value for every key, not just keys that changed recently. This means downstream consumers can restore their own state from this topic without us having to retain a complete log of all changes.
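+<p>
+A downstream consumer can rebuild its state by reading this topic from the beginning and keeping only the latest value seen for each key. Here is a minimal sketch using the high-level consumer API (the group id and topic name are again illustrative):
+<pre>
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Properties;
+import kafka.consumer.Consumer;
+import kafka.consumer.ConsumerConfig;
+import kafka.consumer.KafkaStream;
+import kafka.javaapi.consumer.ConsumerConnector;
+import kafka.message.MessageAndMetadata;
+
+Properties props = new Properties();
+props.put("zookeeper.connect", "localhost:2181");
+props.put("group.id", "email-cache-restore"); // illustrative group id
+props.put("auto.offset.reset", "smallest");   // start from the head of the log
+
+ConsumerConnector connector = Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
+KafkaStream&lt;byte[], byte[]&gt; stream =
+    connector.createMessageStreams(Collections.singletonMap("user-emails", 1))
+             .get("user-emails").get(0);
+
+// later updates for a key overwrite earlier ones, leaving the latest email per user;
+// note this loop blocks waiting for new messages, so a real application would bound it
+Map&lt;String, String&gt; latestEmail = new HashMap&lt;String, String&gt;();
+for (MessageAndMetadata&lt;byte[], byte[]&gt; m : stream)
+    latestEmail.put(new String(m.key()), new String(m.message()));
+</pre>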
+<p>
+Let's start by looking at a few use cases where this is useful, and then we'll see how it can be used.
 <ol>
 <li><i>Database change subscription</i>. It is often necessary to have a data set in multiple data systems, and often one of these systems is a database of some kind (either an RDBMS or perhaps a new-fangled key-value store). For example you might have a database, a cache, a search cluster, and a Hadoop cluster. Each change to the database will need to be reflected in the cache, the search cluster, and eventually in Hadoop. If one is only handling real-time updates, you only need the recent log; but if you want to be able to reload the cache or restore a failed search node, you may need a complete data set.
 <li><i>Event sourcing</i>. This is a style of application design which co-locates query processing with application design and uses a log of changes as the primary store for the application.


