kudu-commits mailing list archives

From t...@apache.org
Subject [1/2] kudu-site git commit: Publish commit(s) from site source repo: 768aba2 Add blog post for 1.3.1
Date Wed, 19 Apr 2017 17:11:03 GMT
Repository: kudu-site
Updated Branches:
  refs/heads/asf-site 8a656ba69 -> bcbdb4d84


http://git-wip-us.apache.org/repos/asf/kudu-site/blob/bcbdb4d8/blog/page/3/index.html
----------------------------------------------------------------------
diff --git a/blog/page/3/index.html b/blog/page/3/index.html
index ad5262c..d8593cd 100644
--- a/blog/page/3/index.html
+++ b/blog/page/3/index.html
@@ -111,6 +111,320 @@
 <!-- Articles -->
 <article>
   <header>
+    <h1 class="entry-title"><a href="/2016/08/31/intro-flume-kudu-sink.html">An Introduction to the Flume Kudu Sink</a></h1>
+    <p class="meta">Posted 31 Aug 2016 by Ara Abrahamian</p>
+  </header>
+  <div class="entry-content">
+    
+    <p>This post discusses the Kudu Flume Sink. First, I&#8217;ll give some background on why we considered
+using Kudu, what Flume does for us, and how Flume fits with Kudu in our project.</p>
+
+<h2 id="why-kudu">Why Kudu</h2>
+
+<p>Traditionally in the Hadoop ecosystem, we&#8217;ve dealt with various <em>batch processing</em> technologies such
+as MapReduce and the many libraries and tools built on top of it in various languages (Apache Pig,
+Apache Hive, Apache Oozie and many others). The main problem with this approach is that the whole
+data set must be reprocessed in batches, again and again, whenever new data gets added. Things get
+really complicated when several such tasks need to be chained together, or when the same data set
+needs to be processed in various ways by different jobs, while they all compete for the shared
+cluster resources.</p>
+
+<p>The opposite of this approach is <em>stream processing</em>: process the data as soon as it arrives, not
+in batches. Streaming systems such as Spark Streaming, Storm, Kafka Streams, and many others make
+this possible. But writing streaming services is not trivial: streaming systems are becoming
+more and more capable and support ever more complex constructs, but they are not yet easy to use. All
+queries and processes need to be carefully planned and implemented.</p>
+
+<p>To summarize, <em>batch processing</em> is:</p>
+
+<ul>
+  <li>file-based</li>
+  <li>a paradigm that processes large chunks of data as a group</li>
+  <li>high latency and high throughput, both for ingest and query</li>
+  <li>typically easy to program, but hard to orchestrate</li>
+  <li>well suited for writing ad-hoc queries, although they are typically high latency</li>
+</ul>
+
+<p>While <em>stream processing</em> is:</p>
+
+<ul>
+  <li>a totally different paradigm, which involves single events and time windows instead of large groups of events</li>
+  <li>still file-based and not a long-term database</li>
+  <li>not batch-oriented, but incremental</li>
+  <li>ultra-fast ingest and ultra-fast query (query results basically pre-calculated)</li>
+  <li>not so easy to program, relatively easy to orchestrate</li>
+  <li>impossible to write ad-hoc queries</li>
+</ul>
+
+<p>And a Kudu-based <em>near real-time</em> approach is:</p>
+
+<ul>
+  <li>flexible and expressive, thanks to SQL support via Apache Impala (incubating)</li>
+  <li>a table-oriented, mutable data store that feels like a traditional relational database</li>
+  <li>very easy to program; you can even pretend it&#8217;s good old MySQL</li>
+  <li>low-latency and relatively high throughput, both for ingest and query</li>
+</ul>
+
+<p>At Argyle Data, we&#8217;re dealing with complex fraud detection scenarios. We need to ingest massive
+amounts of data, run machine learning algorithms and generate reports. When we created our current
+architecture two years ago we decided to opt for a database as the backbone of our system. That
+database is Apache Accumulo. It&#8217;s a key-value database which runs on top of Hadoop HDFS,
+quite similar to HBase but with some important improvements such as cell-level security and
+easier deployment and management. To enable querying of this data for complex reporting and
+analytics, we used Presto, a distributed query engine with a pluggable architecture open-sourced
+by Facebook. We wrote a connector for it to let it run queries against the Accumulo database. This
+architecture has served us well, but there were a few problems:</p>
+
+<ul>
+  <li>we need to ingest even more massive volumes of data in real time</li>
+  <li>we need to perform complex machine-learning calculations on even larger data sets</li>
+  <li>we need to support ad-hoc queries, plus long-term data warehouse functionality</li>
+</ul>
+
+<p>So, we&#8217;ve started gradually moving the core machine-learning pipeline to a streaming-based
+solution. This way we can ingest and process larger data sets faster, in real time. But then how
+would we take care of ad-hoc queries and long-term persistence? This is where Kudu comes in. While
+the machine learning pipeline ingests and processes real-time data, we store a copy of the same
+ingested data in Kudu for long-term access and ad-hoc queries. Kudu is our <em>data warehouse</em>. By
+using Kudu and Impala, we can retire our in-house Presto connector and rely on Impala&#8217;s
+super-fast query engine.</p>
+
+<p>But how would we make sure data is reliably ingested into the streaming pipeline <em>and</em> the
+Kudu-based data warehouse? This is where Apache Flume comes in.</p>
+
+<h2 id="why-flume">Why Flume</h2>
+
+<p>According to its <a href="http://flume.apache.org/">website</a>, &#8220;Flume is a distributed, reliable, and
+available service for efficiently collecting, aggregating, and moving large amounts of log data.
+It has a simple and flexible architecture based on streaming data flows. It is robust and fault
+tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.&#8221; As you
+can see, Hadoop is not mentioned anywhere, yet Flume is typically used for ingesting data into
+Hadoop clusters.</p>
+
+<p><img src="https://blogs.apache.org/flume/mediaresource/ab0d50f6-a960-42cc-971e-3da38ba3adad" alt="png" /></p>
+
+<p>Flume has an extensible architecture. An instance of Flume, called an <em>agent</em>, can have multiple
+<em>channels</em>, with each having multiple <em>sources</em> and <em>sinks</em> of various types. Sources queue data
+in channels, which in turn write out data to sinks. Such <em>pipelines</em> can be chained together to
+create even more complex ones. There may be more than one agent, and agents can be configured to
+support failover and recovery.</p>
+
+<p>Flume comes with a bunch of built-in types of channels, sources and sinks. The memory channel is the
+default (an in-memory queue with no persistence to disk), but other options such as Kafka- and
+file-based channels are also provided. On the source side, Avro, JMS, Thrift, and spooling-directory
+sources are some of the built-in ones. Flume also ships with many sinks, including sinks for writing
+data to HDFS, HBase, Hive, Kafka, as well as to other Flume agents.</p>
+
+<p>In the rest of this post I&#8217;ll go over the Kudu Flume sink and show you how to configure Flume to
+write ingested data to a Kudu table. The sink has been part of the Kudu distribution since the 0.8
+release and the source code can be found <a href="https://github.com/apache/kudu/tree/master/java/kudu-flume-sink">here</a>.</p>
+
+<h2 id="configuring-the-kudu-flume-sink">Configuring the Kudu Flume Sink</h2>
+
+<p>Here is a sample Flume configuration file:</p>
+
+<pre><code>agent1.sources  = source1
+agent1.channels = channel1
+agent1.sinks = sink1
+
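+# Source: run vmstat and emit one Flume event per output line.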
+agent1.sources.source1.type = exec
+agent1.sources.source1.command = /usr/bin/vmstat 1
+agent1.sources.source1.channels = channel1
+
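+# Channel: buffer events in memory between the source and the sink.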
+agent1.channels.channel1.type = memory
+agent1.channels.channel1.capacity = 10000
+agent1.channels.channel1.transactionCapacity = 1000
+
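+# Sink: write each event to the "stats" table in Kudu.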
+agent1.sinks.sink1.type = org.apache.flume.sink.kudu.KuduSink
+agent1.sinks.sink1.masterAddresses = localhost
+agent1.sinks.sink1.tableName = stats
+agent1.sinks.sink1.channel = channel1
+agent1.sinks.sink1.batchSize = 50
+agent1.sinks.sink1.producer = org.apache.kudu.flume.sink.SimpleKuduEventProducer
+</code></pre>
+
+<p>We define a source called <code>source1</code> which simply executes a <code>vmstat</code> command to continuously generate
+virtual memory statistics for the machine and queues the events into an in-memory channel called
+<code>channel1</code>, which in turn is used for writing these events to a Kudu table called <code>stats</code>. We are using
+<code>org.apache.kudu.flume.sink.SimpleKuduEventProducer</code> as the producer. <code>SimpleKuduEventProducer</code> is
+the built-in and default producer, but it&#8217;s implemented as a showcase for how to write Flume
+events into Kudu tables. For any serious functionality we&#8217;d have to write a custom producer. We
+need to make this producer and the <code>KuduSink</code> class available to Flume. We can do that by simply
+copying the <code>kudu-flume-sink-&lt;VERSION&gt;.jar</code> file from the Kudu distribution to the
+<code>$FLUME_HOME/plugins.d/kudu-sink/lib</code> directory of the Flume installation. The jar file contains
+<code>KuduSink</code> and all of its dependencies (including the Kudu Java client classes).</p>
+
+<p>At a minimum, the Kudu Flume Sink needs to know where the Kudu masters are
+(<code>agent1.sinks.sink1.masterAddresses = localhost</code>) and which Kudu table Flume events should be
+written to (<code>agent1.sinks.sink1.tableName = stats</code>). The Kudu Flume Sink doesn&#8217;t create this
+table; it has to exist before the Kudu Flume Sink is started.</p>
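+
+<p>The table can be created through Impala or with the Kudu Java client. Below is a minimal
+sketch of the latter approach. It is an illustration of ours, not code from our project: it
+assumes a single <code>BINARY</code> column named <code>payload</code> that doubles as the primary
+key, and it uses hash partitioning only because Kudu requires a partitioning scheme for new
+tables:</p>
+
+<pre><code class="language-java">import java.util.Collections;
+
+import org.apache.kudu.ColumnSchema;
+import org.apache.kudu.Schema;
+import org.apache.kudu.Type;
+import org.apache.kudu.client.CreateTableOptions;
+import org.apache.kudu.client.KuduClient;
+
+public class CreateStatsTable {
+  public static void main(String[] args) throws Exception {
+    try (KuduClient client = new KuduClient.KuduClientBuilder("localhost").build()) {
+      // A single BINARY column holding the raw Flume event body; it also
+      // acts as the primary key in this sketch.
+      ColumnSchema payloadCol = new ColumnSchema.ColumnSchemaBuilder("payload", Type.BINARY)
+          .key(true)
+          .build();
+      Schema schema = new Schema(Collections.singletonList(payloadCol));
+      // Kudu requires a partitioning scheme, so hash-partition on the key.
+      CreateTableOptions options = new CreateTableOptions()
+          .addHashPartitions(Collections.singletonList("payload"), 4);
+      client.createTable("stats", schema, options);
+    }
+  }
+}
+</code></pre>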
+
+<p>You may also notice the <code>batchSize</code> parameter. The sink batches up to that many Flume
+events and flushes the entire batch to Kudu in one shot. Tuning <code>batchSize</code> properly can have a huge
+impact on the ingest performance of the Kudu cluster.</p>
+
+<p>Here is a complete list of KuduSink parameters:</p>
+
+<table>
+  <thead>
+    <tr>
+      <th>Parameter Name</th>
+      <th>Default</th>
+      <th>Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>masterAddresses</td>
+      <td>N/A</td>
+      <td>Comma-separated list of &#8220;host:port&#8221; pairs of the masters (port optional)</td>
+    </tr>
+    <tr>
+      <td>tableName</td>
+      <td>N/A</td>
+      <td>The name of the table in Kudu to write to</td>
+    </tr>
+    <tr>
+      <td>producer</td>
+      <td>org.apache.kudu.flume.sink.SimpleKuduEventProducer</td>
+      <td>The fully qualified class name of the Kudu event producer the sink should use</td>
+    </tr>
+    <tr>
+      <td>batchSize</td>
+      <td>100</td>
+      <td>Maximum number of events the sink should take from the channel per transaction, if available</td>
+    </tr>
+    <tr>
+      <td>timeoutMillis</td>
+      <td>30000</td>
+      <td>Timeout period for Kudu operations, in milliseconds</td>
+    </tr>
+    <tr>
+      <td>ignoreDuplicateRows</td>
+      <td>true</td>
+      <td>Whether to ignore errors indicating that we attempted to insert duplicate rows into Kudu</td>
+    </tr>
+  </tbody>
+</table>
+
+<p>Let&#8217;s take a look at the source code for the built-in producer class:</p>
+
+<pre><code class="language-java">public class SimpleKuduEventProducer implements KuduEventProducer {
+  private byte[] payload;
+  private KuduTable table;
+  private String payloadColumn;
+
+  public SimpleKuduEventProducer(){
+  }
+
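+  // Read the destination column name from the Flume configuration
+  // ("payload" if not set).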
+  @Override
+  public void configure(Context context) {
+    payloadColumn = context.getString("payloadColumn","payload");
+  }
+
+  @Override
+  public void configure(ComponentConfiguration conf) {
+  }
+
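+  // Receives each Flume event along with the table the sink writes to.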
+  @Override
+  public void initialize(Event event, KuduTable table) {
+    this.payload = event.getBody();
+    this.table = table;
+  }
+
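+  // Turn the current event into a single Kudu Insert that writes the raw
+  // event body to the payload column.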
+  @Override
+  public List&lt;Operation&gt; getOperations() throws FlumeException {
+    try {
+      Insert insert = table.newInsert();
+      PartialRow row = insert.getRow();
+      row.addBinary(payloadColumn, payload);
+
+      return Collections.singletonList((Operation) insert);
+    } catch (Exception e){
+      throw new FlumeException("Failed to create Kudu Insert object!", e);
+    }
+  }
+
+  @Override
+  public void close() {
+  }
+}
+</code></pre>
+
+<p><code>SimpleKuduEventProducer</code> implements the <code>org.apache.kudu.flume.sink.KuduEventProducer</code> interface,
+which itself looks like this:</p>
+
+<pre><code class="language-java">public interface KuduEventProducer extends Configurable, ConfigurableComponent {
+  /**
+   * Initialize the event producer.
+   * @param event to be written to Kudu
+   * @param table the KuduTable object used for creating Kudu Operation objects
+   */
+  void initialize(Event event, KuduTable table);
+
+  /**
+   * Get the operations that should be written out to Kudu as a result of this
+   * event. This list is written to Kudu using the Kudu client API.
+   * @return List of {@link org.kududb.client.Operation} which
+   * are written as such to Kudu
+   */
+  List&lt;Operation&gt; getOperations();
+
+  /*
+   * Clean up any state. This will be called when the sink is being stopped.
+   */
+  void close();
+}
+</code></pre>
+
+<p><code>public void configure(Context context)</code> is called when an instance of our producer is instantiated
+by the KuduSink. SimpleKuduEventProducer&#8217;s implementation looks for a producer parameter named
+<code>payloadColumn</code> and uses its value (&#8220;payload&#8221; if not overridden in the Flume configuration file) as the
+column which will hold the value of the Flume event payload. If you recall from above, we had
+configured the KuduSink to listen for events generated from the <code>vmstat</code> command. Each output line
+from that command will be stored as a new row containing a <code>payload</code> column in the <code>stats</code> table.
+<code>SimpleKuduEventProducer</code> has no configuration parameters beyond <code>payloadColumn</code>, but if it had any
+we would define them by prefixing them with <code>producer.</code> (<code>agent1.sinks.sink1.producer.parameter1</code> for
+example).</p>
+
+<p>The main producer logic resides in the <code>public List&lt;Operation&gt; getOperations()</code> method. In
+SimpleKuduEventProducer&#8217;s implementation we simply insert the binary body of the Flume event into
+the Kudu table. Here we call Kudu&#8217;s <code>newInsert()</code> to initiate an insert, but we could have used
+<code>Upsert</code> if updating an existing row was also an option; in fact, there&#8217;s another producer
+implementation available for doing just that: <code>SimpleKeyedKuduEventProducer</code>. In the real world you
+will most probably need to write your own custom producer, but you can base your implementation
+on the built-in ones.</p>
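+
+<p>To make that concrete, here is a rough sketch of what such a custom producer could look like.
+<code>VmstatKuduEventProducer</code> is a hypothetical class, and it assumes a differently structured
+<code>stats</code> table, with a <code>ts</code> key column plus <code>procs_r</code> and
+<code>procs_b</code> integer columns for the first two <code>vmstat</code> fields:</p>
+
+<pre><code class="language-java">public class VmstatKuduEventProducer implements KuduEventProducer {
+  private byte[] payload;
+  private KuduTable table;
+
+  @Override
+  public void configure(Context context) {
+    // Producer parameters (agent1.sinks.sink1.producer.*) would be read here.
+  }
+
+  @Override
+  public void configure(ComponentConfiguration conf) {
+  }
+
+  @Override
+  public void initialize(Event event, KuduTable table) {
+    this.payload = event.getBody();
+    this.table = table;
+  }
+
+  @Override
+  public List&lt;Operation&gt; getOperations() throws FlumeException {
+    try {
+      String line = new String(payload, "UTF-8").trim();
+      // vmstat also emits header lines; skip anything that does not start
+      // with a digit.
+      if (line.isEmpty() || !Character.isDigit(line.charAt(0))) {
+        return Collections.emptyList();
+      }
+      String[] fields = line.split("\\s+");
+      Insert insert = table.newInsert();
+      PartialRow row = insert.getRow();
+      row.addLong("ts", System.currentTimeMillis());
+      row.addInt("procs_r", Integer.parseInt(fields[0]));
+      row.addInt("procs_b", Integer.parseInt(fields[1]));
+      return Collections.singletonList((Operation) insert);
+    } catch (Exception e) {
+      throw new FlumeException("Failed to parse vmstat line!", e);
+    }
+  }
+
+  @Override
+  public void close() {
+  }
+}
+</code></pre>
+
+<p>The only Flume configuration change needed would be pointing
+<code>agent1.sinks.sink1.producer</code> at this class.</p>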
+
+<p>In the future, we plan to add more flexible event producer implementations so that creation of a
+custom event producer is not required to write data to Kudu. See
+<a href="https://gerrit.cloudera.org/#/c/4034/">here</a> for a work-in-progress generic event producer for
+Avro-encoded events.</p>
+
+<h2 id="conclusion">Conclusion</h2>
+
+<p>Kudu is a scalable data store which lets us ingest insane amounts of data per second. Apache Flume
+helps us aggregate data from various sources, and the Kudu Flume Sink lets us easily store
+the aggregated Flume events in Kudu. Together they enable us to create a data warehouse out of
+disparate sources.</p>
+
+<p><em>Ara Abrahamian is a software engineer at Argyle Data building fraud detection systems using
+sophisticated machine learning methods. Ara is the original author of the Flume Kudu Sink that
+is included in the Kudu distribution. You can follow him on Twitter at
+<a href="https://twitter.com/ara_e">@ara_e</a>.</em></p>
+
+
+    
+  </div>
+  <div class="read-full">
+    <a class="btn btn-info" href="/2016/08/31/intro-flume-kudu-sink.html">Read full post...</a>
+  </div>
+</article>
+
+
+
+<!-- Articles -->
+<article>
+  <header>
     <h1 class="entry-title"><a href="/2016/08/23/new-range-partitioning-features.html">New Range Partitioning Features in Kudu 0.10</a></h1>
     <p class="meta">Posted 23 Aug 2016 by Dan Burkert</p>
   </header>
@@ -195,27 +509,6 @@ covers ongoing development and news in the Apache Kudu project.</p>
 
 
 
-<!-- Articles -->
-<article>
-  <header>
-    <h1 class="entry-title"><a href="/2016/07/26/weekly-update.html">Apache Kudu Weekly Update July 26, 2016</a></h1>
-    <p class="meta">Posted 26 Jul 2016 by Jean-Daniel Cryans</p>
-  </header>
-  <div class="entry-content">
-    
-    <p>Welcome to the eighteenth edition of the Kudu Weekly Update. This weekly blog post
-covers ongoing development and news in the Apache Kudu project.</p>
-
-
-    
-  </div>
-  <div class="read-full">
-    <a class="btn btn-info" href="/2016/07/26/weekly-update.html">Read full post...</a>
-  </div>
-</article>
-
-
-
 <!-- Pagination links -->
 
 <nav>
@@ -236,6 +529,8 @@ covers ongoing development and news in the Apache Kudu project.</p>
     <h3>Recent posts</h3>
     <ul>
     
+      <li> <a href="/2017/04/19/apache-kudu-1-3-1-released.html">Apache Kudu 1.3.1 released</a> </li>
+    
       <li> <a href="/2017/03/20/apache-kudu-1-3-0-released.html">Apache Kudu 1.3.0 released</a> </li>
     
       <li> <a href="/2017/01/20/apache-kudu-1-2-0-released.html">Apache Kudu 1.2.0 released</a> </li>
@@ -264,8 +559,6 @@ covers ongoing development and news in the Apache Kudu project.</p>
     
       <li> <a href="/2016/08/08/weekly-update.html">Apache Kudu Weekly Update August 8th, 2016</a> </li>
     
-      <li> <a href="/2016/07/26/weekly-update.html">Apache Kudu Weekly Update July 26, 2016</a> </li>
-    
     </ul>
   </div>
 </div>

http://git-wip-us.apache.org/repos/asf/kudu-site/blob/bcbdb4d8/blog/page/4/index.html
----------------------------------------------------------------------
diff --git a/blog/page/4/index.html b/blog/page/4/index.html
index 32c54ff..c5efcda 100644
--- a/blog/page/4/index.html
+++ b/blog/page/4/index.html
@@ -111,6 +111,27 @@
 <!-- Articles -->
 <article>
   <header>
+    <h1 class="entry-title"><a href="/2016/07/26/weekly-update.html">Apache Kudu Weekly Update July 26, 2016</a></h1>
+    <p class="meta">Posted 26 Jul 2016 by Jean-Daniel Cryans</p>
+  </header>
+  <div class="entry-content">
+    
+    <p>Welcome to the eighteenth edition of the Kudu Weekly Update. This weekly blog post
+covers ongoing development and news in the Apache Kudu project.</p>
+
+
+    
+  </div>
+  <div class="read-full">
+    <a class="btn btn-info" href="/2016/07/26/weekly-update.html">Read full post...</a>
+  </div>
+</article>
+
+
+
+<!-- Articles -->
+<article>
+  <header>
     <h1 class="entry-title"><a href="/2016/07/25/asf-graduation.html">The Apache Software Foundation Announces Apache&reg; Kudu&trade; as a Top-Level Project</a></h1>
     <p class="meta">Posted 25 Jul 2016 by Jean-Daniel Cryans</p>
   </header>
@@ -203,27 +224,6 @@ of 0.9.0 are encouraged to update to the new version at their earliest convenien
 
 
 
-<!-- Articles -->
-<article>
-  <header>
-    <h1 class="entry-title"><a href="/2016/06/27/weekly-update.html">Apache Kudu (incubating) Weekly Update June 27, 2016</a></h1>
-    <p class="meta">Posted 27 Jun 2016 by Todd Lipcon</p>
-  </header>
-  <div class="entry-content">
-    
-    <p>Welcome to the fifteenth edition of the Kudu Weekly Update. This weekly blog post
-covers ongoing development and news in the Apache Kudu (incubating) project.</p>
-
-
-    
-  </div>
-  <div class="read-full">
-    <a class="btn btn-info" href="/2016/06/27/weekly-update.html">Read full post...</a>
-  </div>
-</article>
-
-
-
 <!-- Pagination links -->
 
 <nav>
@@ -244,6 +244,8 @@ covers ongoing development and news in the Apache Kudu (incubating) project.</p>
     <h3>Recent posts</h3>
     <ul>
     
+      <li> <a href="/2017/04/19/apache-kudu-1-3-1-released.html">Apache Kudu 1.3.1 released</a> </li>
+    
       <li> <a href="/2017/03/20/apache-kudu-1-3-0-released.html">Apache Kudu 1.3.0 released</a> </li>
     
       <li> <a href="/2017/01/20/apache-kudu-1-2-0-released.html">Apache Kudu 1.2.0 released</a> </li>
@@ -272,8 +274,6 @@ covers ongoing development and news in the Apache Kudu (incubating) project.</p>
     
       <li> <a href="/2016/08/08/weekly-update.html">Apache Kudu Weekly Update August 8th, 2016</a> </li>
     
-      <li> <a href="/2016/07/26/weekly-update.html">Apache Kudu Weekly Update July 26, 2016</a> </li>
-    
     </ul>
   </div>
 </div>

http://git-wip-us.apache.org/repos/asf/kudu-site/blob/bcbdb4d8/blog/page/5/index.html
----------------------------------------------------------------------
diff --git a/blog/page/5/index.html b/blog/page/5/index.html
index 3c5379e..ce7712a 100644
--- a/blog/page/5/index.html
+++ b/blog/page/5/index.html
@@ -111,6 +111,27 @@
 <!-- Articles -->
 <article>
   <header>
+    <h1 class="entry-title"><a href="/2016/06/27/weekly-update.html">Apache Kudu (incubating) Weekly Update June 27, 2016</a></h1>
+    <p class="meta">Posted 27 Jun 2016 by Todd Lipcon</p>
+  </header>
+  <div class="entry-content">
+    
+    <p>Welcome to the fifteenth edition of the Kudu Weekly Update. This weekly blog post
+covers ongoing development and news in the Apache Kudu (incubating) project.</p>
+
+
+    
+  </div>
+  <div class="read-full">
+    <a class="btn btn-info" href="/2016/06/27/weekly-update.html">Read full post...</a>
+  </div>
+</article>
+
+
+
+<!-- Articles -->
+<article>
+  <header>
     <h1 class="entry-title"><a href="/2016/06/24/multi-master-1-0-0.html">Master fault tolerance in Kudu 1.0</a></h1>
     <p class="meta">Posted 24 Jun 2016 by Adar Dembo</p>
   </header>
@@ -196,37 +217,6 @@ covers ongoing development and news in the Apache Kudu (incubating) project.</p>
 
 
 
-<!-- Articles -->
-<article>
-  <header>
-    <h1 class="entry-title"><a href="/2016/06/10/apache-kudu-0-9-0-released.html">Apache Kudu (incubating) 0.9.0 released</a></h1>
-    <p class="meta">Posted 10 Jun 2016 by Jean-Daniel Cryans</p>
-  </header>
-  <div class="entry-content">
-    
-    <p>The Apache Kudu (incubating) team is happy to announce the release of Kudu
-0.9.0!</p>
-
-<p>This latest version adds basic UPSERT functionality and an improved Apache Spark Data Source
-that doesn&#8217;t rely on the MapReduce I/O formats. It also improves Tablet Server
-restart time as well as write performance under high load. Finally, Kudu now enforces
-the specification of a partitioning scheme for new tables.</p>
-
-<ul>
-  <li>Read the detailed <a href="http://kudu.apache.org/releases/0.9.0/docs/release_notes.html">Kudu 0.9.0 release notes</a></li>
-  <li>Download the <a href="http://kudu.apache.org/releases/0.9.0/">Kudu 0.9.0 source release</a></li>
-</ul>
-
-
-    
-  </div>
-  <div class="read-full">
-    <a class="btn btn-info" href="/2016/06/10/apache-kudu-0-9-0-released.html">Read full post...</a>
-  </div>
-</article>
-
-
-
 <!-- Pagination links -->
 
 <nav>
@@ -247,6 +237,8 @@ the specification of a partitioning scheme for new tables.</p>
     <h3>Recent posts</h3>
     <ul>
     
+      <li> <a href="/2017/04/19/apache-kudu-1-3-1-released.html">Apache Kudu 1.3.1 released</a> </li>
+    
       <li> <a href="/2017/03/20/apache-kudu-1-3-0-released.html">Apache Kudu 1.3.0 released</a> </li>
     
       <li> <a href="/2017/01/20/apache-kudu-1-2-0-released.html">Apache Kudu 1.2.0 released</a> </li>
@@ -275,8 +267,6 @@ the specification of a partitioning scheme for new tables.</p>
     
       <li> <a href="/2016/08/08/weekly-update.html">Apache Kudu Weekly Update August 8th, 2016</a> </li>
     
-      <li> <a href="/2016/07/26/weekly-update.html">Apache Kudu Weekly Update July 26, 2016</a> </li>
-    
     </ul>
   </div>
 </div>

http://git-wip-us.apache.org/repos/asf/kudu-site/blob/bcbdb4d8/blog/page/6/index.html
----------------------------------------------------------------------
diff --git a/blog/page/6/index.html b/blog/page/6/index.html
index ec37fd0..9f0c613 100644
--- a/blog/page/6/index.html
+++ b/blog/page/6/index.html
@@ -111,6 +111,37 @@
 <!-- Articles -->
 <article>
   <header>
+    <h1 class="entry-title"><a href="/2016/06/10/apache-kudu-0-9-0-released.html">Apache Kudu (incubating) 0.9.0 released</a></h1>
+    <p class="meta">Posted 10 Jun 2016 by Jean-Daniel Cryans</p>
+  </header>
+  <div class="entry-content">
+    
+    <p>The Apache Kudu (incubating) team is happy to announce the release of Kudu
+0.9.0!</p>
+
+<p>This latest version adds basic UPSERT functionality and an improved Apache Spark Data Source
+that doesn&#8217;t rely on the MapReduce I/O formats. It also improves Tablet Server
+restart time as well as write performance under high load. Finally, Kudu now enforces
+the specification of a partitioning scheme for new tables.</p>
+
+<ul>
+  <li>Read the detailed <a href="http://kudu.apache.org/releases/0.9.0/docs/release_notes.html">Kudu 0.9.0 release notes</a></li>
+  <li>Download the <a href="http://kudu.apache.org/releases/0.9.0/">Kudu 0.9.0 source release</a></li>
+</ul>
+
+
+    
+  </div>
+  <div class="read-full">
+    <a class="btn btn-info" href="/2016/06/10/apache-kudu-0-9-0-released.html">Read full post...</a>
+  </div>
+</article>
+
+
+
+<!-- Articles -->
+<article>
+  <header>
     <h1 class="entry-title"><a href="/2016/06/06/weekly-update.html">Apache Kudu (incubating) Weekly Update June 6, 2016</a></h1>
     <p class="meta">Posted 06 Jun 2016 by Jean-Daniel Cryans</p>
   </header>
@@ -194,27 +225,6 @@ covers ongoing development and news in the Apache Kudu (incubating) project.</p>
 
 
 
-<!-- Articles -->
-<article>
-  <header>
-    <h1 class="entry-title"><a href="/2016/05/16/weekly-update.html">Apache Kudu (incubating) Weekly Update May 16, 2016</a></h1>
-    <p class="meta">Posted 16 May 2016 by Todd Lipcon</p>
-  </header>
-  <div class="entry-content">
-    
-    <p>Welcome to the ninth edition of the Kudu Weekly Update. This weekly blog post
-covers ongoing development and news in the Apache Kudu (incubating) project.</p>
-
-
-    
-  </div>
-  <div class="read-full">
-    <a class="btn btn-info" href="/2016/05/16/weekly-update.html">Read full post...</a>
-  </div>
-</article>
-
-
-
 <!-- Pagination links -->
 
 <nav>
@@ -235,6 +245,8 @@ covers ongoing development and news in the Apache Kudu (incubating) project.</p>
     <h3>Recent posts</h3>
     <ul>
     
+      <li> <a href="/2017/04/19/apache-kudu-1-3-1-released.html">Apache Kudu 1.3.1 released</a> </li>
+    
       <li> <a href="/2017/03/20/apache-kudu-1-3-0-released.html">Apache Kudu 1.3.0 released</a> </li>
     
       <li> <a href="/2017/01/20/apache-kudu-1-2-0-released.html">Apache Kudu 1.2.0 released</a> </li>
@@ -263,8 +275,6 @@ covers ongoing development and news in the Apache Kudu (incubating) project.</p>
     
       <li> <a href="/2016/08/08/weekly-update.html">Apache Kudu Weekly Update August 8th, 2016</a> </li>
     
-      <li> <a href="/2016/07/26/weekly-update.html">Apache Kudu Weekly Update July 26, 2016</a> </li>
-    
     </ul>
   </div>
 </div>

http://git-wip-us.apache.org/repos/asf/kudu-site/blob/bcbdb4d8/blog/page/7/index.html
----------------------------------------------------------------------
diff --git a/blog/page/7/index.html b/blog/page/7/index.html
index 249e724..12d3ad2 100644
--- a/blog/page/7/index.html
+++ b/blog/page/7/index.html
@@ -111,6 +111,27 @@
 <!-- Articles -->
 <article>
   <header>
+    <h1 class="entry-title"><a href="/2016/05/16/weekly-update.html">Apache Kudu (incubating) Weekly Update May 16, 2016</a></h1>
+    <p class="meta">Posted 16 May 2016 by Todd Lipcon</p>
+  </header>
+  <div class="entry-content">
+    
+    <p>Welcome to the ninth edition of the Kudu Weekly Update. This weekly blog post
+covers ongoing development and news in the Apache Kudu (incubating) project.</p>
+
+
+    
+  </div>
+  <div class="read-full">
+    <a class="btn btn-info" href="/2016/05/16/weekly-update.html">Read full post...</a>
+  </div>
+</article>
+
+
+
+<!-- Articles -->
+<article>
+  <header>
     <h1 class="entry-title"><a href="/2016/05/09/weekly-update.html">Apache Kudu (incubating) Weekly Update May 9, 2016</a></h1>
     <p class="meta">Posted 09 May 2016 by Jean-Daniel Cryans</p>
   </header>
@@ -191,29 +212,6 @@ covers ongoing development and news in the Apache Kudu (incubating) project.</p>
 
 
 
-<!-- Articles -->
-<article>
-  <header>
-    <h1 class="entry-title"><a href="/2016/04/19/kudu-0-8-0-predicate-improvements.html">Predicate Improvements in Kudu 0.8</a></h1>
-    <p class="meta">Posted 19 Apr 2016 by Dan Burkert</p>
-  </header>
-  <div class="entry-content">
-    
-    <p>The recently released Kudu version 0.8 ships with a host of new improvements to
-scan predicates. Performance and usability have been improved, especially for
-tables taking advantage of <a href="http://kudu.apache.org/docs/schema_design.html#data-distribution">advanced partitioning
-options</a>.</p>
-
-
-    
-  </div>
-  <div class="read-full">
-    <a class="btn btn-info" href="/2016/04/19/kudu-0-8-0-predicate-improvements.html">Read full post...</a>
-  </div>
-</article>
-
-
-
 <!-- Pagination links -->
 
 <nav>
@@ -234,6 +232,8 @@ options</a>.</p>
     <h3>Recent posts</h3>
     <ul>
     
+      <li> <a href="/2017/04/19/apache-kudu-1-3-1-released.html">Apache Kudu 1.3.1 released</a> </li>
+    
       <li> <a href="/2017/03/20/apache-kudu-1-3-0-released.html">Apache Kudu 1.3.0 released</a> </li>
     
       <li> <a href="/2017/01/20/apache-kudu-1-2-0-released.html">Apache Kudu 1.2.0 released</a> </li>
@@ -262,8 +262,6 @@ options</a>.</p>
     
       <li> <a href="/2016/08/08/weekly-update.html">Apache Kudu Weekly Update August 8th, 2016</a> </li>
     
-      <li> <a href="/2016/07/26/weekly-update.html">Apache Kudu Weekly Update July 26, 2016</a> </li>
-    
     </ul>
   </div>
 </div>

http://git-wip-us.apache.org/repos/asf/kudu-site/blob/bcbdb4d8/blog/page/8/index.html
----------------------------------------------------------------------
diff --git a/blog/page/8/index.html b/blog/page/8/index.html
index 22e3366..e836166 100644
--- a/blog/page/8/index.html
+++ b/blog/page/8/index.html
@@ -111,6 +111,29 @@
 <!-- Articles -->
 <article>
   <header>
+    <h1 class="entry-title"><a href="/2016/04/19/kudu-0-8-0-predicate-improvements.html">Predicate Improvements in Kudu 0.8</a></h1>
+    <p class="meta">Posted 19 Apr 2016 by Dan Burkert</p>
+  </header>
+  <div class="entry-content">
+    
+    <p>The recently released Kudu version 0.8 ships with a host of new improvements to
+scan predicates. Performance and usability have been improved, especially for
+tables taking advantage of <a href="http://kudu.apache.org/docs/schema_design.html#data-distribution">advanced partitioning
+options</a>.</p>
+
+
+    
+  </div>
+  <div class="read-full">
+    <a class="btn btn-info" href="/2016/04/19/kudu-0-8-0-predicate-improvements.html">Read full post...</a>
+  </div>
+</article>
+
+
+
+<!-- Articles -->
+<article>
+  <header>
     <h1 class="entry-title"><a href="/2016/04/18/weekly-update.html">Apache Kudu (incubating) Weekly Update April 18, 2016</a></h1>
     <p class="meta">Posted 18 Apr 2016 by Todd Lipcon</p>
   </header>
@@ -211,27 +234,6 @@ covers ongoing development and news in the Apache Kudu (incubating) project.</p>
 
 
 
-<!-- Articles -->
-<article>
-  <header>
-    <h1 class="entry-title"><a href="/2016/04/04/weekly-update.html">Apache Kudu (incubating) Weekly Update April 4, 2016</a></h1>
-    <p class="meta">Posted 04 Apr 2016 by Todd Lipcon</p>
-  </header>
-  <div class="entry-content">
-    
-    <p>Welcome to the third edition of the Kudu Weekly Update. This weekly blog post
-covers ongoing development and news in the Apache Kudu (incubating) project.</p>
-
-
-    
-  </div>
-  <div class="read-full">
-    <a class="btn btn-info" href="/2016/04/04/weekly-update.html">Read full post...</a>
-  </div>
-</article>
-
-
-
 <!-- Pagination links -->
 
 <nav>
@@ -252,6 +254,8 @@ covers ongoing development and news in the Apache Kudu (incubating) project.</p>
     <h3>Recent posts</h3>
     <ul>
     
+      <li> <a href="/2017/04/19/apache-kudu-1-3-1-released.html">Apache Kudu 1.3.1 released</a> </li>
+    
       <li> <a href="/2017/03/20/apache-kudu-1-3-0-released.html">Apache Kudu 1.3.0 released</a> </li>
     
       <li> <a href="/2017/01/20/apache-kudu-1-2-0-released.html">Apache Kudu 1.2.0 released</a> </li>
@@ -280,8 +284,6 @@ covers ongoing development and news in the Apache Kudu (incubating) project.</p>
     
       <li> <a href="/2016/08/08/weekly-update.html">Apache Kudu Weekly Update August 8th, 2016</a> </li>
     
-      <li> <a href="/2016/07/26/weekly-update.html">Apache Kudu Weekly Update July 26, 2016</a> </li>
-    
     </ul>
   </div>
 </div>

http://git-wip-us.apache.org/repos/asf/kudu-site/blob/bcbdb4d8/blog/page/9/index.html
----------------------------------------------------------------------
diff --git a/blog/page/9/index.html b/blog/page/9/index.html
index d7533db..159165f 100644
--- a/blog/page/9/index.html
+++ b/blog/page/9/index.html
@@ -111,6 +111,27 @@
 <!-- Articles -->
 <article>
   <header>
+    <h1 class="entry-title"><a href="/2016/04/04/weekly-update.html">Apache Kudu (incubating) Weekly Update April 4, 2016</a></h1>
+    <p class="meta">Posted 04 Apr 2016 by Todd Lipcon</p>
+  </header>
+  <div class="entry-content">
+    
+    <p>Welcome to the third edition of the Kudu Weekly Update. This weekly blog post
+covers ongoing development and news in the Apache Kudu (incubating) project.</p>
+
+
+    
+  </div>
+  <div class="read-full">
+    <a class="btn btn-info" href="/2016/04/04/weekly-update.html">Read full post...</a>
+  </div>
+</article>
+
+
+
+<!-- Articles -->
+<article>
+  <header>
     <h1 class="entry-title"><a href="/2016/03/28/weekly-update.html">Apache Kudu (incubating) Weekly Update March 28, 2016</a></h1>
     <p class="meta">Posted 28 Mar 2016 by Todd Lipcon</p>
   </header>
@@ -223,6 +244,8 @@ part of the ASF Incubator, version 0.7.0!</p>
     <h3>Recent posts</h3>
     <ul>
     
+      <li> <a href="/2017/04/19/apache-kudu-1-3-1-released.html">Apache Kudu 1.3.1 released</a> </li>
+    
       <li> <a href="/2017/03/20/apache-kudu-1-3-0-released.html">Apache Kudu 1.3.0 released</a> </li>
     
       <li> <a href="/2017/01/20/apache-kudu-1-2-0-released.html">Apache Kudu 1.2.0 released</a> </li>
@@ -251,8 +274,6 @@ part of the ASF Incubator, version 0.7.0!</p>
     
       <li> <a href="/2016/08/08/weekly-update.html">Apache Kudu Weekly Update August 8th, 2016</a> </li>
     
-      <li> <a href="/2016/07/26/weekly-update.html">Apache Kudu Weekly Update July 26, 2016</a> </li>
-    
     </ul>
   </div>
 </div>

http://git-wip-us.apache.org/repos/asf/kudu-site/blob/bcbdb4d8/feed.xml
----------------------------------------------------------------------
diff --git a/feed.xml b/feed.xml
index e40e144..8a9ef30 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,4 +1,31 @@
-<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom"><generator uri="http://jekyllrb.com" version="2.5.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2017-04-18T22:31:45-07:00</updated><id>/</id><entry><title>Apache Kudu 1.3.0 released</title><link href="/2017/03/20/apache-kudu-1-3-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.3.0 released" /><published>2017-03-20T00:00:00-07:00</published><updated>2017-03-20T00:00:00-07:00</updated><id>/2017/03/20/apache-kudu-1-3-0-released</id><content type="html" xml:base="/2017/03/20/apache-kudu-1-3-0-released.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.3.0!&lt;/p&gt;
+<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom"><generator uri="http://jekyllrb.com" version="2.5.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2017-04-19T10:10:47-07:00</updated><id>/</id><entry><title>Apache Kudu 1.3.1 released</title><link href="/2017/04/19/apache-kudu-1-3-1-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.3.1 released" /><published>2017-04-19T00:00:00-07:00</published><updated>2017-04-19T00:00:00-07:00</updated><id>/2017/04/19/apache-kudu-1-3-1-released</id><content type="html" xml:base="/2017/04/19/apache-kudu-1-3-1-released.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.3.1!&lt;/p&gt;
+
+&lt;p&gt;Apache Kudu 1.3.1 is a bug fix release which fixes critical issues discovered
+in Apache Kudu 1.3.0. In particular, this fixes a bug in which data could be
+incorrectly deleted after certain sequences of node failures. Several other
+bugs are also fixed. See the release notes for details.&lt;/p&gt;
+
+&lt;p&gt;Users of Kudu 1.3.0 are encouraged to upgrade to 1.3.1 immediately.&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Download the &lt;a href=&quot;/releases/1.3.1/&quot;&gt;Kudu 1.3.1 source release&lt;/a&gt;&lt;/li&gt;
+  &lt;li&gt;Convenience binary artifacts for the Java client and various Java
+integrations (e.g. Spark, Flume) are also now available via the ASF Maven
+repository.&lt;/li&gt;
+&lt;/ul&gt;</content><author><name>Todd Lipcon</name></author><summary>The Apache Kudu team is happy to announce the release of Kudu 1.3.1!
+
+Apache Kudu 1.3.1 is a bug fix release which fixes critical issues discovered
+in Apache Kudu 1.3.0. In particular, this fixes a bug in which data could be
+incorrectly deleted after certain sequences of node failures. Several other
+bugs are also fixed. See the release notes for details.
+
+Users of Kudu 1.3.0 are encouraged to upgrade to 1.3.1 immediately.
+
+
+  Download the Kudu 1.3.1 source release
+  Convenience binary artifacts for the Java client and various Java
+integrations (e.g. Spark, Flume) are also now available via the ASF Maven
+repository.</summary></entry><entry><title>Apache Kudu 1.3.0 released</title><link href="/2017/03/20/apache-kudu-1-3-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.3.0 released" /><published>2017-03-20T00:00:00-07:00</published><updated>2017-03-20T00:00:00-07:00</updated><id>/2017/03/20/apache-kudu-1-3-0-released</id><content type="html" xml:base="/2017/03/20/apache-kudu-1-3-0-released.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.3.0!&lt;/p&gt;
 
 &lt;p&gt;Apache Kudu 1.3 is a minor release which adds various new features,
 improvements, bug fixes, and optimizations on top of Kudu
@@ -748,592 +775,4 @@ incubating to a Top Level Apache project. I can’t express enough how grateful
 am for the amount of support I got from the Kudu team, from the intern
 coordinators, and from the Cloudera community as a whole.&lt;/p&gt;</content><author><name>Andrew Wong</name></author><summary>I had the pleasure of interning with the Apache Kudu team at Cloudera this
 summer. This project was my summer contribution to Kudu: a restructuring of the
-scan path to speed up queries.</summary></entry><entry><title>An Introduction to the Flume Kudu Sink</title><link href="/2016/08/31/intro-flume-kudu-sink.html" rel="alternate" type="text/html" title="An Introduction to the Flume Kudu Sink" /><published>2016-08-31T00:00:00-07:00</published><updated>2016-08-31T00:00:00-07:00</updated><id>/2016/08/31/intro-flume-kudu-sink</id><content type="html" xml:base="/2016/08/31/intro-flume-kudu-sink.html">&lt;p&gt;This post discusses the Kudu Flume Sink. First, I’ll give some background on why we considered
-using Kudu, what Flume does for us, and how Flume fits with Kudu in our project.&lt;/p&gt;
-
-&lt;h2 id=&quot;why-kudu&quot;&gt;Why Kudu&lt;/h2&gt;
-
-&lt;p&gt;Traditionally in the Hadoop ecosystem we’ve dealt with various &lt;em&gt;batch processing&lt;/em&gt; technologies such
-as MapReduce and the many libraries and tools built on top of it in various languages (Apache Pig,
-Apache Hive, Apache Oozie and many others). The main problem with this approach is that it needs to
-process the whole data set in batches, again and again, as soon as new data gets added. Things get
-really complicated when a few such tasks need to get chained together, or when the same data set
-needs to be processed in various ways by different jobs, while all compete for the shared cluster
-resources.&lt;/p&gt;
-
-&lt;p&gt;The opposite of this approach is &lt;em&gt;stream processing&lt;/em&gt;: process the data as soon as it arrives, not
-in batches. Streaming systems such as Spark Streaming, Storm, Kafka Streams, and many others make
-this possible. But writing streaming services is not trivial. The streaming systems are becoming
-more and more capable and support more complex constructs, but they are not yet easy to use. All
-queries and processes need to be carefully planned and implemented.&lt;/p&gt;
-
-&lt;p&gt;To summarize, &lt;em&gt;batch processing&lt;/em&gt; is:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;file-based&lt;/li&gt;
-  &lt;li&gt;a paradigm that processes large chunks of data as a group&lt;/li&gt;
-  &lt;li&gt;high latency and high throughput, both for ingest and query&lt;/li&gt;
-  &lt;li&gt;typically easy to program, but hard to orchestrate&lt;/li&gt;
-  &lt;li&gt;well suited for writing ad-hoc queries, although they are typically high latency&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;While &lt;em&gt;stream processing&lt;/em&gt; is:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;a totally different paradigm, which involves single events and time windows instead of large groups of events&lt;/li&gt;
-  &lt;li&gt;still file-based and not a long-term database&lt;/li&gt;
-  &lt;li&gt;not batch-oriented, but incremental&lt;/li&gt;
-  &lt;li&gt;ultra-fast ingest and ultra-fast query (query results basically pre-calculated)&lt;/li&gt;
-  &lt;li&gt;not so easy to program, relatively easy to orchestrate&lt;/li&gt;
-  &lt;li&gt;impossible to write ad-hoc queries&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;And a Kudu-based &lt;em&gt;near real-time&lt;/em&gt; approach is:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;flexible and expressive, thanks to SQL support via Apache Impala (incubating)&lt;/li&gt;
-  &lt;li&gt;a table-oriented, mutable data store that feels like a traditional relational database&lt;/li&gt;
-  &lt;li&gt;very easy to program, you can even pretend it’s good old MySQL&lt;/li&gt;
-  &lt;li&gt;low-latency and relatively high throughput, both for ingest and query&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;At Argyle Data, we’re dealing with complex fraud detection scenarios. We need to ingest massive
-amounts of data, run machine learning algorithms and generate reports. When we created our current
-architecture two years ago we decided to opt for a database as the backbone of our system. That
-database is Apache Accumulo. It’s a key-value based database which runs on top of Hadoop HDFS,
-quite similar to HBase but with some important improvements such as cell level security and ease
-of deployment and management. To enable querying of this data for quite complex reporting and
-analytics, we used Presto, a distributed query engine with a pluggable architecture open-sourced
-by Facebook. We wrote a connector for it to let it run queries against the Accumulo database. This
-architecture has served us well, but there were a few problems:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;we need to ingest even more massive volumes of data in real-time&lt;/li&gt;
-  &lt;li&gt;we need to perform complex machine-learning calculations on even larger data-sets&lt;/li&gt;
-  &lt;li&gt;we need to support ad-hoc queries, plus long-term data warehouse functionality&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;So, we’ve started gradually moving the core machine-learning pipeline to a streaming based
-solution. This way we can ingest and process larger data-sets faster in the real-time. But then how
-would we take care of ad-hoc queries and long-term persistence? This is where Kudu comes in. While
-the machine learning pipeline ingests and processes real-time data, we store a copy of the same
-ingested data in Kudu for long-term access and ad-hoc queries. Kudu is our &lt;em&gt;data warehouse&lt;/em&gt;. By
-using Kudu and Impala, we can retire our in-house Presto connector and rely on Impala’s
-super-fast query engine.&lt;/p&gt;
-
-&lt;p&gt;But how would we make sure data is reliably ingested into the streaming pipeline &lt;em&gt;and&lt;/em&gt; the
-Kudu-based data warehouse? This is where Apache Flume comes in.&lt;/p&gt;
-
-&lt;h2 id=&quot;why-flume&quot;&gt;Why Flume&lt;/h2&gt;
-
-&lt;p&gt;According to their &lt;a href=&quot;http://flume.apache.org/&quot;&gt;website&lt;/a&gt; “Flume is a distributed, reliable, and
-available service for efficiently collecting, aggregating, and moving large amounts of log data.
-It has a simple and flexible architecture based on streaming data flows. It is robust and fault
-tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.” As you
-can see, nowhere is Hadoop mentioned but Flume is typically used for ingesting data to Hadoop
-clusters.&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;https://blogs.apache.org/flume/mediaresource/ab0d50f6-a960-42cc-971e-3da38ba3adad&quot; alt=&quot;png&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Flume has an extensible architecture. An instance of Flume, called an &lt;em&gt;agent&lt;/em&gt;, can have multiple
-&lt;em&gt;channels&lt;/em&gt;, with each having multiple &lt;em&gt;sources&lt;/em&gt; and &lt;em&gt;sinks&lt;/em&gt; of various types. Sources queue data
-in channels, which in turn write out data to sinks. Such &lt;em&gt;pipelines&lt;/em&gt; can be chained together to
-create even more complex ones. There may be more than one agent and agents can be configured to
-support failover and recovery.&lt;/p&gt;
-
-&lt;p&gt;Flume comes with a bunch of built-in types of channels, sources and sinks. Memory channel is the
-default (an in-memory queue with no persistence to disk), but other options such as Kafka- and
-File-based channels are also provided. As for the sources, Avro, JMS, Thrift, spooling directory
-source are some of the built-in ones. Flume also ships with many sinks, including sinks for writing
-data to HDFS, HBase, Hive, Kafka, as well as to other Flume agents.&lt;/p&gt;
-
-&lt;p&gt;In the rest of this post I’ll go over the Kudu Flume sink and show you how to configure Flume to
-write ingested data to a Kudu table. The sink has been part of the Kudu distribution since the 0.8
-release and the source code can be found &lt;a href=&quot;https://github.com/apache/kudu/tree/master/java/kudu-flume-sink&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
-
-&lt;h2 id=&quot;configuring-the-kudu-flume-sink&quot;&gt;Configuring the Kudu Flume Sink&lt;/h2&gt;
-
-&lt;p&gt;Here is a sample flume configuration file:&lt;/p&gt;
-
-&lt;pre&gt;&lt;code&gt;agent1.sources  = source1
-agent1.channels = channel1
-agent1.sinks = sink1
-
-agent1.sources.source1.type = exec
-agent1.sources.source1.command = /usr/bin/vmstat 1
-agent1.sources.source1.channels = channel1
-
-agent1.channels.channel1.type = memory
-agent1.channels.channel1.capacity = 10000
-agent1.channels.channel1.transactionCapacity = 1000
-
-agent1.sinks.sink1.type = org.apache.flume.sink.kudu.KuduSink
-agent1.sinks.sink1.masterAddresses = localhost
-agent1.sinks.sink1.tableName = stats
-agent1.sinks.sink1.channel = channel1
-agent1.sinks.sink1.batchSize = 50
-agent1.sinks.sink1.producer = org.apache.kudu.flume.sink.SimpleKuduEventProducer
-&lt;/code&gt;&lt;/pre&gt;
-
-&lt;p&gt;We define a source called &lt;code&gt;source1&lt;/code&gt; which simply executes a &lt;code&gt;vmstat&lt;/code&gt; command to continuously generate
-virtual memory statistics for the machine and queue events into an in-memory &lt;code&gt;channel1&lt;/code&gt; channel,
-which in turn is used for writing these events to a Kudu table called &lt;code&gt;stats&lt;/code&gt;. We are using
-&lt;code&gt;org.apache.kudu.flume.sink.SimpleKuduEventProducer&lt;/code&gt; as the producer. &lt;code&gt;SimpleKuduEventProducer&lt;/code&gt; is
-the built-in and default producer, but it’s implemented as a showcase for how to write Flume
-events into Kudu tables. For any serious functionality we’d have to write a custom producer. We
-need to make this producer and the &lt;code&gt;KuduSink&lt;/code&gt; class available to Flume. We can do that by simply
-copying the &lt;code&gt;kudu-flume-sink-&amp;lt;VERSION&amp;gt;.jar&lt;/code&gt; jar file from the Kudu distribution to the
-&lt;code&gt;$FLUME_HOME/plugins.d/kudu-sink/lib&lt;/code&gt; directory in the Flume installation. The jar file contains
-&lt;code&gt;KuduSink&lt;/code&gt; and all of its dependencies (including Kudu java client classes).&lt;/p&gt;
-
-&lt;p&gt;At a minimum, the Kudu Flume Sink needs to know where the Kudu masters are
-(&lt;code&gt;agent1.sinks.sink1.masterAddresses = localhost&lt;/code&gt;) and which Kudu table should be used for writing
-Flume events to (&lt;code&gt;agent1.sinks.sink1.tableName = stats&lt;/code&gt;). The Kudu Flume Sink doesn’t create this
-table, it has to be created before the Kudu Flume Sink is started.&lt;/p&gt;
-
-&lt;p&gt;You may also notice the &lt;code&gt;batchSize&lt;/code&gt; parameter. Batch size is used for batching up to that many
-Flume events and flushing the entire batch in one shot. Tuning batchSize properly can have a huge
-impact on ingest performance of the Kudu cluster.&lt;/p&gt;
-
-&lt;p&gt;Here is a complete list of KuduSink parameters:&lt;/p&gt;
-
-&lt;table&gt;
-  &lt;thead&gt;
-    &lt;tr&gt;
-      &lt;th&gt;Parameter Name&lt;/th&gt;
-      &lt;th&gt;Default&lt;/th&gt;
-      &lt;th&gt;Description&lt;/th&gt;
-    &lt;/tr&gt;
-  &lt;/thead&gt;
-  &lt;tbody&gt;
-    &lt;tr&gt;
-      &lt;td&gt;masterAddresses&lt;/td&gt;
-      &lt;td&gt;N/A&lt;/td&gt;
-      &lt;td&gt;Comma-separated list of “host:port” pairs of the masters (port optional)&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td&gt;tableName&lt;/td&gt;
-      &lt;td&gt;N/A&lt;/td&gt;
-      &lt;td&gt;The name of the table in Kudu to write to&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td&gt;producer&lt;/td&gt;
-      &lt;td&gt;org.apache.kudu.flume.sink.SimpleKuduEventProducer&lt;/td&gt;
-      &lt;td&gt;The fully qualified class name of the Kudu event producer the sink should use&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td&gt;batchSize&lt;/td&gt;
-      &lt;td&gt;100&lt;/td&gt;
-      &lt;td&gt;Maximum number of events the sink should take from the channel per transaction, if available&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td&gt;timeoutMillis&lt;/td&gt;
-      &lt;td&gt;30000&lt;/td&gt;
-      &lt;td&gt;Timeout period for Kudu operations, in milliseconds&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td&gt;ignoreDuplicateRows&lt;/td&gt;
-      &lt;td&gt;true&lt;/td&gt;
-      &lt;td&gt;Whether to ignore errors indicating that we attempted to insert duplicate rows into Kudu&lt;/td&gt;
-    &lt;/tr&gt;
-  &lt;/tbody&gt;
-&lt;/table&gt;
-
-&lt;p&gt;Let’s take a look at the source code for the built-in producer class:&lt;/p&gt;
-
-&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;public class SimpleKuduEventProducer implements KuduEventProducer {
-  private byte[] payload;
-  private KuduTable table;
-  private String payloadColumn;
-
-  public SimpleKuduEventProducer(){
-  }
-
-  @Override
-  public void configure(Context context) {
-    payloadColumn = context.getString(&quot;payloadColumn&quot;,&quot;payload&quot;);
-  }
-
-  @Override
-  public void configure(ComponentConfiguration conf) {
-  }
-
-  @Override
-  public void initialize(Event event, KuduTable table) {
-    this.payload = event.getBody();
-    this.table = table;
-  }
-
-  @Override
-  public List&amp;lt;Operation&amp;gt; getOperations() throws FlumeException {
-    try {
-      Insert insert = table.newInsert();
-      PartialRow row = insert.getRow();
-      row.addBinary(payloadColumn, payload);
-
-      return Collections.singletonList((Operation) insert);
-    } catch (Exception e){
-      throw new FlumeException(&quot;Failed to create Kudu Insert object!&quot;, e);
-    }
-  }
-
-  @Override
-  public void close() {
-  }
-}
-&lt;/code&gt;&lt;/pre&gt;
-
-&lt;p&gt;&lt;code&gt;SimpleKuduEventProducer&lt;/code&gt; implements the &lt;code&gt;org.apache.kudu.flume.sink.KuduEventProducer&lt;/code&gt; interface,
-which itself looks like this:&lt;/p&gt;
-
-&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;public interface KuduEventProducer extends Configurable, ConfigurableComponent {
-  /**
-   * Initialize the event producer.
-   * @param event to be written to Kudu
-   * @param table the KuduTable object used for creating Kudu Operation objects
-   */
-  void initialize(Event event, KuduTable table);
-
-  /**
-   * Get the operations that should be written out to Kudu as a result of this
-   * event. This list is written to Kudu using the Kudu client API.
-   * @return List of {@link org.kududb.client.Operation} which
-   * are written as such to Kudu
-   */
-  List&amp;lt;Operation&amp;gt; getOperations();
-
-  /*
-   * Clean up any state. This will be called when the sink is being stopped.
-   */
-  void close();
-}
-&lt;/code&gt;&lt;/pre&gt;
-
-&lt;p&gt;&lt;code&gt;public void configure(Context context)&lt;/code&gt; is called when an instance of our producer is instantiated
-by the KuduSink. SimpleKuduEventProducer’s implementation looks for a producer parameter named
-&lt;code&gt;payloadColumn&lt;/code&gt; and uses its value (“payload” if not overridden in Flume configuration file) as the
-column which will hold the value of the Flume event payload. If you recall from above, we had
-configured the KuduSink to listen for events generated from the &lt;code&gt;vmstat&lt;/code&gt; command. Each output row
-from that command will be stored as a new row containing a &lt;code&gt;payload&lt;/code&gt; column in the &lt;code&gt;stats&lt;/code&gt; table.
-&lt;code&gt;SimpleKuduEventProducer&lt;/code&gt; does not have any configuration parameters, but if it had any we would
-define them by prefixing it with &lt;code&gt;producer.&lt;/code&gt; (&lt;code&gt;agent1.sinks.sink1.producer.parameter1&lt;/code&gt; for
-example).&lt;/p&gt;
-
-&lt;p&gt;The main producer logic resides in the &lt;code&gt;public List&amp;lt;Operation&amp;gt; getOperations()&lt;/code&gt; method. In
-SimpleKuduEventProducer’s implementation we simply insert the binary body of the Flume event into
-the Kudu table. Here we call Kudu’s &lt;code&gt;newInsert()&lt;/code&gt; to initiate an insert, but could have used
-&lt;code&gt;Upsert&lt;/code&gt; if updating an existing row was also an option, in fact there’s another producer
-implementation available for doing just that: &lt;code&gt;SimpleKeyedKuduEventProducer&lt;/code&gt;. Most probably you
-will need to write your own custom producer in the real world, but you can base your implementation
-on the built-in ones.&lt;/p&gt;
-
-&lt;p&gt;In the future, we plan to add more flexible event producer implementations so that creation of a
-custom event producer is not required to write data to Kudu. See
-&lt;a href=&quot;https://gerrit.cloudera.org/#/c/4034/&quot;&gt;here&lt;/a&gt; for a work-in-progress generic event producer for
-Avro-encoded Events.&lt;/p&gt;
-
-&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;
-
-&lt;p&gt;Kudu is a scalable data store which lets us ingest insane amounts of data per second. Apache Flume
-helps us aggregate data from various sources, and the Kudu Flume Sink lets us easily store
-the aggregated Flume events into Kudu. Together they enable us to create a data warehouse out of
-disparate sources.&lt;/p&gt;
-
-&lt;p&gt;&lt;em&gt;Ara Abrahamian is a software engineer at Argyle Data building fraud detection systems using
-sophisticated machine learning methods. Ara is the original author of the Flume Kudu Sink that
-is included in the Kudu distribution. You can follow him on Twitter at
-&lt;a href=&quot;https://twitter.com/ara_e&quot;&gt;@ara_e&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</content><author><name>Ara Abrahamian</name></author><summary>This post discusses the Kudu Flume Sink. First, I&amp;#8217;ll give some background on why we considered
-using Kudu, what Flume does for us, and how Flume fits with Kudu in our project.</summary></entry></feed>
+scan path to speed up queries.</summary></entry></feed>

