kylin-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From shaofeng...@apache.org
Subject kylin git commit: minor update on the blog
Date Tue, 18 Oct 2016 14:25:46 GMT
Repository: kylin
Updated Branches:
  refs/heads/document 38cd798d0 -> 6e3a4bc00


minor update on the blog


Project: http://git-wip-us.apache.org/repos/asf/kylin/repo
Commit: http://git-wip-us.apache.org/repos/asf/kylin/commit/6e3a4bc0
Tree: http://git-wip-us.apache.org/repos/asf/kylin/tree/6e3a4bc0
Diff: http://git-wip-us.apache.org/repos/asf/kylin/diff/6e3a4bc0

Branch: refs/heads/document
Commit: 6e3a4bc004ca2bbb050d3851cdac4b5a21794dd4
Parents: 38cd798
Author: shaofengshi <shaofengshi@apache.org>
Authored: Tue Oct 18 22:25:37 2016 +0800
Committer: shaofengshi <shaofengshi@apache.org>
Committed: Tue Oct 18 22:25:37 2016 +0800

----------------------------------------------------------------------
 .../_posts/blog/2016-10-18-new-nrt-streaming.md   | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kylin/blob/6e3a4bc0/website/_posts/blog/2016-10-18-new-nrt-streaming.md
----------------------------------------------------------------------
diff --git a/website/_posts/blog/2016-10-18-new-nrt-streaming.md b/website/_posts/blog/2016-10-18-new-nrt-streaming.md
index e1cb845..ad75636 100644
--- a/website/_posts/blog/2016-10-18-new-nrt-streaming.md
+++ b/website/_posts/blog/2016-10-18-new-nrt-streaming.md
@@ -18,16 +18,16 @@ While, that implementation was marked as "experimental" because it has
the follo
 
  * Others: hard to recover from accident, difficult to maintain the code, etc.
 
-To overcome these limitations, the Apache Kylin team developed the new streaming ([KYLIN-1726](https://issues.apache.org/jira/browse/KYLIN-1726))
with Kafka 0.10 API, it has been tested internally for some time, will release to public soon.
+To overcome these limitations, the Apache Kylin team developed the new streaming ([KYLIN-1726](https://issues.apache.org/jira/browse/KYLIN-1726))
with Kafka 0.10, it has been tested internally for some time, will release to public soon.
 
-The new design is a perfect implementation under Kylin 1.5's "Plug-in" architecture: treat
Kafka topic as a "Data Source" like Hive table, using an adapter to extract the data to HDFS;
the next steps are almost the same as from Hive. Figure 1 is a high level architecture of
the new design.
+The new design is a perfect implementation under Kylin 1.5's "plug-in" architecture: treat
Kafka topic as a "Data Source" like Hive table, using an adapter to extract the data to HDFS;
the next steps are almost the same as other cubes. Figure 1 is a high level architecture of
the new design.
 
 
 ![Kylin New Streaming Framework Architecture](/images/blog/new-streaming.png)
 
-The adapter to read Kafka messages is modified from [kafka-hadoop-loader](https://github.com/amient/kafka-hadoop-loader),
which is open sourced under Apache License V2.0; it starts a mapper for each Kafka partition,
reading and then saving the messages to HDFS; in next steps Kylin will be able to leverage
existing framework like MR to do the processing, this makes the solution scalable and fault-tolerant.

+The adapter to read Kafka messages is modified from [kafka-hadoop-loader](https://github.com/amient/kafka-hadoop-loader),
the author Michal Harish open sourced it under Apache License V2.0; it starts a mapper for
each Kafka partition, reading and then saving the messages to HDFS; so Kylin will be able
to leverage existing framework like MR to do the processing, this makes the solution scalable
and fault-tolerant. 
 
-To overcome the "data loss" problem, Kylin adds the start/end offset information on each
Cube segment, and then use the offsets as the partition value (no overlap is allowed); this
ensures no data be lost and 1 message be consumed at most once. To let the late/early message
can be queried, Cube segments allow overlap for the partition time dimension: Kylin will scan
all segments which include the queried time. Figure 2 illurates this.
+To overcome the "data loss" limitation, Kylin adds the start/end offset information on each
Cube segment, and then use the offsets as the partition value (no overlap allowed); this ensures
no data be lost and 1 message be consumed at most once. To let the late/early message can
be queried, Cube segments allow overlap for the partition time dimension: each segment has
a "min" date/time and a "max" date/time; Kylin will scan all segments which matched with the
queried time scope. Figure 2 illurates this.
 
 ![Use Offset to Cut Segments](/images/blog/offset-as-partition-value.png)
 
@@ -39,18 +39,20 @@ Other changes/enhancements are made in the new streaming:
  * Add REST API to trigger streaming cube's building
  * Add REST API to check and fill the segment holes
 
-The integration test result shows big improvements than the previous version:
+The integration test result is promising:
 
  * Scalability: it can easily process up to hundreds of million records in one build; 
- * Flexibility: trigger the build at any time with the frequency you want, e.g: every 5 minutes
in day and every hour in night; Kylin manages the offsets so it can resume from the last position;
- * Stability: pretty stable, no OutOfMemory error;
+ * Flexibility: you can trigger the build at any time, with the frequency you want; for example:
every 5 minutes in day time but every hour in night time, and even pause when you need do
a maintenance; Kylin manages the offsets so it can automatically continue from the last position;
+ * Stability: pretty stable, no OutOfMemoryError;
  * Management: user can check all jobs' status through Kylin's "Monitor" page or REST API;

  * Build Performance: in a testing cluster (8 AWS instances to consume Twitter streams),
10 thousands arrives per second, define a 9-dimension cube with 3 measures; when build interval
is 2 mintues, the job finishes in around 3 minutes; if change interval to 5 mintues, build
finishes in around 4 minutes;
 
 
-Here are a couple of screenshots in this test:
+Here are a couple of screenshots in this test, we may compose it as a step-by-step tutorial
in the future:
 ![Streaming Job Monitoring](/images/blog/streaming-monitor.png)
 
 ![Streaming Adapter](/images/blog/streaming-adapter.png)
 
 ![Streaming Twitter Sample](/images/blog/streaming-twitter.png)
+
+In short, this is a more robust Near Real Time Streaming OLAP solution (compared with the
previous version). Nextly, the Apache Kylin team will move toward a Real Time engine. 


Mime
View raw message