kylin-commits mailing list archives

From lid...@apache.org
Subject svn commit: r1790118 - in /kylin/site: ./ blog/ blog/2017/04/ blog/2017/04/01/ blog/2017/04/01/percentile-measure/ images/blog/
Date Tue, 04 Apr 2017 14:00:32 GMT
Author: lidong
Date: Tue Apr  4 14:00:31 2017
New Revision: 1790118

URL: http://svn.apache.org/viewvc?rev=1790118&view=rev
Log:
add blog about percentile measure

Added:
    kylin/site/blog/2017/04/
    kylin/site/blog/2017/04/01/
    kylin/site/blog/2017/04/01/percentile-measure/
    kylin/site/blog/2017/04/01/percentile-measure/index.html
    kylin/site/images/blog/percentile_1.png   (with props)
    kylin/site/images/blog/percentile_2.png   (with props)
    kylin/site/images/blog/percentile_3.png   (with props)
Modified:
    kylin/site/blog/index.html
    kylin/site/feed.xml

Added: kylin/site/blog/2017/04/01/percentile-measure/index.html
URL: http://svn.apache.org/viewvc/kylin/site/blog/2017/04/01/percentile-measure/index.html?rev=1790118&view=auto
==============================================================================
--- kylin/site/blog/2017/04/01/percentile-measure/index.html (added)
+++ kylin/site/blog/2017/04/01/percentile-measure/index.html Tue Apr  4 14:00:31 2017
@@ -0,0 +1,294 @@
+<!--
+* Licensed to the Apache Software Foundation (ASF) under one
+* or more contributor license agreements.  See the NOTICE file
+* distributed with this work for additional information
+* regarding copyright ownership.  The ASF licenses this file
+* to you under the Apache License, Version 2.0 (the
+* "License"); you may not use this file except in compliance
+* with the License.  You may obtain a copy of the License at
+*
+*     http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+-->
+<!doctype html>
+<html>
+
+<head>
+  <meta charset="utf-8">
+  <meta http-equiv="X-UA-Compatible" content="IE=edge">
+  <meta name="viewport" content="width=device-width, initial-scale=1">
+
+  <title>Apache Kylin | A new measure for Percentile precalculation</title>
+  <meta name="description" content="Introduction">
+  <meta name="author"      content="Apache Kylin">
+  <link rel="shortcut icon" href="fav.png" type="image/png">
+
+
+
+<link rel="stylesheet" href="/assets/css/animate.css">
+<!-- Bootstrap -->
+<link rel="stylesheet" href="/assets/css/bootstrap.min.css">
+
+<!-- Fonts -->
+<!-- <link rel="stylesheet" href="http://fonts.googleapis.com/css?family=Alice|Open+Sans:400,300,700">
-->
+
+<!-- Icons -->
+<link rel="stylesheet" href="/assets/css/font-awesome.min.css">
+
+  <!-- Custom styles -->
+  <link rel="stylesheet" href="/assets/css/styles.css">
+  <link rel="stylesheet" href="/assets/css/docs.css">
+  <link rel="stylesheet" href="/assets/css/pygments.css">
+
+  <link rel="canonical" href="http://kylin.apache.org/blog/2017/04/01/percentile-measure/">
+  <link rel="alternate" type="application/rss+xml" title="Apache Kylin" href="http://kylin.apache.org/feed.xml"
/>
+
+<!--[if lt IE 9]> <script src="assets/js/html5shiv.js"></script> <![endif]-->
+<script>
+  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
+  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
+
+  //original tracker for kylin.io
+  ga('create', 'UA-55534813-1', 'auto');
+  //new tracker for kylin.apache.org
+  ga('create', 'UA-55534813-2', 'auto', {'name':'toplevel'});
+
+  ga('send', 'pageview');
+  ga('toplevel.send', 'pageview');
+
+
+</script>
+<script type="text/javascript" src="/assets/js/jquery-1.9.1.min.js"></script>
+<script type="text/javascript" src="/assets/js/nside.js"></script>
+<script type="text/javascript" src="/assets/js/nnav.js"></script>
+</head>
+
+	<body>
+
+<header id="header" >
+  
+  <div id="head" class="parallax" parallax-speed="3" >
+    <div id="logo" class="text-center"> <img class="img-circle" id="circlelogo"
src="/assets/images/kylin_logo.jpg"> <span class="title" >Apache Kylin™</span>
<span class="tagline">Extreme OLAP Engine for Big Data</span> 
+    </div>
+    <div class="text-center" style="
+      position: relative;
+      top: 66px;
+      width: 1080px;
+      margin: 0 auto;
+      z-index: 11;
+      margin-top: -253px;
+      text-align: right;"
+    >
+      <a href="http://apache.org/foundation/contributing.html" title="Support Apache"
style="margin-left: 150px;">
+          <img src="https://www.apache.org/images/SupportApache-small.png" style="height:
150px; width: 150px;">
+      </a>
+    </div>  
+  </div>
+  
+
+  <!-- Main Menu -->
+  <nav class="navbar navbar-default" role="navigation" id="nav-wrapper">
+  <div class="container-fluid" id="nav">
+    <!--
+    <img class="img-circle" width="40px" height="40px" id="circlelogo" src="/assets/images/kylin_logo.jpg">
+    -->
+    <!-- Brand and toggle get grouped for better mobile display -->
+    <div class="navbar-header">
+      <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
+        <span class="sr-only">Toggle navigation</span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+      </button>
+     
+    </div>
+
+    <!-- Collect the nav links, forms, and other content for toggling -->
+    <div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
+      <ul class="nav navbar-nav">
+     <li><a href="/">Home</a></li>
+          <li><a href="/docs20" >Docs</a></li>
+          <li><a href="/download">Download</a></li>
+          <li><a href="/community" >Community</a></li>
+          <li><a href="/development" >Development</a></li>
+          <li><a href="/blog">Blog</a></li>
+          <li><a href="/cn" >中文版</a></li>  
+          <li><a href="https://twitter.com/apachekylin" target="_blank" class="fa
fa-twitter fa-lg" title="Twitter: @ApacheKylin" ></a></li>
+          <li><a href="https://github.com/apache/kylin" target="_blank" class="fa
fa-github-alt fa-lg" title="Github: apache/kylin" ></a></li>          
+          <li><a href="https://www.facebook.com/kylinio" target="_blank" class="fa
fa-facebook fa-lg" title="Facebook: kylin.io" ></a></li>   
+      </ul>      
+    </div><!-- /.navbar-collapse -->
+  </div><!-- /.container-fluid -->
+</nav>
+ </header>
+
+		<div class="page-content">
+			<header style=" padding:2em 0 0 0">
+			<div class="container" >
+				<h4 class="section-title"><span>Apache Kylin™ Technical Blog</span></h4>
+			</div>
+			</header>
+		</div>
+
+		<div class="container">
+			<div>
+				<article class="post-content" >	
+
+<div class="post" style=" padding:2em 4em 4em 4em">
+
+  <header class="post-header">
+    <h1 class="post-title">A new measure for Percentile precalculation</h1>
+    <p class="post-meta" >Apr 1, 2017 • Dong Li</p>
+  </header>
+
+  <article class="post-content" >
+    <h2 id="introduction">Introduction</h2>
+
+<p>Apache Kylin 2.0 introduces a new measure for percentile precalculation,
which aims at (sub-)second latency for <strong>approximate</strong> percentile
analytics SQL queries. The implementation is based on the <a href="https://github.com/tdunning/t-digest">t-digest</a>
library, released under the Apache 2.0 license, which provides a highly efficient data structure for storing
aggregation counters and an algorithm for calculating approximate percentile results.</p>
+
+<h3 id="percentile">Percentile</h3>
+<p><em>From <a href="https://en.wikipedia.org/wiki/Percentile">Wikipedia</a></em>:
A <strong>percentile</strong> (or a <strong>centile</strong>)
is a measure used in statistics indicating the value below which a given percentage of
observations in a group of observations fall. For example, the 20th percentile is the value
(or score) below which 20% of the observations may be found.</p>
+
+<p>Apache Kylin supports a SQL syntax similar to that of Apache Hive, with an aggregation
function called <strong>percentile(&lt;Number Column&gt;, &lt;Double&gt;)</strong>:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">seller_id</span><span class="p">,</span> <span
class="n">percentile</span><span class="p">(</span><span class="n">price</span><span
class="p">,</span> <span class="mi">0</span><span class="p">.</span><span
class="mi">5</span><span class="p">)</span>
+<span class="k">FROM</span> <span class="n">test_kylin_fact</span>
+<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">seller_id</span>
+</code></pre>
+</div>
+
+<h3 id="how-to-use">How to use</h3>
+<p>If you are new to <em>Cubes</em>, please read the <a href="http://kylin.apache.org/docs20/tutorial/kylin_sample.html">QuickStart</a>
first to learn the basics.</p>
+
+<p>First, add the column as a measure in the data model.</p>
+
+<p><img src="/images/blog/percentile_1.png" alt="" /></p>
+
+<p>Second, create a cube and add a PERCENTILE measure.</p>
+
+<p><img src="/images/blog/percentile_2.png" alt="" /></p>
+
+<p>Finally, build the cube and run some queries.</p>
+
+<p><img src="/images/blog/percentile_3.png" alt="" /></p>
+
+  </article>
+
+</div>
+
+
+
+
+
+				</article>
+			</div>
+		</div>		
+
+<footer id="underfooter">
+    <div class="container">
+        <div class="row">
+            <div class="col-md-12 widget">
+                <div class="widget-body" style="text-align:center">
+                    <a href="http://www.apache.org">
+                        <img id="asf-logo" alt="Apache Software Foundation" src="/assets/images/feather-small.gif">
+                    </a>
+
+                    <div>
+                        The contents of this website are © 2015 Apache Software Foundation
under the terms of the <a
+                            href="http://www.apache.org/licenses/LICENSE-2.0"> Apache
License v2 </a>. Apache Kylin and
+                        its logo are trademarks of the Apache Software Foundation.
+                    </div>
+
+                </div>
+            </div>
+        </div>
+        <!-- /row of widgets -->
+
+    </div>
+    <div></div>
+
+</footer>
+
+	<script src="/assets/js/jquery-1.9.1.min.js"></script> 
+	<script src="/assets/js/bootstrap.min.js"></script> 
+	<script src="/assets/js/main.js"></script>
+	</body>
+</html>
+
+
+
+

Modified: kylin/site/blog/index.html
URL: http://svn.apache.org/viewvc/kylin/site/blog/index.html?rev=1790118&r1=1790117&r2=1790118&view=diff
==============================================================================
--- kylin/site/blog/index.html (original)
+++ kylin/site/blog/index.html Tue Apr  4 14:00:31 2017
@@ -187,6 +187,12 @@
             
             <li>
         <h2 align="left" style="margin:0px">
+          <a class="post-link" href="/blog/2017/04/01/percentile-measure/">A new measure
for Percentile precalculation</a></h2><div align="left" class="post-meta">posted:
Apr 1, 2017</div>
+        
+      </li>
+    
+            <li>
+        <h2 align="left" style="margin:0px">
           <a class="post-link" href="/blog/2017/02/25/v2.0.0-beta-ready/">Apache Kylin
v2.0.0 Beta Announcement</a></h2><div align="left" class="post-meta">posted:
Feb 25, 2017</div>
         
       </li>
@@ -277,13 +283,13 @@
     
             <li>
         <h2 align="left" style="margin:0px">
-          <a class="post-link" href="/cn/blog/2016/05/26/release-v1.5.2/">Apache Kylin
v1.5.2 正式发布</a></h2><div align="left" class="post-meta">posted:
May 26, 2016</div>
+          <a class="post-link" href="/blog/2016/05/26/release-v1.5.2/">Apache Kylin
v1.5.2 Release Announcement</a></h2><div align="left" class="post-meta">posted:
May 26, 2016</div>
         
       </li>
     
             <li>
         <h2 align="left" style="margin:0px">
-          <a class="post-link" href="/blog/2016/05/26/release-v1.5.2/">Apache Kylin
v1.5.2 Release Announcement</a></h2><div align="left" class="post-meta">posted:
May 26, 2016</div>
+          <a class="post-link" href="/cn/blog/2016/05/26/release-v1.5.2/">Apache Kylin
v1.5.2 正式发布</a></h2><div align="left" class="post-meta">posted:
May 26, 2016</div>
         
       </li>
     
@@ -307,13 +313,13 @@
     
             <li>
         <h2 align="left" style="margin:0px">
-          <a class="post-link" href="/blog/2016/03/17/release-v1.5.0/">Apache Kylin
v1.5.0 Release Announcement</a></h2><div align="left" class="post-meta">posted:
Mar 17, 2016</div>
+          <a class="post-link" href="/cn/blog/2016/03/17/release-v1.5.0/">Apache Kylin
v1.5.0 正式发布</a></h2><div align="left" class="post-meta">posted:
Mar 17, 2016</div>
         
       </li>
     
             <li>
         <h2 align="left" style="margin:0px">
-          <a class="post-link" href="/cn/blog/2016/03/17/release-v1.5.0/">Apache Kylin
v1.5.0 正式发布</a></h2><div align="left" class="post-meta">posted:
Mar 17, 2016</div>
+          <a class="post-link" href="/blog/2016/03/17/release-v1.5.0/">Apache Kylin
v1.5.0 Release Announcement</a></h2><div align="left" class="post-meta">posted:
Mar 17, 2016</div>
         
       </li>
     

Modified: kylin/site/feed.xml
URL: http://svn.apache.org/viewvc/kylin/site/feed.xml?rev=1790118&r1=1790117&r2=1790118&view=diff
==============================================================================
--- kylin/site/feed.xml (original)
+++ kylin/site/feed.xml Tue Apr  4 14:00:31 2017
@@ -19,11 +19,52 @@
     <description>Apache Kylin Home</description>
     <link>http://kylin.apache.org/</link>
     <atom:link href="http://kylin.apache.org/feed.xml" rel="self" type="application/rss+xml"/>
-    <pubDate>Wed, 29 Mar 2017 06:59:03 -0700</pubDate>
-    <lastBuildDate>Wed, 29 Mar 2017 06:59:03 -0700</lastBuildDate>
+    <pubDate>Tue, 04 Apr 2017 06:59:04 -0700</pubDate>
+    <lastBuildDate>Tue, 04 Apr 2017 06:59:04 -0700</lastBuildDate>
     <generator>Jekyll v2.5.3</generator>
     
       <item>
+        <title>A new measure for Percentile precalculation</title>
+        <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;
+
+&lt;p&gt;Apache Kylin 2.0 introduces a new measure for percentile precalculation,
which aims at (sub-)second latency for &lt;strong&gt;approximate&lt;/strong&gt;
percentile analytics SQL queries. The implementation is based on the &lt;a href=&quot;https://github.com/tdunning/t-digest&quot;&gt;t-digest&lt;/a&gt;
library, released under the Apache 2.0 license, which provides a highly efficient data structure for storing
aggregation counters and an algorithm for calculating approximate percentile results.&lt;/p&gt;
+
+&lt;h3 id=&quot;percentile&quot;&gt;Percentile&lt;/h3&gt;
+&lt;p&gt;&lt;em&gt;From &lt;a href=&quot;https://en.wikipedia.org/wiki/Percentile&quot;&gt;Wikipedia&lt;/a&gt;&lt;/em&gt;:
A &lt;strong&gt;percentile&lt;/strong&gt; (or a &lt;strong&gt;centile&lt;/strong&gt;)
is a measure used in statistics indicating the value below which a given percentage of
observations in a group of observations fall. For example, the 20th percentile is the value
(or score) below which 20% of the observations may be found.&lt;/p&gt;
+
+&lt;p&gt;Apache Kylin supports a SQL syntax similar to that of Apache Hive, with
an aggregation function called &lt;strong&gt;percentile(&amp;lt;Number Column&amp;gt;,
&amp;lt;Double&amp;gt;)&lt;/strong&gt;:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span
class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seller_id&lt;/span&gt;&lt;span
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;percentile&lt;/span&gt;&lt;span
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;price&lt;/span&gt;&lt;span
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span
class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span
class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;test_kylin_fact&lt;/span&gt;
+&lt;span class=&quot;k&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;seller_id&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;h3 id=&quot;how-to-use&quot;&gt;How to use&lt;/h3&gt;
+&lt;p&gt;If you are new to &lt;em&gt;Cubes&lt;/em&gt;, please
read the &lt;a href=&quot;http://kylin.apache.org/docs20/tutorial/kylin_sample.html&quot;&gt;QuickStart&lt;/a&gt;
first to learn the basics.&lt;/p&gt;
+
+&lt;p&gt;First, add the column as a measure in the data model.&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/percentile_1.png&quot; alt=&quot;&quot;
/&gt;&lt;/p&gt;
+
+&lt;p&gt;Second, create a cube and add a PERCENTILE measure.&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/percentile_2.png&quot; alt=&quot;&quot;
/&gt;&lt;/p&gt;
+
+&lt;p&gt;Finally, build the cube and run some queries.&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/percentile_3.png&quot; alt=&quot;&quot;
/&gt;&lt;/p&gt;
+</description>
+        <pubDate>Sat, 01 Apr 2017 15:22:22 -0700</pubDate>
+        <link>http://kylin.apache.org/blog/2017/04/01/percentile-measure/</link>
+        <guid isPermaLink="true">http://kylin.apache.org/blog/2017/04/01/percentile-measure/</guid>
+        
+        
+        <category>blog</category>
+        
+      </item>
+    
+      <item>
         <title>Apache Kylin v2.0.0 Beta Announcement</title>
         <description>&lt;p&gt;The Apache Kylin community is pleased to announce
the &lt;a href=&quot;http://kylin.apache.org/download/&quot;&gt;v2.0.0 beta
package&lt;/a&gt; is ready for download and test.&lt;/p&gt;
 
@@ -599,173 +640,6 @@ kylin_sales_cube is a cube name.&lt;br /
         
         
         <category>blog</category>
-        
-      </item>
-    
-      <item>
-        <title>Use Count Distinct in Apache Kylin</title>
-        <description>&lt;p&gt;Since v.1.5.3&lt;/p&gt;
-
-&lt;h2 id=&quot;background&quot;&gt;Background&lt;/h2&gt;
-&lt;p&gt;Count Distinct is a commonly measure in OLAP analyze, usually used for uv,
etc. Apache Kylin offers two kinds of count distinct, approximately and precisely, differs
on resource and performance.&lt;/p&gt;
-
-&lt;h2 id=&quot;approximately-count-distinct&quot;&gt;Approximately Count
Distinct&lt;/h2&gt;
-&lt;p&gt;Apache Kylin implements approximately count distinct using HyperLogLog algorithm,
offered serveral precision, with the error rates from 9.75% to 1.22%. &lt;br /&gt;
-The result of measure has theorically upper limit in size, as 2^N bytes. For the max precision
N=16, the upper limit is 64KB, and the max error rate is 1.22%. &lt;br /&gt;
-This implementation’s pros is fast caculating and storage resource saving, but can’t
be used for precisely requirements.&lt;/p&gt;
-
-&lt;h2 id=&quot;precisely-count-distinct&quot;&gt;Precisely Count Distinct&lt;/h2&gt;
-&lt;p&gt;Apache Kylin also implements precisely count distinct based on bitmap. For
the data with type tiny int(byte), small int(short) and int, project the value into the bitmap
directly. For the data with type long, string and others, encode the value as String into
a dict, and project the dict id into the bitmap.&lt;br /&gt;
-The result of measure is the serialized data of bitmap, not just the count value. This makes
sure that the result is always correct with any roll-up, even across segments.&lt;br /&gt;
-This implementation’s pros is precise result, no error, but needs more storage resources.
One result size might be hundreds of MB, when the count distinct value over millions.&lt;/p&gt;
-
-&lt;h2 id=&quot;global-dictionary&quot;&gt;Global Dictionary&lt;/h2&gt;
-&lt;p&gt;Apache Kylin encodes values into dictionay at the segment level by default.
That means one value in different segments maybe encoded into different ID, then the result
of count distinct will be incorrect.&lt;/p&gt;
-
-&lt;p&gt;In v1.5.3 we introduce “Global Dictionary” with ensurance that
one value always be encoded into the same ID across different segments. Meanwhile, the capacity
of dictionary has expanded dramatically, upper to support 2 billion values in one dictionary.
It can also be used to replace the default dictionary which has 5 million values limitation.&lt;/p&gt;
-
-&lt;p&gt;Current version (v1.5.3) has no GUI for defining global dictionary yet,
you need manually edit the cube desc json like this:&lt;/p&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&quot;dictionaries&quot;:
[
-    {
-          &quot;column&quot;: &quot;SUCPAY_USERID&quot;,
-	 	   &quot;reuse&quot;: &quot;USER_ID&quot;,
-          &quot;builder&quot;: &quot;org.apache.kylin.dict.GlobalDictionaryBuilder&quot;
-    }
-]
-&lt;/code&gt;&lt;/pre&gt;
-&lt;/div&gt;
-
-&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;column&lt;/code&gt;
means the column which to be encoded, the &lt;code class=&quot;highlighter-rouge&quot;&gt;builder&lt;/code&gt;
specifies the dictionary builder, only &lt;code class=&quot;highlighter-rouge&quot;&gt;org.apache.kylin.dict.GlobalDictionaryBuilder&lt;/code&gt;
is available for now.&lt;br /&gt;
-The ‘reuse` is used to optimize the dict of more than one columns based on one dataset,
please refer the next section ‘Example’ for more details.&lt;/p&gt;
-
-&lt;p&gt;Higher version (v1.5.4 or above) provided GUI for global dictionary definetion,
the ‘Advanced Dictionaries’ part in step ‘Advanced Setting’ of cube designer.&lt;/p&gt;
-
-&lt;p&gt;The global dictionay cannot be used for dimension encoding for now, that
means if one column is used for both dimension and count distinct measure in one cube, its
dimension encoding should be others instead of dict.&lt;/p&gt;
-
-&lt;h2 id=&quot;example&quot;&gt;Example&lt;/h2&gt;
-&lt;p&gt;Here’s some example data:&lt;/p&gt;
-
-&lt;table&gt;
-  &lt;thead&gt;
-    &lt;tr&gt;
-      &lt;th style=&quot;text-align: left&quot;&gt;DT&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;USER_ID&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;FLAG1&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;FLAG2&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;USER_ID_FLAG1&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;USER_ID_FLAG2&lt;/th&gt;
-    &lt;/tr&gt;
-  &lt;/thead&gt;
-  &lt;tbody&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: left&quot;&gt;2016-06-08&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;AAA&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;AAA&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;AAA&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: left&quot;&gt;2016-06-08&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;BBB&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;BBB&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;BBB&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: left&quot;&gt;2016-06-08&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;CCC&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;NULL&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;CCC&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: left&quot;&gt;2016-06-09&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;AAA&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;NULL&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;AAA&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: left&quot;&gt;2016-06-09&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;CCC&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;CCC&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;NULL&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: left&quot;&gt;2016-06-10&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;BBB&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;NULL&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;BBB&lt;/td&gt;
-    &lt;/tr&gt;
-  &lt;/tbody&gt;
-&lt;/table&gt;
-
-&lt;p&gt;There’s basic columns &lt;code class=&quot;highlighter-rouge&quot;&gt;DT&lt;/code&gt;,
&lt;code class=&quot;highlighter-rouge&quot;&gt;USER_ID&lt;/code&gt;,
&lt;code class=&quot;highlighter-rouge&quot;&gt;FLAG1&lt;/code&gt;,
&lt;code class=&quot;highlighter-rouge&quot;&gt;FLAG2&lt;/code&gt;,
and condition columns &lt;code class=&quot;highlighter-rouge&quot;&gt;USER_ID_FLAG1=if(FLAG1=1,USER_ID,null)&lt;/code&gt;,
&lt;code class=&quot;highlighter-rouge&quot;&gt;USER_ID_FLAG2=if(FLAG2=1,USER_ID,null)&lt;/code&gt;.
Supposed the cube is builded by day, has 3 segments.&lt;/p&gt;
-
-&lt;p&gt;Without the global dictionay, the precisely count distinct in a semgent
is correct, but the roll-up acrros segments will be wrong. Here’s an example:&lt;/p&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;select
count(distinct user_id_flag1) from table where dt in (&#39;2016-06-08&#39;, &#39;2016-06-09&#39;)
-&lt;/code&gt;&lt;/pre&gt;
-&lt;/div&gt;
-&lt;p&gt;The result is 2 but not 3. The reason is that the dict in 2016-06-08 segment
is AAA=&amp;gt;1, BBB=&amp;gt;1, and the dict in 2016-06-09 segment is CCC=&amp;gt;
1.&lt;br /&gt;
-With global dictionary config as below, the dict became as AAA=&amp;gt;1, BBB=&amp;gt;2,
CCC=&amp;gt;3, that will procude correct result.&lt;br /&gt;
-&lt;code class=&quot;highlighter-rouge&quot;&gt;
-&quot;dictionaries&quot;: [
-    {
-      &quot;column&quot;: &quot;USER_ID_FLAG1&quot;,
-      &quot;builder&quot;: &quot;org.apache.kylin.dict.GlobalDictionaryBuilder&quot;
-    }
-]
-&lt;/code&gt;&lt;/p&gt;
-
-&lt;p&gt;Actually, the data of USER_ID_FLAG1 and USER_ID_FLAG2 both are a subset
of USER_ID dataset, that made the dictionary re-using possible. Just encode the USER_ID dataset,
and config USER_ID_FLAG1 and USER_ID_FLAG2 resue USER_ID dict:&lt;br /&gt;
-&lt;code class=&quot;highlighter-rouge&quot;&gt;
-&quot;dictionaries&quot;: [
-    {
-      &quot;column&quot;: &quot;USER_ID&quot;,
-      &quot;builder&quot;: &quot;org.apache.kylin.dict.GlobalDictionaryBuilder&quot;
-    },
-    {
-      &quot;column&quot;: &quot;USER_ID_FLAG1&quot;,
-      &quot;reuse&quot;: &quot;USER_ID&quot;,
-      &quot;builder&quot;: &quot;org.apache.kylin.dict.GlobalDictionaryBuilder&quot;
-    },
-    {
-      &quot;column&quot;: &quot;USER_ID_FLAG2&quot;,
-      &quot;reuse&quot;: &quot;USER_ID&quot;,
-      &quot;builder&quot;: &quot;org.apache.kylin.dict.GlobalDictionaryBuilder&quot;
-    }
-]
-&lt;/code&gt;&lt;/p&gt;
-
-&lt;h2 id=&quot;performance-tunning&quot;&gt;Performance Tunning&lt;/h2&gt;
-&lt;p&gt;When using global dictionary and the dictionary is large, the step ‘Build
Base Cuboid Data’ may took long time. That mainly caused by the dictionary cache loading
and eviction cost, since the dictionary size is bigger than mapper memory size. To solve this
problem, overwrite the cube configuration as following, adjust the mapper size to 8GB:&lt;br
/&gt;
-&lt;code class=&quot;highlighter-rouge&quot;&gt;
-kylin.job.mr.config.override.mapred.map.child.java.opts=-Xmx8g
-kylin.job.mr.config.override.mapreduce.map.memory.mb=8500
-&lt;/code&gt;&lt;/p&gt;
-
-&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;
-&lt;p&gt;Here’s some basically pricipal to decide which kind of count distinct
will be used:&lt;br /&gt;
- - If the result with error rate is acceptable, approximately way is always an better way&lt;br
/&gt;
- - If you need precise result, the only way is precisely count distinct&lt;br /&gt;
- - If you don’t need roll-up across segments (like non-partitioned cube), or the column
data type is tinyint/smallint/int, or the values count is less than 5M, just use default dictionary;
otherwise the global dictionary should be configured, and also consider the “reuse”
column optimization&lt;/p&gt;
-</description>
-        <pubDate>Mon, 01 Aug 2016 11:30:00 -0700</pubDate>
-        <link>http://kylin.apache.org/blog/2016/08/01/count-distinct-in-kylin/</link>
-        <guid isPermaLink="true">http://kylin.apache.org/blog/2016/08/01/count-distinct-in-kylin/</guid>
-        
-        
-        <category>blog</category>
         
       </item>
     

Added: kylin/site/images/blog/percentile_1.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/percentile_1.png?rev=1790118&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/percentile_1.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/percentile_2.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/percentile_2.png?rev=1790118&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/percentile_2.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/percentile_3.png
URL: http://svn.apache.org/viewvc/kylin/site/images/blog/percentile_3.png?rev=1790118&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/percentile_3.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream


