kylin-commits mailing list archives

From luke...@apache.org
Subject svn commit: r1703083 - in /incubator/kylin/site: blog/2015/09/06/release-v1.0-incubating/index.html blog/2015/09/09/ blog/2015/09/09/fast-cubing-on-spark/ blog/2015/09/09/fast-cubing-on-spark/index.html blog/index.html feed.xml
Date Tue, 15 Sep 2015 02:29:23 GMT
Author: lukehan
Date: Tue Sep 15 02:29:22 2015
New Revision: 1703083

URL: http://svn.apache.org/r1703083
Log:
publish fast cubing blog and fix some typo

Added:
    incubator/kylin/site/blog/2015/09/09/
    incubator/kylin/site/blog/2015/09/09/fast-cubing-on-spark/
    incubator/kylin/site/blog/2015/09/09/fast-cubing-on-spark/index.html
Modified:
    incubator/kylin/site/blog/2015/09/06/release-v1.0-incubating/index.html
    incubator/kylin/site/blog/index.html
    incubator/kylin/site/feed.xml

Modified: incubator/kylin/site/blog/2015/09/06/release-v1.0-incubating/index.html
URL: http://svn.apache.org/viewvc/incubator/kylin/site/blog/2015/09/06/release-v1.0-incubating/index.html?rev=1703083&r1=1703082&r2=1703083&view=diff
==============================================================================
--- incubator/kylin/site/blog/2015/09/06/release-v1.0-incubating/index.html (original)
+++ incubator/kylin/site/blog/2015/09/06/release-v1.0-incubating/index.html Tue Sep 15 02:29:22
2015
@@ -196,11 +196,11 @@
 <p><strong>Kylin Core Improvement</strong></p>
 
 <ul>
-  <li>Dynamic Data Model has been supported for new added or removed column in data
model without rebuild cube from the beginning <a href="https://issues.apache.org/jira/browse/KYLIN-867">KYLIN-867</a></li>
+  <li>Dynamic Data Model has been added to support adding or removing columns in the
data model without rebuilding the cube from the beginning <a href="https://issues.apache.org/jira/browse/KYLIN-867">KYLIN-867</a></li>
   <li>Upgraded Apache Calcite to 1.3 for more bug fixes and new SQL functions <a
href="https://issues.apache.org/jira/browse/KYLIN-881">KYLIN-881</a></li>
   <li>Cleanup job enhanced to make sure there’s no garbage files left in OS and
HDFS/HBase after job build <a href="https://issues.apache.org/jira/browse/KYLIN-926">KYLIN-926</a></li>
   <li>Added setting option for Hive intermediate tables created by Kylin <a href="https://issues.apache.org/jira/browse/KYLIN-883">KYLIN-883</a></li>
-  <li>HBase Corprocessor enhanced to imrpove query performance <a href="https://issues.apache.org/jira/browse/KYLIN-857">KYLIN-857</a></li>
+  <li>HBase coprocessor enhanced to improve query performance <a href="https://issues.apache.org/jira/browse/KYLIN-857">KYLIN-857</a></li>
   <li>Kylin System Dashboard for usage, storage, performance <a href="https://issues.apache.org/jira/browse/KYLIN-792">KYLIN-792</a></li>
 </ul>
 
@@ -217,11 +217,11 @@
 
 <p><strong>Zeppelin Integration</strong></p>
 
-<p><a href="http://zeppelin.incubator.apache.org/">Apache Zeppelin</a>
is a web-based notebook that enables interactive data analytics. The Apache Kylin team has
contributed Kylin Interpreter which enable Zeppelin interactive with Kylin from notebook using
ANSI SQL, this interpreter could be found from Zeppelin master code repo <a href="https://github.com/apache/incubator-zeppelin/tree/master/kylin">here</a>.</p>
+<p><a href="http://zeppelin.incubator.apache.org/">Apache Zeppelin</a>
is a web-based notebook that enables interactive data analytics. The Apache Kylin team has
contributed a Kylin Interpreter, which enables Zeppelin to interact with Kylin from a notebook
using ANSI SQL. This interpreter can be found in the Zeppelin master code repo <a href="https://github.com/apache/incubator-zeppelin/tree/master/kylin">here</a>.</p>
 
 <p><strong>Upgrade</strong></p>
 
-<p>We recommend to upgrade to this version from v0.7.x or even more early version for
better performance, stablility and more clear one (most of the intermediate files will be
cleaned up automatically). Also to keep up to date with community with latest features and
supports.<br />
+<p>We recommend upgrading to this version from v0.7.x or earlier for better
performance and stability and a cleaner environment (most of the intermediate files will be cleaned
up automatically), and to keep up to date with the community's latest features and support.<br />
 Any issue or question during upgrade, please send to Apache Kylin dev mailing list: <a
href="&#109;&#097;&#105;&#108;&#116;&#111;:&#100;&#101;&#118;&#064;&#107;&#121;&#108;&#105;&#110;&#046;&#105;&#110;&#099;&#117;&#098;&#097;&#116;&#111;&#114;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;">&#100;&#101;&#118;&#064;&#107;&#121;&#108;&#105;&#110;&#046;&#105;&#110;&#099;&#117;&#098;&#097;&#116;&#111;&#114;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;</a></p>
 
 <p><em>Great thanks to everyone who contributed!</em></p>

Added: incubator/kylin/site/blog/2015/09/09/fast-cubing-on-spark/index.html
URL: http://svn.apache.org/viewvc/incubator/kylin/site/blog/2015/09/09/fast-cubing-on-spark/index.html?rev=1703083&view=auto
==============================================================================
--- incubator/kylin/site/blog/2015/09/09/fast-cubing-on-spark/index.html (added)
+++ incubator/kylin/site/blog/2015/09/09/fast-cubing-on-spark/index.html Tue Sep 15 02:29:22
2015
@@ -0,0 +1,373 @@
+<!--
+* Licensed to the Apache Software Foundation (ASF) under one
+* or more contributor license agreements.  See the NOTICE file
+* distributed with this work for additional information
+* regarding copyright ownership.  The ASF licenses this file
+* to you under the Apache License, Version 2.0 (the
+* "License"); you may not use this file except in compliance
+* with the License.  You may obtain a copy of the License at
+*
+*     http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+-->
+<!doctype html>
+<html>
+
+<head>
+  <meta charset="utf-8">
+  <meta http-equiv="X-UA-Compatible" content="IE=edge">
+  <meta name="viewport" content="width=device-width, initial-scale=1">
+
+  <title>Apache Kylin | Fast Cubing on Spark in Apache Kylin</title>
+  <meta name="description" content="Preparation">
+  <meta name="author"      content="Apache Kylin">
+  <link rel="shortcut icon" href="fav.png" type="image/png">
+
+
+
+<link rel="stylesheet" href="/assets/css/animate.css">
+<!-- Bootstrap -->
+<link rel="stylesheet" href="/assets/css/bootstrap.min.css">
+
+<!-- Fonts -->
+<!-- <link rel="stylesheet" href="http://fonts.googleapis.com/css?family=Alice|Open+Sans:400,300,700">
-->
+
+<!-- Icons -->
+<link rel="stylesheet" href="/assets/css/font-awesome.min.css">
+
+  <!-- Custom styles -->
+  <link rel="stylesheet" href="/assets/css/styles.css">
+  <link rel="stylesheet" href="/assets/css/docs.css">
+  <link rel="stylesheet" href="/assets/css/pygments.css">
+
+  <link rel="canonical" href="http://kylin.incubator.apache.org/blog/2015/09/09/fast-cubing-on-spark/">
+  <link rel="alternate" type="application/rss+xml" title="Apache Kylin" href="http://kylin.incubator.apache.org/feed.xml"
/>
+
+<!--[if lt IE 9]> <script src="assets/js/html5shiv.js"></script> <![endif]-->
+<script>
+  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
+  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
+
+  //original tracker for kylin.io
+  ga('create', 'UA-55534813-1', 'auto');
+  //new tracker for kylin.incubator.apache.org
+  ga('create', 'UA-55534813-2', 'auto', {'name':'incubator'});
+
+  ga('send', 'pageview');
+  ga('incubator.send', 'pageview');
+
+
+</script>
+<script type="text/javascript" src="/assets/js/jquery-1.9.1.min.js"></script>
+<script type="text/javascript" src="/assets/js/nside.js"></script>
+<script type="text/javascript" src="/assets/js/nnav.js"></script>
+</head>
+
+	<body>
+
+<header id="header" >
+  
+  <div id="head" class="parallax" parallax-speed="3" >
+    <div id="logo" class="text-center"> <img class="img-circle" id="circlelogo"
src="/assets/images/kylin_logo.jpg"> <span class="title" >Apache Kylin</span>
<span class="tagline">Extreme OLAP Engine for Big Data</span> 
+    </div>
+  </div>
+  
+
+  <!-- Main Menu -->
+  <nav class="navbar navbar-default" role="navigation" id="nav-wrapper">
+  <div class="container-fluid" id="nav">
+    <!--
+    <img class="img-circle" width="40px" height="40px" id="circlelogo" src="/assets/images/kylin_logo.jpg">
+    -->
+    <!-- Brand and toggle get grouped for better mobile display -->
+    <div class="navbar-header">
+      <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
+        <span class="sr-only">Toggle navigation</span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+      </button>
+     
+    </div>
+
+    <!-- Collect the nav links, forms, and other content for toggling -->
+    <div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
+      <ul class="nav navbar-nav">
+     <li><a href="/">Home</a></li>
+          <li><a href="/docs" >Docs</a></li>
+          <li><a href="/download">Download</a></li>
+          <li><a href="/community" >Community</a></li>
+          <li><a href="/development" >Development</a></li>
+          <li><a href="/blog">Blog</a></li>
+          <li><a href="/cn" >中文版</a></li>  
+          <li><a href="https://twitter.com/apachekylin" target="_blank" class="fa
fa-twitter fa-lg" title="Twitter: @ApacheKylin" ></a></li>
+          <li><a href="https://github.com/apache/incubator-kylin" target="_blank"
class="fa fa-github-alt fa-lg" title="Github: apache/incubator-kylin" ></a></li>
         
+          <li><a href="https://www.facebook.com/kylinio" target="_blank" class="fa
fa-facebook fa-lg" title="Facebook: kylin.io" ></a></li>   
+      </ul>      
+    </div><!-- /.navbar-collapse -->
+  </div><!-- /.container-fluid -->
+</nav>
+ </header>
+
+		<div class="page-content">
+			<header style=" padding:2em 0 0 0">
+			<div class="container" >
+				<h4 class="section-title"><span>Kylin Technical Blog</span></h4>
+			</div>
+			</header>
+		</div>
+
+		<div class="container">
+			<div>
+
+<div class="post" style=" padding:2em 4em 4em 4em">
+
+  <header class="post-header">
+    <h1 class="post-title">Fast Cubing on Spark in Apache Kylin</h1>
+    <p class="post-meta" >Sep 9, 2015 • Qianhao Zhou</p>
+  </header>
+
+  <article class="post-content" >
+    <h2 id="preparation">Preparation</h2>
+
+<p>To keep the proof-of-concept (POC) phase as simple as possible, a standalone Spark cluster is
the best choice.<br />
+The environment is set up as follows:</p>
+
+<ol>
+  <li>
+    <p>Hadoop sandbox (Hortonworks HDP 2.2.0)</p>
+
+    <p>(8 cores, 16G) * 1</p>
+  </li>
+  <li>
+    <p>Spark (1.4.1)</p>
+
+    <p>master:(4 cores, 8G)</p>
+
+    <p>worker:(4 cores, 8G) * 2</p>
+  </li>
+</ol>
+
+<p>The Hadoop configuration files should also be present in SPARK_HOME/conf.</p>
+
+<h2 id="fast-cubing-implementation-on-spark">Fast Cubing Implementation on Spark</h2>
+
+<p>As a computation framework, Spark provides much richer operators than MapReduce,
and some of them are well suited to the cubing algorithm, for instance <strong>aggregate</strong>.</p>
+
+<p>Following the <a href="http://kylin.incubator.apache.org/blog/2015/08/15/fast-cubing/"
title="Fast Cubing Algorithm in Apache Kylin">Fast cubing algorithm</a>, the implementation contains
several steps:</p>
+
+<ol>
+  <li>build dictionary</li>
+  <li>calculate region split for HBase</li>
+  <li>build &amp; output cuboid data</li>
+</ol>
+
+<hr />
+
+<p><strong>build dictionary</strong></p>
+
+<p>To build a dictionary, the distinct values of a column are needed, which the new
<strong><em>DataFrame</em></strong> API already provides (since
Spark 1.3.0).</p>
+
+<p>So after loading the data from Hive through SparkSQL, it is quite natural to use
this API directly to build the dictionary.</p>
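+
+<p>As a rough illustration of this step, the sketch below builds a dictionary from the distinct values of one column in plain Python rather than through the Spark DataFrame API. The function names <code>build_dictionary</code> and <code>encode</code>, the sample rows, and the choice to assign ids in sorted order are assumptions made for the example; this post does not describe Kylin's actual dictionary format.</p>

```python
def build_dictionary(column_values):
    """Map each distinct value to a compact integer id (sorted order is an assumption)."""
    distinct = sorted(set(column_values))
    return {value: idx for idx, value in enumerate(distinct)}


def encode(column_values, dictionary):
    """Replace every raw value with its dictionary id."""
    return [dictionary[value] for value in column_values]


# Hypothetical sample column; in the real job the values would come from Hive.
rows = ["US", "CN", "US", "DE", "CN"]
dictionary = build_dictionary(rows)
encoded = encode(rows, dictionary)
```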
+
+<hr />
+
+<p><strong>calculate region split</strong></p>
+
+<p>To calculate the distribution of all cuboids, Kylin uses a HyperLogLog implementation.
Each record will have a counter, whose size is 16KB by default, so shuffling these counters
across the cluster would be very expensive.</p>
+
+<p>Spark provides an operator, <strong><em>aggregate</em></strong>,
that reduces the shuffle size. It first runs a map-reduce phase locally, and then performs another
round of reduce to merge the data from each node.</p>
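+
+<p>The two-phase behaviour can be sketched in plain Python, without a Spark cluster. Everything here is an assumption made for illustration: the toy 16-register counter stands in for Kylin's HyperLogLog counter, and <code>aggregate</code>, <code>add</code>, and <code>merge</code> are hypothetical helpers that only emulate the semantics of a per-partition fold followed by a merge. The point is that one counter per partition, rather than one per record, has to be merged across nodes.</p>

```python
import hashlib
from functools import reduce

NUM_REGISTERS = 16  # toy size chosen for the example; real counters are far larger


def new_counter():
    return [0] * NUM_REGISTERS


def add(counter, value):
    """Sequence step: fold one record into a partition-local counter."""
    h = int.from_bytes(hashlib.md5(str(value).encode()).digest()[:8], "big")
    idx = h % NUM_REGISTERS
    rest = h // NUM_REGISTERS
    rank = 1
    while rest & 1 == 0 and rank < 60:
        rank += 1
        rest >>= 1
    out = list(counter)
    out[idx] = max(out[idx], rank)
    return out


def merge(a, b):
    """Combine step: merge two counters register by register."""
    return [max(x, y) for x, y in zip(a, b)]


def aggregate(partitions, zero, seq_op, comb_op):
    """Emulate aggregate: fold each partition locally, then merge the partial results."""
    partials = [reduce(seq_op, part, zero()) for part in partitions]
    return reduce(comb_op, partials, zero())


# Three hypothetical partitions of records.
partitions = [["a", "b", "c"], ["b", "d"], ["e", "a"]]
merged = aggregate(partitions, new_counter, add, merge)

# Folding every record into a single counter gives the same registers.
flat = [value for part in partitions for value in part]
single_pass = reduce(add, flat, new_counter())
```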
+
+<hr />
+
+<p><strong>build &amp; output cuboid data</strong></p>
+
+<p>To build a cube, Kylin requires a small batch of data that fits into memory
all at once.</p>
+
+<p>Previously, in the MapReduce implementation, Kylin leveraged the life-cycle callback
<strong>cleanup</strong> to gather all the input together as a batch. This cannot
be applied directly to the map and reduce operators in Spark, which have no such life-cycle
callback.</p>
+
+<p>However, Spark provides an operator, <strong><em>glom</em></strong>,
which coalesces all elements within each partition into an array, which is exactly what Kylin
needs to build a small batch.</p>
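+
+<p>A minimal plain-Python sketch of this batching idea follows. It is not Spark code: <code>glom</code> here is a local emulation of the operator's semantics, and <code>cube_batch</code> is a hypothetical stand-in for the per-batch cubing logic (it simply sums a measure per dimension value), since the actual Fast Cubing code is not shown in this post.</p>

```python
def glom(partitions):
    """Emulate glom: materialize each partition's records as one in-memory list."""
    return [list(part) for part in partitions]


def cube_batch(batch):
    """Hypothetical per-batch step: sum the measure for each dimension value."""
    totals = {}
    for dimension, measure in batch:
        totals[dimension] = totals.get(dimension, 0) + measure
    return totals


# Two hypothetical partitions, each an iterator of (dimension, measure) records.
partitions = [
    iter([("US", 10), ("CN", 5), ("US", 1)]),
    iter([("DE", 7), ("CN", 2)]),
]
batches = glom(partitions)
results = [cube_batch(batch) for batch in batches]
```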
+
+<p>Once the batch data is ready, we can just apply the Fast Cubing algorithm.</p>
+
+<p>Then the Spark API <strong><em>saveAsNewAPIHadoopFile</em></strong>
allows us to write HFiles to HDFS and bulk load them into HBase.</p>
+
+<h2 id="statistics">Statistics</h2>
+
+<p>We used the sample data provided with Kylin to build the cube; the total record count is 10000.</p>
+
+<p>Below are the results (the system environments are described above):</p>
+<table>
+    <tr>
+        <td></td>
+        <td>Spark</td>
+        <td>MR</td>
+    </tr>
+    <tr>
+        <td>Duration</td>
+        <td>5.5 min</td>
+        <td>10+ min</td>
+    </tr>
+</table>
+
+<h2 id="issues">Issues</h2>
+
+<p>HDP 2.2+ requires Hive 0.14.0, while Spark 1.3.0 only supports Hive 0.13.0,
so there are several compatibility problems in hive-site.xml that we need to fix.</p>
+
+<ol>
+  <li>
+    <p>some time-related settings</p>
+
+    <p>Several settings have default values in Hive 0.14.0 that cannot be parsed
in 0.13.0. For example, the default value of <strong>hive.metastore.client.connect.retry.delay</strong>
is <strong>5s</strong>, but in Hive 0.13.0 this value must be
a Long. So you have to manually change it from <strong>5s</strong>
to <strong>5</strong>.</p>
+  </li>
+  <li>
+    <p>hive.security.authorization.manager</p>
+
+    <p>If you have enabled this configuration, its default value is <strong>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory</strong>,
which is newly introduced in Hive 0.14.0. This means you have to use another implementation,
such as <strong>org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider</strong>.</p>
+  </li>
+  <li>
+    <p>hive.execution.engine</p>
+
+    <p>In Hive 0.14.0, the default value of <strong>hive.execution.engine</strong>
is <strong>tez</strong>; change it to <strong>mr</strong> in the Spark
classpath, otherwise there will be a NoClassDefFoundError.</p>
+  </li>
+</ol>
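+
+<p>The three adjustments above could be applied with a small script such as the following sketch. It uses only the Python standard library; the function name <code>patch_hive_site</code> and the inline sample document are assumptions made for the example, and it only handles the settings named in this post, so other time-related settings would need the same treatment.</p>

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical hive-site.xml content, trimmed to the settings discussed above.
SAMPLE = """<configuration>
  <property><name>hive.metastore.client.connect.retry.delay</name><value>5s</value></property>
  <property><name>hive.security.authorization.manager</name><value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory</value></property>
  <property><name>hive.execution.engine</name><value>tez</value></property>
</configuration>"""

OLD_AUTH = "org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory"
NEW_AUTH = "org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider"


def patch_hive_site(xml_text):
    """Return the parsed config with the three Hive 0.13-incompatible values rewritten."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.find("value")
        if value is None or value.text is None:
            continue
        if name == "hive.metastore.client.connect.retry.delay" and re.fullmatch(r"\d+s", value.text):
            value.text = value.text[:-1]  # "5s" -> "5"
        elif name == "hive.security.authorization.manager" and value.text == OLD_AUTH:
            value.text = NEW_AUTH
        elif name == "hive.execution.engine" and value.text == "tez":
            value.text = "mr"
    return root


patched = patch_hive_site(SAMPLE)
settings = {p.findtext("name"): p.findtext("value") for p in patched.findall("property")}
```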
+
+<p>NOTE: Spark 1.4.0 has a <a href="https://issues.apache.org/jira/browse/SPARK-8368">bug</a>
which leads to a ClassNotFoundException. It has been fixed in Spark 1.4.1, so if you
are planning to run on Spark 1.4.0, you may need to upgrade to 1.4.1.</p>
+
+<p>Last but not least, when you run a Spark application on YARN, make sure
that hive-site.xml and hbase-site.xml are in HADOOP_CONF_DIR or YARN_CONF_DIR, since
by default HDP places these configuration files in separate directories.</p>
+
+<h2 id="next-move">Next move</h2>
+
+<p>Clearly, the above is not a fair comparison: the environments are not the same, the test data
size is too small, etc.</p>
+
+<p>However, it showed that it is practical to migrate from MR to Spark, and some useful
operators in Spark will save us quite a lot of code.</p>
+
+<p>So our next move is to set up a cluster and run the benchmark on a real data set
for both MR and Spark.</p>
+
+<p>We will update the benchmark once we have finished; please stay tuned.</p>
+
+  </article>
+
+</div>
+
+
+
+
+
+			</div>
+		</div>		
+
+<footer id="underfooter">
+  <div class="container">
+    <div class="row">
+      <div class="col-md-12 widget" >
+        <div class="widget-body" style="text-align:center">
+          <div>
+          Apache Kylin is an effort undergoing incubation at The Apache Software Foundation
(ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects
until a further review indicates that the infrastructure, communications, and decision making
process have stabilized in a manner consistent with other successful ASF projects. While incubation
status is not necessarily a reflection of the completeness or stability of the code, it does
indicate that the project has yet to be fully endorsed by the ASF.
+          </div>
+        <a href="http://www.apache.org">
+            <img id="asf-logo" alt="Apache Software Foundation" src="/assets/images/feather-small.gif">
+        </a>
+        <a href="http://incubator.apache.org/">
+            <img id="incubator-logo" alt="Apache Incubator" src="/assets/images/egg-logo.png">
+        </a>
+
+        <div id="copyright">
+            <p>Copyright &#169; 2014 The Apache Software Foundation, Licensed under
the <a
+                    href="http://www.apache.org/licenses/LICENSE-2.0">Apache License,
Version 2.0</a>.<br/>Apache, the
+                Apache feather logo, and the Apache Incubator project logo are trademarks
of The Apache Software
+                Foundation.</p>
+        </div>
+        </div>
+      </div>
+    </div>
+    <!-- /row of widgets --> 
+
+  </div>
+  <div></div>
+  
+</footer>
+
+	<script src="/assets/js/jquery-1.9.1.min.js"></script> 
+	<script src="/assets/js/bootstrap.min.js"></script> 
+	<script src="/assets/js/main.js"></script>
+	</body>
+</html>
+
+
+
+

Modified: incubator/kylin/site/blog/index.html
URL: http://svn.apache.org/viewvc/incubator/kylin/site/blog/index.html?rev=1703083&r1=1703082&r2=1703083&view=diff
==============================================================================
--- incubator/kylin/site/blog/index.html (original)
+++ incubator/kylin/site/blog/index.html Tue Sep 15 02:29:22 2015
@@ -174,6 +174,12 @@
             
             <li>
         <h2 align="left">
+          <a class="post-link" href="/blog/2015/09/09/fast-cubing-on-spark/">Fast Cubing
on Spark in Apache Kylin</a></h2><div align="left" class="post-meta">posted:
Sep 9, 2015</div>
+        
+      </li>
+    
+            <li>
+        <h2 align="left">
           <a class="post-link" href="/blog/2015/09/06/release-v1.0-incubating/">Apache
Kylin 1.0 (incubating) Release Announcement</a></h2><div align="left" class="post-meta">posted:
Sep 6, 2015</div>
         
       </li>

Modified: incubator/kylin/site/feed.xml
URL: http://svn.apache.org/viewvc/incubator/kylin/site/feed.xml?rev=1703083&r1=1703082&r2=1703083&view=diff
==============================================================================
--- incubator/kylin/site/feed.xml (original)
+++ incubator/kylin/site/feed.xml Tue Sep 15 02:29:22 2015
@@ -19,11 +19,140 @@
     <description>Apache Kylin Home</description>
     <link>http://kylin.incubator.apache.org/</link>
     <atom:link href="http://kylin.incubator.apache.org/feed.xml" rel="self" type="application/rss+xml"/>
-    <pubDate>Mon, 07 Sep 2015 18:48:58 -0700</pubDate>
-    <lastBuildDate>Mon, 07 Sep 2015 18:48:58 -0700</lastBuildDate>
+    <pubDate>Mon, 14 Sep 2015 19:28:08 -0700</pubDate>
+    <lastBuildDate>Mon, 14 Sep 2015 19:28:08 -0700</lastBuildDate>
     <generator>Jekyll v2.5.3</generator>
     
       <item>
+        <title>Fast Cubing on Spark in Apache Kylin</title>
+        <description>&lt;h2 id=&quot;preparation&quot;&gt;Preparation&lt;/h2&gt;
+
+&lt;p&gt;To keep the proof-of-concept (POC) phase as simple as possible, a standalone Spark cluster
is the best choice.&lt;br /&gt;
+The environment is set up as follows:&lt;/p&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;
+    &lt;p&gt;Hadoop sandbox (Hortonworks HDP 2.2.0)&lt;/p&gt;
+
+    &lt;p&gt;(8 cores, 16G) * 1&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Spark (1.4.1)&lt;/p&gt;
+
+    &lt;p&gt;master:(4 cores, 8G)&lt;/p&gt;
+
+    &lt;p&gt;worker:(4 cores, 8G) * 2&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;p&gt;The Hadoop configuration files should also be present in SPARK_HOME/conf.&lt;/p&gt;
+
+&lt;h2 id=&quot;fast-cubing-implementation-on-spark&quot;&gt;Fast Cubing
Implementation on Spark&lt;/h2&gt;
+
+&lt;p&gt;As a computation framework, Spark provides much richer operators than
MapReduce, and some of them are well suited to the cubing algorithm, for instance &lt;strong&gt;aggregate&lt;/strong&gt;.&lt;/p&gt;
+
+&lt;p&gt;Following the &lt;a href=&quot;http://kylin.incubator.apache.org/blog/2015/08/15/fast-cubing/&quot;
title=&quot;Fast Cubing Algorithm in Apache Kylin&quot;&gt;Fast cubing algorithm&lt;/a&gt;,
the implementation contains several steps:&lt;/p&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;build dictionary&lt;/li&gt;
+  &lt;li&gt;calculate region split for HBase&lt;/li&gt;
+  &lt;li&gt;build &amp;amp; output cuboid data&lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;hr /&gt;
+
+&lt;p&gt;&lt;strong&gt;build dictionary&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;To build a dictionary, the distinct values of a column are needed,
which the new &lt;strong&gt;&lt;em&gt;DataFrame&lt;/em&gt;&lt;/strong&gt;
API already provides (since Spark 1.3.0).&lt;/p&gt;
+
+&lt;p&gt;So after loading the data from Hive through SparkSQL, it is quite natural
to use this API directly to build the dictionary.&lt;/p&gt;
+
+&lt;hr /&gt;
+
+&lt;p&gt;&lt;strong&gt;calculate region split&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;To calculate the distribution of all cuboids, Kylin uses a HyperLogLog
implementation. Each record will have a counter, whose size is 16KB by default, so
shuffling these counters across the cluster would be very expensive.&lt;/p&gt;
+
+&lt;p&gt;Spark provides an operator, &lt;strong&gt;&lt;em&gt;aggregate&lt;/em&gt;&lt;/strong&gt;,
that reduces the shuffle size. It first runs a map-reduce phase locally, and then performs another
round of reduce to merge the data from each node.&lt;/p&gt;
+
+&lt;hr /&gt;
+
+&lt;p&gt;&lt;strong&gt;build &amp;amp; output cuboid data&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;To build a cube, Kylin requires a small batch of data that fits into
memory all at once.&lt;/p&gt;
+
+&lt;p&gt;Previously, in the MapReduce implementation, Kylin leveraged the life-cycle callback
&lt;strong&gt;cleanup&lt;/strong&gt; to gather all the input together as a
batch. This cannot be applied directly to the map and reduce operators in Spark, which
have no such life-cycle callback.&lt;/p&gt;
+
+&lt;p&gt;However, Spark provides an operator, &lt;strong&gt;&lt;em&gt;glom&lt;/em&gt;&lt;/strong&gt;,
which coalesces all elements within each partition into an array, which is exactly what Kylin
needs to build a small batch.&lt;/p&gt;
+
+&lt;p&gt;Once the batch data is ready, we can just apply the Fast Cubing algorithm.&lt;/p&gt;
+
+&lt;p&gt;Then the Spark API &lt;strong&gt;&lt;em&gt;saveAsNewAPIHadoopFile&lt;/em&gt;&lt;/strong&gt;
allows us to write HFiles to HDFS and bulk load them into HBase.&lt;/p&gt;
+
+&lt;h2 id=&quot;statistics&quot;&gt;Statistics&lt;/h2&gt;
+
+&lt;p&gt;We used the sample data provided with Kylin to build the cube; the total record count
is 10000.&lt;/p&gt;
+
+&lt;p&gt;Below are the results (the system environments are described above):&lt;/p&gt;
+&lt;table&gt;
+    &lt;tr&gt;
+        &lt;td&gt;&lt;/td&gt;
+        &lt;td&gt;Spark&lt;/td&gt;
+        &lt;td&gt;MR&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+        &lt;td&gt;Duration&lt;/td&gt;
+        &lt;td&gt;5.5 min&lt;/td&gt;
+        &lt;td&gt;10+ min&lt;/td&gt;
+    &lt;/tr&gt;
+&lt;/table&gt;
+
+&lt;h2 id=&quot;issues&quot;&gt;Issues&lt;/h2&gt;
+
+&lt;p&gt;HDP 2.2+ requires Hive 0.14.0, while Spark 1.3.0 only supports Hive
0.13.0, so there are several compatibility problems in hive-site.xml that we need to fix.&lt;/p&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;
+    &lt;p&gt;some time-related settings&lt;/p&gt;
+
+    &lt;p&gt;Several settings have default values in Hive 0.14.0 that cannot
be parsed in 0.13.0. For example, the default value of &lt;strong&gt;hive.metastore.client.connect.retry.delay&lt;/strong&gt;
is &lt;strong&gt;5s&lt;/strong&gt;, but in Hive 0.13.0
this value must be a Long. So you have to manually change it from
&lt;strong&gt;5s&lt;/strong&gt; to &lt;strong&gt;5&lt;/strong&gt;.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;hive.security.authorization.manager&lt;/p&gt;
+
+    &lt;p&gt;If you have enabled this configuration, its default value is &lt;strong&gt;org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory&lt;/strong&gt;,
which is newly introduced in Hive 0.14.0. This means you have to use another implementation,
such as &lt;strong&gt;org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider&lt;/strong&gt;.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;hive.execution.engine&lt;/p&gt;
+
+    &lt;p&gt;In Hive 0.14.0, the default value of &lt;strong&gt;hive.execution.engine&lt;/strong&gt;
is &lt;strong&gt;tez&lt;/strong&gt;; change it to &lt;strong&gt;mr&lt;/strong&gt;
in the Spark classpath, otherwise there will be a NoClassDefFoundError.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;p&gt;NOTE: Spark 1.4.0 has a &lt;a href=&quot;https://issues.apache.org/jira/browse/SPARK-8368&quot;&gt;bug&lt;/a&gt;
which leads to a ClassNotFoundException. It has been fixed in Spark 1.4.1, so if you
are planning to run on Spark 1.4.0, you may need to upgrade to 1.4.1.&lt;/p&gt;
+
+&lt;p&gt;Last but not least, when you run a Spark application on YARN, make
sure that hive-site.xml and hbase-site.xml are in HADOOP_CONF_DIR or YARN_CONF_DIR,
since by default HDP places these configuration files in separate directories.&lt;/p&gt;
+
+&lt;h2 id=&quot;next-move&quot;&gt;Next move&lt;/h2&gt;
+
+&lt;p&gt;Clearly, the above is not a fair comparison: the environments are not the same,
the test data size is too small, etc.&lt;/p&gt;
+
+&lt;p&gt;However, it showed that it is practical to migrate from MR to Spark, and
some useful operators in Spark will save us quite a lot of code.&lt;/p&gt;
+
+&lt;p&gt;So our next move is to set up a cluster and run the benchmark on a real
data set for both MR and Spark.&lt;/p&gt;
+
+&lt;p&gt;We will update the benchmark once we have finished; please stay tuned.&lt;/p&gt;
+</description>
+        <pubDate>Wed, 09 Sep 2015 08:28:00 -0700</pubDate>
+        <link>http://kylin.incubator.apache.org/blog/2015/09/09/fast-cubing-on-spark/</link>
+        <guid isPermaLink="true">http://kylin.incubator.apache.org/blog/2015/09/09/fast-cubing-on-spark/</guid>
+        
+        
+        <category>blog</category>
+        
+      </item>
+    
+      <item>
         <title>Apache Kylin 1.0 (incubating) Release Announcement</title>
         <description>&lt;p&gt;The Apache Kylin team is pleased to announce
the release of Apache Kylin v1.0 (incubating). Apache Kylin is an open source Distributed
Analytics Engine designed to provide SQL interface and multi-dimensional analysis (OLAP) on
Hadoop supporting extremely large datasets.&lt;/p&gt;
 
@@ -36,11 +165,11 @@
 &lt;p&gt;&lt;strong&gt;Kylin Core Improvement&lt;/strong&gt;&lt;/p&gt;
 
 &lt;ul&gt;
-  &lt;li&gt;Dynamic Data Model has been supported for new added or removed column
in data model without rebuild cube from the beginning &lt;a href=&quot;https://issues.apache.org/jira/browse/KYLIN-867&quot;&gt;KYLIN-867&lt;/a&gt;&lt;/li&gt;
+  &lt;li&gt;Dynamic Data Model has been added to support adding or removing columns
in the data model without rebuilding the cube from the beginning &lt;a href=&quot;https://issues.apache.org/jira/browse/KYLIN-867&quot;&gt;KYLIN-867&lt;/a&gt;&lt;/li&gt;
   &lt;li&gt;Upgraded Apache Calcite to 1.3 for more bug fixes and new SQL functions
&lt;a href=&quot;https://issues.apache.org/jira/browse/KYLIN-881&quot;&gt;KYLIN-881&lt;/a&gt;&lt;/li&gt;
   &lt;li&gt;Cleanup job enhanced to make sure there’s no garbage files left
in OS and HDFS/HBase after job build &lt;a href=&quot;https://issues.apache.org/jira/browse/KYLIN-926&quot;&gt;KYLIN-926&lt;/a&gt;&lt;/li&gt;
   &lt;li&gt;Added setting option for Hive intermediate tables created by Kylin &lt;a
href=&quot;https://issues.apache.org/jira/browse/KYLIN-883&quot;&gt;KYLIN-883&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;HBase Corprocessor enhanced to imrpove query performance &lt;a href=&quot;https://issues.apache.org/jira/browse/KYLIN-857&quot;&gt;KYLIN-857&lt;/a&gt;&lt;/li&gt;
+  &lt;li&gt;HBase coprocessor enhanced to improve query performance &lt;a href=&quot;https://issues.apache.org/jira/browse/KYLIN-857&quot;&gt;KYLIN-857&lt;/a&gt;&lt;/li&gt;
   &lt;li&gt;Kylin System Dashboard for usage, storage, performance &lt;a href=&quot;https://issues.apache.org/jira/browse/KYLIN-792&quot;&gt;KYLIN-792&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 
@@ -57,11 +186,11 @@
 
 &lt;p&gt;&lt;strong&gt;Zeppelin Integration&lt;/strong&gt;&lt;/p&gt;
 
-&lt;p&gt;&lt;a href=&quot;http://zeppelin.incubator.apache.org/&quot;&gt;Apache
Zeppelin&lt;/a&gt; is a web-based notebook that enables interactive data analytics.
The Apache Kylin team has contributed Kylin Interpreter which enable Zeppelin interactive
with Kylin from notebook using ANSI SQL, this interpreter could be found from Zeppelin master
code repo &lt;a href=&quot;https://github.com/apache/incubator-zeppelin/tree/master/kylin&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
+&lt;p&gt;&lt;a href=&quot;http://zeppelin.incubator.apache.org/&quot;&gt;Apache
Zeppelin&lt;/a&gt; is a web-based notebook that enables interactive data analytics.
The Apache Kylin team has contributed a Kylin Interpreter, which enables Zeppelin to interact
with Kylin from a notebook using ANSI SQL. This interpreter can be found in the Zeppelin master
code repo &lt;a href=&quot;https://github.com/apache/incubator-zeppelin/tree/master/kylin&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
 
 &lt;p&gt;&lt;strong&gt;Upgrade&lt;/strong&gt;&lt;/p&gt;
 
-&lt;p&gt;We recommend to upgrade to this version from v0.7.x or even more early version
for better performance, stablility and more clear one (most of the intermediate files will
be cleaned up automatically). Also to keep up to date with community with latest features
and supports.&lt;br /&gt;
+&lt;p&gt;We recommend upgrading to this version from v0.7.x or earlier
for better performance and stability and a cleaner environment (most of the intermediate files will be cleaned
up automatically), and to keep up to date with the community's latest features and support.&lt;br /&gt;
 Any issue or question during upgrade, please send to Apache Kylin dev mailing list: &lt;a
href=&quot;&amp;#109;&amp;#097;&amp;#105;&amp;#108;&amp;#116;&amp;#111;:&amp;#100;&amp;#101;&amp;#118;&amp;#064;&amp;#107;&amp;#121;&amp;#108;&amp;#105;&amp;#110;&amp;#046;&amp;#105;&amp;#110;&amp;#099;&amp;#117;&amp;#098;&amp;#097;&amp;#116;&amp;#111;&amp;#114;&amp;#046;&amp;#097;&amp;#112;&amp;#097;&amp;#099;&amp;#104;&amp;#101;&amp;#046;&amp;#111;&amp;#114;&amp;#103;&quot;&gt;&amp;#100;&amp;#101;&amp;#118;&amp;#064;&amp;#107;&amp;#121;&amp;#108;&amp;#105;&amp;#110;&amp;#046;&amp;#105;&amp;#110;&amp;#099;&amp;#117;&amp;#098;&amp;#097;&amp;#116;&amp;#111;&amp;#114;&amp;#046;&amp;#097;&amp;#112;&amp;#097;&amp;#099;&amp;#104;&amp;#101;&amp;#046;&amp;#111;&amp;#114;&amp;#103;&lt;/a&gt;&lt;/p&gt;
 
 &lt;p&gt;&lt;em&gt;Great thanks to everyone who contributed!&lt;/em&gt;&lt;/p&gt;


