Mailing-List: contact user-help@kylin.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@kylin.apache.org
MIME-Version: 1.0
In-Reply-To: <CAHRce1MqMBGuAvTqTfo8wB5GNj+k5Y2WsWa355qxvL0jyQaXYA@mail.gmail.com>
References: <CANfpUcv8swyGVGC+2OWH7WMahU7zdUerGN8MruOZ8_UsjNbYAA@mail.gmail.com>
 <CAEcyM15JG-ZzZ0j3DkisRoJz1F-aYJn8iSh9FYKtu-0qAeMsOw@mail.gmail.com>
 <CANfpUcsQQ76=xr1rzJwUR1iXt9ggrg=wOq7hwFGnGKFRO4EUsg@mail.gmail.com> <CAHRce1MqMBGuAvTqTfo8wB5GNj+k5Y2WsWa355qxvL0jyQaXYA@mail.gmail.com>
From: Ajay Chitre <chitre.ajay@gmail.com>
Date: Sun, 5 Feb 2017 21:05:25 -0800
Message-ID: <CAMhsnBBUx271Vbn4CtoCWm8VsTo729U7ONe3OZpsKCaWLR1Sag@mail.gmail.com>
Subject: Re: New document: "How to optimize cube build"
To: user@kylin.apache.org
Content-Type: multipart/alternative; boundary=94eb2c1a162055b7b80547d59481
archived-at: Mon, 06 Feb 2017 05:05:36 -0000

--94eb2c1a162055b7b80547d59481
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Thanks for writing this document. It's very helpful. I have the following
questions:

1) Doc says... "Kylin will build dictionaries in memory (in next version
this will be moved to MR)".

Which version can we expect this in? For large Cubes this process takes a
long time on the local machine. We really need to move this to the Hadoop
cluster. In fact, it would be great if we could have an option to run this
under Spark -:)

2) About the "Build N-Dimension Cuboid" step.

Does Kylin build ALL Cuboids? My understanding is:

Total no. of Cuboids = (2 to the power of # of dimensions) - 1

Correct?

So if there are 7 dimensions, there will be 127 Cuboids, right? Does Kylin
create ALL of them?
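
To make the arithmetic behind my count concrete, here is a small sketch.
It only checks the formula above by enumerating every non-empty
combination of dimensions. The function names are made up for
illustration, and it assumes each non-empty combination counts as one
Cuboid; whether Kylin counts or prunes them this way is exactly what I
am asking.

```python
from itertools import combinations


def count_cuboids_by_enumeration(num_dimensions: int) -> int:
    """Count every non-empty combination of dimensions, one at a time."""
    dims = range(num_dimensions)
    return sum(
        1
        for size in range(1, num_dimensions + 1)
        for _ in combinations(dims, size)
    )


def count_cuboids_by_formula(num_dimensions: int) -> int:
    """Closed form from this mail: (2 to the power of N) - 1."""
    return 2 ** num_dimensions - 1


if __name__ == "__main__":
    for n in (3, 7):
        # Both ways of counting should agree for any N.
        assert count_cuboids_by_enumeration(n) == count_cuboids_by_formula(n)
        print(n, count_cuboids_by_formula(n))
```

For 7 dimensions both functions give 127, which is the number I quoted
above.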

I was under the impression that, after some point, Kylin just gets
measures from the Base Cuboid instead of building all of them. Could you
please explain?

Thanks.


On Sat, Feb 4, 2017 at 2:19 AM, Li Yang <liyang@apache.org> wrote:

> Feel free to update the document with different opinions. :-)
>
> On Thu, Jan 26, 2017 at 11:34 AM, ShaoFeng Shi <shaofengshi@apache.org>
> wrote:
>
>> Hi Alberto,
>>
>> Thanks for your comments! In many cases the data is imported to Hadoop in
>> T+1 mode. Especially when each day's data is tens of GB, it is
>> reasonable to partition the Hive table by date. The question is whether it
>> is worth keeping a long history of data in Hive. Usually users keep only a
>> couple of months' data in Hive. If the number of partitions exceeds the
>> threshold in Hive, they can easily remove the oldest partitions or move
>> them to another table. I think that is a common practice with Hive, and it
>> is very good to know that Hive 2.0 will solve this.
>>
>> 2017-01-25 17:10 GMT+08:00 Alberto Ramón <a.ramonportoles@gmail.com>:
>>
>>> Be careful about partitioning by "FLIGHTDATE"
>>>
>>> From https://github.com/albertoRamon/Kylin/tree/master/KylinPerformance
>>>
>>> *"Option 1: Use id_date as partition column on Hive table. This have a
>>> big problem: the Hive metastore is meant for few hundred of partitions not
>>> thousand (Hive 9452 there is an idea to solve this isn’t in progress)*"
>>>
>>> Hive 2.0 will include a preview (for testing only) to solve this
>>>
>>> 2017-01-25 9:46 GMT+01:00 ShaoFeng Shi <shaofengshi@apache.org>:
>>>
>>>> Hello,
>>>>
>>>> A new document on cube build practices has been added. Any suggestions
>>>> or comments are welcome. We can update the doc later based on feedback.
>>>>
>>>> Here is the link:
>>>> https://kylin.apache.org/docs16/howto/howto_optimize_build.html
>>>>
>>>> --
>>>> Best regards,
>>>>
>>>> Shaofeng Shi 史少锋
>>>>
>>>>
>>>
>>
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>>
>>
>

--94eb2c1a162055b7b80547d59481--