From: Ajay Chitre
Date: Sun, 5 Feb 2017 21:05:25 -0800
Subject: Re: New document: "How to optimize cube build"
To: user@kylin.apache.org

Thanks for writing this document. It's very helpful. I have the following questions:

1) The doc says: "Kylin will build dictionaries in memory (in next version this will be moved to MR)".

Which version can we expect this in? For large Cubes, this step takes a long time on the local machine. We really need to move it to the Hadoop cluster. In fact, it would be great to have an option to run it under Spark :-)

2) About the "Build N-Dimension Cuboid" step.

Does Kylin build ALL Cuboids? My understanding is:

Total no. of Cuboids = (2 to the power of # of dimensions) - 1

Correct?

So if there are 7 dimensions, there will be 127 Cuboids, right? Does Kylin create ALL of them?

I was under the impression that, after some point, Kylin will just get measures from the Base Cuboid instead of building all of them. Please explain.
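
The arithmetic behind question 2 can be sketched in a few lines of Python. This is only an illustration of the formula as stated in the question (2 to the power of the number of dimensions, minus 1); it makes no claim about what Kylin actually builds, which is exactly what the question asks. The helper names `cuboid_count` and `enumerate_cuboids` and the dimension names `D1`..`D7` are hypothetical, not Kylin APIs.

```python
from itertools import combinations


def cuboid_count(num_dimensions: int) -> int:
    """Number of non-empty dimension combinations: 2**n - 1.

    This is the formula quoted in the question; whether Kylin
    materializes every one of these cuboids is the open question.
    """
    if num_dimensions < 0:
        raise ValueError("num_dimensions must be non-negative")
    return 2 ** num_dimensions - 1


def enumerate_cuboids(dimensions):
    """List every non-empty combination of dimensions, widest first.

    The first entry uses all dimensions (the base cuboid).
    """
    dims = list(dimensions)
    return [
        combo
        for size in range(len(dims), 0, -1)
        for combo in combinations(dims, size)
    ]


if __name__ == "__main__":
    # Hypothetical dimension names, used only for illustration.
    dims = ["D1", "D2", "D3", "D4", "D5", "D6", "D7"]
    cuboids = enumerate_cuboids(dims)
    print(cuboid_count(len(dims)))  # 127
    print(len(cuboids))             # 127
    print(cuboids[0])               # the base cuboid: all seven dimensions
```

Enumerating the combinations and counting them with the closed-form formula give the same number, which is why the count doubles with every added dimension.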

Thanks.



On Sat, Feb 4, 2017 at 2:19 AM, Li Yang <liyang@apache.org> wrote:

> Feel free to update the document with different opinions. :-)
>
> On Thu, Jan 26, 2017 at 11:34 AM, ShaoFeng Shi <shaofengshi@apache.org> wrote:
>> Hi Alberto,
>>
>> Thanks for your comments! In many cases the data is imported to Hadoop in T+1 mode. Especially when each day's data is tens of GB, it is reasonable to partition the Hive table by date. The question is whether it is worth keeping a long history of data in Hive. Usually users keep only a couple of months of data in Hive. If the number of partitions exceeds the threshold in Hive, they can easily remove the oldest partitions or move them to another table. That is a common practice with Hive, I think, and it is very good to know that Hive 2.0 will solve this.

>> 2017-01-25 17:10 GMT+08:00 Alberto Ramón <a.ramonportoles@gmail.com>:
>>
>>> Be careful about partitioning by "FLIGHTDATE".
>>>
>>> From https://github.com/albertoRamon/Kylin/tree/master/KylinPerformance
>>>
>>> "Option 1: Use id_date as partition column on Hive table. This have a big problem: the Hive metastore is meant for few hundred of partitions not thousand (Hive 9452 there is an idea to solve this isn’t in progress)"
>>>
>>> Hive 2.0 will include a preview (only for testing) to solve this.

>>> 2017-01-25 9:46 GMT+01:00 ShaoFeng Shi <shaofengshi@apache.org>:
>>>
>>>> Hello,
>>>>
>>>> A new document has been added on practices for cube build. Any suggestions or comments are welcome. We can update the doc later based on feedback.
>>>>
>>>> Here is the link:
>>>> https://kylin.apache.org/docs16/howto/howto_optimize_build.html
>>>>
>>>> --
>>>> Best regards,
>>>>
>>>> Shaofeng Shi 史少锋





>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋


