Mailing-List: contact user-help@kudu.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@kudu.incubator.apache.org
Delivered-To: mailing list user@kudu.incubator.apache.org
MIME-Version: 1.0
Date: Sat, 7 May 2016 13:28:05 -0700
Subject: Re: Partition and Split rows
From: Sand Stone
To: user@kudu.incubator.apache.org
Content-Type:
multipart/alternative; boundary=089e015372aca890f70532466992
archived-at: Sat, 07 May 2016 20:28:17 -0000

--089e015372aca890f70532466992
Content-Type: text/plain; charset=UTF-8

Thanks for sharing, Dan. The diagrams clearly explained how the current
system works.

As for what I have in mind: take a schema of <host,metric,time,...>. Say I
am interested in the data for the past 5 mins, 10 mins, etc., or in
aggregates at 5-min intervals over the past 3 days, 7 days, ... It looks
like I need to introduce a special 5-min bar column and range partition on
that column to spread the data across the tablet servers, so that I can
leverage parallel filtering.

The cost of this extra column (INT8) is not ideal, but not too bad either
(storage-wise, compression should do wonders). So I am wondering whether it
would be better to accept "functions" as row splits instead of only
constants. Of course, if the business requires dropping down to a 1-min
bar, the data has to be re-sharded. So a more cost-effective way of doing
this on a production cluster would be good.

On Sat, May 7, 2016 at 8:50 AM, Dan Burkert <dan@cloudera.com> wrote:

> Hi Sand,
>
> I've been working on some diagrams to help explain some of the more
> advanced partitioning types; it's attached. Still pretty rough at this
> point, but the goal is to clean it up and move it into the Kudu
> documentation proper. I'm interested to hear what kind of time series you
> are interested in Kudu for. I'm tasked with improving Kudu for time
> series; you can follow progress here. If you have any
> additional ideas I'd love to hear them. You may also be interested in a
> small project that JD and I have been working on in the past week to
> build an OpenTSDB-style store on top of Kudu; you can find it here.
> Still quite feature limited at this point.
>
> - Dan
>
> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sand.m.stone@gmail.com> wrote:
>
>> Thanks. Will read.
>>
>> Given that I am researching time series data, row locality is crucial :-)
>>
>>
>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>> wrote:
>>
>>> We do have non-covering range partitions coming in the next few months,
>>> here's the design (in review):
>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>
>>> The "Background & Motivation" section should give you a good idea of why
>>> I'm mentioning this.
>>>
>>> Meanwhile, if you don't need row locality, using hash partitioning could
>>> be good enough.
>>>
>>> J-D
>>>
>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sand.m.stone@gmail.com>
>>> wrote:
>>>
>>>> Makes sense.
>>>>
>>>> Yeah, it would be cool if users could specify/control the split rows
>>>> after the table is created. For now, I have to "think ahead" to
>>>> pre-create the range buckets.
>>>>
>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>>>> wrote:
>>>>
>>>>> You will only get 1 tablet and no data distribution, which is bad.
>>>>>
>>>>> That's also how HBase works, but it will split regions as you insert
>>>>> data and eventually you'll get some data distribution even if it doesn't
>>>>> start in an ideal situation. Tablet splitting will come later for Kudu.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> One more question: how does the range partition work if I don't
>>>>>> specify the split rows?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks, Misty. The "advanced" Impala example helped.
>>>>>>>
>>>>>>> I was just reading the Java API, CreateTableOptions.java; it's
>>>>>>> unclear how the range partition column names are associated with the
>>>>>>> partial row params in the addSplitRow API.
>>>>>>>
>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>>>
>>>>>>>> Hi Sand,
>>>>>>>>
>>>>>>>> Please have a look at
>>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>>> and see if it is helpful to you.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Misty
>>>>>>>>
>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I know
>>>>>>>>> from some docs that this is currently for pre-creating the table. I am
>>>>>>>>> researching how to partition (hash+range) some time series test data.
>>>>>>>>>
>>>>>>>>> Is there an example, or notes somewhere I could read up on?
>>>>>>>>>
>>>>>>>>> Thanks much.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

--089e015372aca890f70532466992
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--089e015372aca890f70532466992--
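
The 5-min bar column and the hash+range layout discussed in the thread can be sketched in a few lines. This is a toy model in plain Python, not Kudu client code: `bar_of`, `range_index`, `hash_bucket`, and `locate` are hypothetical names, CRC32 is only a stand-in for whatever hash Kudu uses internally, and each split point is assumed to be the inclusive lower bound of the range to its right.

```python
from bisect import bisect_right
from typing import Sequence, Tuple
from zlib import crc32

BAR_SECONDS = 5 * 60  # bar width in seconds (assumed 5-min bars)


def bar_of(ts: int, bar_seconds: int = BAR_SECONDS) -> int:
    """Start of the bar containing the epoch-second timestamp `ts`."""
    return ts - ts % bar_seconds


def range_index(bar: int, splits: Sequence[int]) -> int:
    """Index of the range partition holding `bar`.

    `splits` must be sorted ascending. Assumption: a split point is the
    inclusive lower bound of the range to its right, so N split points
    yield N + 1 ranges.
    """
    return bisect_right(splits, bar)


def hash_bucket(host: str, metric: str, buckets: int) -> int:
    """Deterministic bucket for (host, metric); CRC32 is only a stand-in."""
    return crc32(f"{host}\0{metric}".encode("utf-8")) % buckets


def locate(host: str, metric: str, ts: int,
           splits: Sequence[int], buckets: int) -> Tuple[int, int]:
    """(hash bucket, range index) pair naming one tablet in the toy model."""
    return hash_bucket(host, metric, buckets), range_index(bar_of(ts), splits)


if __name__ == "__main__":
    splits = [1462652400, 1462652700, 1462653000]  # constant split points
    print(bar_of(1462652885))      # start of the 5-min bar
    print(bar_of(1462652885, 60))  # start of the 1-min bar: a different key
    print(locate("host-a", "cpu", 1462652885, splits, buckets=4))
```

Because the bar width is baked into the stored key, moving from 5-min to 1-min bars changes every key, which is the re-sharding cost the top message describes. With an empty `splits` list every row maps to range 0, which matches the single-tablet case J-D mentions for a range partition without split rows.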