Mailing-List: contact user-help@kudu.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@kudu.incubator.apache.org
MIME-Version: 1.0
In-Reply-To: <CALo2W-UJNa6A6RGck2-kj65H=3jg32MEJzfP5Batd9bqY6iC8w@mail.gmail.com>
References: <CADrk5qO3xiGBP1fvMSuOkmYQQ5=adGwAcnXudoZ3p8PSw=xzPw@mail.gmail.com>
	<CANqQv0uq6z_CwJZ4QWSppGng4w9kus4y8OSkvKJCJkSBvRUmJA@mail.gmail.com>
	<CADrk5qPk_UqqRAj8=mhu7V33CSBu7wQiwhiHAZEeHnnfaemjSQ@mail.gmail.com>
	<CADrk5qPKqLDdSB-RYA6BCqyHsK-nQG0fVvrsuystgA=qZO8faA@mail.gmail.com>
	<CAGpTDNcgL8sR4O0azOw4yTmY1OiMd+EqGsujfU36HQukxMXzwA@mail.gmail.com>
	<CADrk5qMzY=+viBbVXYgyAT4h_EDoijULTsdCeVaM+fK_kyrRmQ@mail.gmail.com>
	<CAGpTDNeCA7e0gGUzztAayEjZ9ODcOAN8uXBtFG+aKJmRTFSR-Q@mail.gmail.com>
	<CADrk5qOZtAn44zE3j_KXWABo_R1unR0v6J9Wb=HV+QO7-o=Wug@mail.gmail.com>
	<CALo2W-UJNa6A6RGck2-kj65H=3jg32MEJzfP5Batd9bqY6iC8w@mail.gmail.com>
Date: Sat, 7 May 2016 13:28:05 -0700
Message-ID: <CADrk5qM-ToaGeEGegBDPQByc_XQcBxWtDk8YTmM2Fg6p-NZLBw@mail.gmail.com>
Subject: Re: Partition and Split rows
From: Sand Stone <sand.m.stone@gmail.com>
To: user@kudu.incubator.apache.org
Content-Type: multipart/alternative; boundary=089e015372aca890f70532466992
archived-at: Sat, 07 May 2016 20:28:17 -0000

--089e015372aca890f70532466992
Content-Type: text/plain; charset=UTF-8

Thanks for sharing, Dan. The diagrams explained clearly how the current
system works.

Here is what I have in mind. Take a schema of <host,metric,time,...>. Say I
am interested in data for the past 5 minutes, 10 minutes, etc., or in
aggregates at 5-minute intervals over the past 3 days, 7 days, and so on.
It looks like I need to introduce a special 5-min bar column and range
partition on that column to spread the data across the tablet servers, so
that I can leverage parallel filtering.
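
To make the 5-min bar idea concrete, here is a minimal sketch in plain
Python. It does not use any Kudu API; the names bar_index and split_points
are made up for illustration, and it assumes the bar is counted from the
Unix epoch. Note that an index counted this way needs a wider integer than
an 8-bit one.

```python
from datetime import datetime, timedelta, timezone
from typing import List

BAR_SECONDS = 5 * 60  # width of one bar: 5 minutes


def bar_index(ts: datetime, bar_seconds: int = BAR_SECONDS) -> int:
    """Number of whole bars between the Unix epoch and ts (hypothetical helper)."""
    return int(ts.timestamp()) // bar_seconds


def split_points(start: datetime, end: datetime, tablets: int,
                 bar_seconds: int = BAR_SECONDS) -> List[int]:
    """Evenly spaced bar indexes that cut [start, end) into `tablets` ranges."""
    lo = bar_index(start, bar_seconds)
    hi = bar_index(end, bar_seconds)
    step = (hi - lo) / tablets
    # tablets - 1 interior boundaries produce `tablets` ranges.
    return [lo + round(step * i) for i in range(1, tablets)]


if __name__ == "__main__":
    start = datetime(2016, 5, 7, tzinfo=timezone.utc)
    # Candidate split values for 3 days of data spread over 4 ranges.
    print(split_points(start, start + timedelta(days=3), tablets=4))
```

The values returned by split_points are the constants one would have to
decide on up front when pre-creating the range buckets.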

The cost of this extra column (INT8) is not ideal, but not too bad either
(storage-wise, compression should do wonders). So I am wondering whether it
would be better to accept "functions" as split rows instead of only
constants. Of course, if the business later requires dropping down to a
1-min bar, the data has to be re-sharded. So a more cost-effective way of
doing this on a production cluster would be good.
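
A rough sketch of what I mean by a "function" as a split row, again in
plain Python. This is a hypothetical model, not something Kudu offers: the
class FunctionRangeRouter and its methods are invented for illustration.
The range key is derived from the time column by a function instead of
being stored, and each split value is treated as the inclusive lower bound
of the next range.

```python
from bisect import bisect_right
from typing import Callable, Sequence


class FunctionRangeRouter:
    """Hypothetical router: range key = key_fn(time), not a stored column."""

    def __init__(self, key_fn: Callable[[int], int], splits: Sequence[int]):
        self.key_fn = key_fn
        self.splits = sorted(splits)

    def tablet_for(self, epoch_seconds: int) -> int:
        # A key equal to a split value lands in the range above that split.
        return bisect_right(self.splits, self.key_fn(epoch_seconds))


# 5-min bars with boundaries at bars 10 and 20 (3000 s and 6000 s).
five_min = FunctionRangeRouter(lambda s: s // 300, [10, 20])
# 1-min bars with the same boundaries rescaled to bars 50 and 100.
one_min = FunctionRangeRouter(lambda s: s // 60, [50, 100])

for seconds in (2999, 3000, 5999, 6000):
    # Both routers pick the same range for every timestamp.
    assert five_min.tablet_for(seconds) == one_min.tablet_for(seconds)
```

In this model, dropping from a 5-min bar to a 1-min bar only changes the
function and rescales the split values; as long as the boundaries fall on
the same timestamps, rows stay in the same range.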


On Sat, May 7, 2016 at 8:50 AM, Dan Burkert <dan@cloudera.com> wrote:

> Hi Sand,
>
> I've been working on some diagrams to help explain some of the more
> advanced partitioning types; they're attached. They're still pretty rough
> at this point, but the goal is to clean them up and move them into the
> Kudu documentation proper. I'm interested to hear what kind of time series
> you are interested in Kudu for. I'm tasked with improving Kudu for time
> series; you can follow progress here
> <https://issues.apache.org/jira/browse/KUDU-1306>. If you have any
> additional ideas I'd love to hear them. You may also be interested in a
> small project that J-D and I have been working on in the past week to
> build an OpenTSDB-style store on top of Kudu; you can find it here
> <https://github.com/danburkert/kudu-ts>. It's still quite feature-limited
> at this point.
>
> - Dan
>
> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sand.m.stone@gmail.com> wrote:
>
>> Thanks. Will read.
>>
>> Given that I am researching time series data, row locality is crucial :-)
>>
>>
>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>> wrote:
>>
>>> We do have non-covering range partitions coming in the next few months,
>>> here's the design (in review):
>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>
>>> The "Background & Motivation" section should give you a good idea of why
>>> I'm mentioning this.
>>>
>>> Meanwhile, if you don't need row locality, using hash partitioning could
>>> be good enough.
>>>
>>> J-D
>>>
>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sand.m.stone@gmail.com>
>>> wrote:
>>>
>>>> Makes sense.
>>>>
>>>> Yeah it would be cool if users could specify/control the split rows
>>>> after the table is created. Now, I have to "think ahead" to pre-create the
>>>> range buckets.
>>>>
>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <jdcryans@apache.org
>>>> > wrote:
>>>>
>>>>> You will only get 1 tablet and no data distribution, which is bad.
>>>>>
>>>>> That's also how HBase works, but it will split regions as you insert
>>>>> data and eventually you'll get some data distribution even if it doesn't
>>>>> start in an ideal situation. Tablet splitting will come later for Kudu.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> One more question: how does range partitioning work if I don't
>>>>>> specify the split rows?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>>>>
>>>>>>> I was just reading the Java API, CreateTableOptions.java; it's
>>>>>>> unclear how the range partition column names are associated with
>>>>>>> the partial row params in the addSplitRow API.
>>>>>>>
>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>>>
>>>>>>>> Hi Sand,
>>>>>>>>
>>>>>>>> Please have a look at
>>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>>> and see if it is helpful to you.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Misty
>>>>>>>>
>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi, I am new to Kudu. I wonder how split rows work. I know
>>>>>>>>> from some docs that this is currently for pre-creating the table. I am
>>>>>>>>> researching how to partition (hash+range) some time series test data.
>>>>>>>>>
>>>>>>>>> Is there an example, or notes somewhere I could read up on?
>>>>>>>>>
>>>>>>>>> Thanks much.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

--089e015372aca890f70532466992--