kudu-user mailing list archives

From Sand Stone <sand.m.st...@gmail.com>
Subject Re: Partition and Split rows
Date Sat, 07 May 2016 20:28:05 GMT
Thanks for sharing, Dan. The diagrams explained clearly how the current
system works.

As for what I have in mind: take a schema of <host,metric,time,...>. Say I
am interested in data for the past 5 mins, 10 mins, etc., or in aggregating
at 5-min intervals over the past 3 days, 7 days, ... It looks like I need to
introduce a special 5-min bar column and range partition on that column to
spread data across the tablet servers, so that I can leverage parallel
filtering.
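
A minimal sketch of the 5-min bar idea, assuming the bar is derived from the
row's timestamp as a count of whole 5-minute intervals since the Unix epoch.
The names `bar_index`, `split_points`, and `bars_per_tablet` are hypothetical
and are not part of any Kudu API; the resulting integers are only candidate
values that could be used as range split points.

```python
from datetime import datetime, timedelta, timezone

BAR_SECONDS = 5 * 60  # width of one bar in seconds (hypothetical constant)


def bar_index(ts: datetime, bar_seconds: int = BAR_SECONDS) -> int:
    """Number of whole bars between the Unix epoch and ts."""
    # An epoch-based index like this does not fit in 8 bits; it would need
    # a 32- or 64-bit integer column.
    return int(ts.timestamp()) // bar_seconds


def split_points(start, end, bars_per_tablet, bar_seconds=BAR_SECONDS):
    """Bar indexes at which a new range partition would begin."""
    first = bar_index(start, bar_seconds)
    last = bar_index(end, bar_seconds)
    return list(range(first + bars_per_tablet, last + 1, bars_per_tablet))


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    three_days_ago = now - timedelta(days=3)
    # One split point per hour of data (12 five-minute bars per range).
    points = split_points(three_days_ago, now, bars_per_tablet=12)
    print(len(points), "split points; first:", points[0])
```

Because the split points are plain constants, they have to be computed ahead
of time for the whole time window the table is expected to cover.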

The cost of this extra column (INT8) is not ideal but not too bad either
(storage-wise, compression should do wonders). So I am wondering whether it
would be better to accept "functions" as row splits instead of only
constants. Of course, if the business requires dropping down to a 1-min bar,
the data has to be re-sharded. So a more cost-effective way of doing this
on a production cluster would be good.
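
To illustrate why a change of bar width forces a rewrite, here is a small
sketch under the assumption that the bar column stores
`epoch_seconds // bar_seconds`. The helper names are hypothetical. Each stored
5-min value maps to five distinct 1-min values, so the column cannot simply be
reinterpreted in place.

```python
def bar_index(epoch_seconds: int, bar_seconds: int) -> int:
    """Index of the bar that contains the given epoch timestamp."""
    return epoch_seconds // bar_seconds


def finer_bars(coarse_index: int, coarse_seconds: int, fine_seconds: int) -> range:
    """Fine-grained bar indexes covered by one coarse bar."""
    if coarse_seconds % fine_seconds != 0:
        raise ValueError("coarse width must be a multiple of the fine width")
    ratio = coarse_seconds // fine_seconds
    return range(coarse_index * ratio, (coarse_index + 1) * ratio)


if __name__ == "__main__":
    ts = 725  # seconds since the epoch, chosen only for illustration
    coarse = bar_index(ts, 300)  # 5-min bar
    fine = bar_index(ts, 60)     # 1-min bar
    # The 1-min bar of a row always falls inside the range derived from
    # its 5-min bar, but it cannot be recovered from the 5-min bar alone.
    print(coarse, fine, fine in finer_bars(coarse, 300, 60))
```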




On Sat, May 7, 2016 at 8:50 AM, Dan Burkert <dan@cloudera.com> wrote:

> Hi Sand,
>
> I've been working on some diagrams to help explain some of the more
> advanced partitioning types; they're attached. They're still pretty rough
> at this point, but the goal is to clean them up and move them into the Kudu
> documentation proper. I'm interested to hear what kind of time series you
> are interested in using Kudu for. I'm tasked with improving Kudu for time
> series; you can follow progress here
> <https://issues.apache.org/jira/browse/KUDU-1306>. If you have any
> additional ideas I'd love to hear them. You may also be interested in a
> small project that J-D and I have been working on over the past week to
> build an OpenTSDB-style store on top of Kudu; you can find it here
> <https://github.com/danburkert/kudu-ts>. It is still quite feature-limited
> at this point.
>
> - Dan
>
> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sand.m.stone@gmail.com> wrote:
>
>> Thanks. Will read.
>>
>> Given that I am researching time series data, row locality is crucial :-)
>>
>>
>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>> wrote:
>>
>>> We do have non-covering range partitions coming in the next few months,
>>> here's the design (in review):
>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>
>>> The "Background & Motivation" section should give you a good idea of why
>>> I'm mentioning this.
>>>
>>> Meanwhile, if you don't need row locality, using hash partitioning could
>>> be good enough.
>>>
>>> J-D
>>>
>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sand.m.stone@gmail.com>
>>> wrote:
>>>
>>>> Makes sense.
>>>>
>>>> Yeah it would be cool if users could specify/control the split rows
>>>> after the table is created. Now, I have to "think ahead" to pre-create the
>>>> range buckets.
>>>>
>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>>>> wrote:
>>>>
>>>>> You will only get 1 tablet and no data distribution, which is bad.
>>>>>
>>>>> That's also how HBase works, but it will split regions as you insert
>>>>> data and eventually you'll get some data distribution even if it doesn't
>>>>> start in an ideal situation. Tablet splitting will come later for Kudu.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> One more question: how does the range partition work if I don't
>>>>>> specify the split rows?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks, Misty. The "advanced" Impala example helped.
>>>>>>>
>>>>>>> I was just reading the Java API, CreateTableOptions.java; it's
>>>>>>> unclear how the range partition column names are associated with the
>>>>>>> partial rows params in the addSplitRow API.
>>>>>>>
>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>>>
>>>>>>>> Hi Sand,
>>>>>>>>
>>>>>>>> Please have a look at
>>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>>> and see if it is helpful to you.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Misty
>>>>>>>>
>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I know
>>>>>>>>> from some docs that this is currently for pre-creating the table.
>>>>>>>>> I am researching how to partition (hash+range) some time series
>>>>>>>>> test data.
>>>>>>>>>
>>>>>>>>> Is there an example, or notes somewhere I could read up on?
>>>>>>>>>
>>>>>>>>> Thanks much.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
