kudu-user mailing list archives

From Dan Burkert <...@cloudera.com>
Subject Re: Partition and Split rows
Date Thu, 12 May 2016 17:50:09 GMT
Hey Sand,

Sorry for the delayed response.  I'm not quite following your use case.  Is
the requirement to pre-aggregate by time window? I don't think Kudu can
help you directly with that (nothing built in), but you could always create
a separate table to store the pre-aggregated values.  As far as applying
functions to do row splits, that is an interesting idea, but I think once
Kudu has support for range bounds (the non-covering range partition design
doc linked above), you can simply create the bounds where the function
would have put them.  For example, if you want a partition for every five
minutes, you can create the bounds accordingly.
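A minimal sketch of what "creating the bounds accordingly" could look like, in plain Python. It only computes the (lower, upper) pairs for five-minute windows; it does not call any Kudu API. The helper names `to_micros` and `range_bounds` are hypothetical, and expressing the bounds as microseconds since the Unix epoch is an assumption about how the time column is stored.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_micros(dt):
    """Convert an aware datetime to integer microseconds since the Unix epoch."""
    return (dt - EPOCH) // timedelta(microseconds=1)


def range_bounds(start, end, width=timedelta(minutes=5)):
    """Return contiguous (lower, upper) bounds covering [start, end).

    Each pair is a half-open interval [lower, upper); the last interval is
    clipped to `end` if the span is not a multiple of `width`.
    """
    bounds = []
    lower = start
    while lower < end:
        upper = min(lower + width, end)
        bounds.append((to_micros(lower), to_micros(upper)))
        lower = upper
    return bounds


# Hypothetical example: three five-minute partitions starting at 17:00 UTC.
start = datetime(2016, 5, 12, 17, 0, tzinfo=timezone.utc)
bounds = range_bounds(start, start + timedelta(minutes=15))
for lower, upper in bounds:
    print(lower, upper)
```

Each pair would then be handed to whatever partition-creation call the client exposes once range bounds are supported; that step is deliberately left out here.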

Earlier this week I gave a talk on time series in Kudu; I've included some
slides that may be interesting to you.  Additionally, you may want to check
out https://github.com/danburkert/kudu-ts. It's a very young (not
feature-complete) metrics layer on top of Kudu, and it may give you some ideas.

- Dan

On Sat, May 7, 2016 at 1:28 PM, Sand Stone <sand.m.stone@gmail.com> wrote:

> Thanks for sharing, Dan. The diagrams explained clearly how the current
> system works.
> As for what I have in mind: take the schema <host,metric,time,...>. Say
> I am interested in data for the past 5 mins, 10 mins, etc., or in aggregates at
> 5-min intervals for the past 3 days, 7 days, ... It looks like I need to
> introduce a special 5-min bar column and use that column for range partitioning
> to spread data across the tablet servers so that I can leverage parallel
> filtering.
> The cost of this extra column (INT8) is not ideal but not too bad either
> (storage cost wise, compression should do wonders). So I am wondering
> whether it would be better to accept "functions" as row splits instead of only
> constants. Of course, if the business requires dropping down to a 1-min bar, the
> data has to be re-sharded. So a more cost-effective way of doing this
> on a production cluster would be good.
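A rough sketch of the "bar column" idea described above, in plain Python with no Kudu calls. The function names `bar` and `aggregate`, the sample rows, and the use of Unix timestamps in seconds are all assumptions made for illustration. It shows how a derived window index groups rows for aggregation, and why changing the window width changes every row's bucket.

```python
from collections import defaultdict


def bar(ts_seconds, width_seconds=300):
    """Index of the fixed-width window containing a Unix timestamp (seconds)."""
    return ts_seconds // width_seconds


def aggregate(rows, width_seconds=300):
    """Sum values per (host, metric, bar) for rows of (host, metric, ts, value)."""
    totals = defaultdict(float)
    for host, metric, ts, value in rows:
        totals[(host, metric, bar(ts, width_seconds))] += value
    return dict(totals)


# Hypothetical sample data: two rows in one 5-min window, one in the next.
rows = [
    ("host1", "cpu", 600, 1.0),
    ("host1", "cpu", 899, 2.0),
    ("host1", "cpu", 900, 4.0),
]

five_min = aggregate(rows)                   # two 5-min bars
one_min = aggregate(rows, width_seconds=60)  # three 1-min bars
print(five_min)
print(one_min)
```

Because the bar value is baked into each stored row, moving from 5-min to 1-min bars means recomputing the column for all existing data, which is the re-sharding cost the message mentions.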
> On Sat, May 7, 2016 at 8:50 AM, Dan Burkert <dan@cloudera.com> wrote:
>> Hi Sand,
>> I've been working on some diagrams to help explain some of the more
>> advanced partitioning types; they're attached.  They're still pretty rough
>> at this point, but the goal is to clean them up and move them into the Kudu
>> documentation proper.  I'm interested to hear what kind of time series you
>> are interested in Kudu for.  I'm tasked with improving Kudu for time
>> series; you can follow progress here
>> <https://issues.apache.org/jira/browse/KUDU-1306>. If you have any
>> additional ideas I'd love to hear them.  You may also be interested in a
>> small project that J-D and I have been working on in the past week to
>> build an OpenTSDB-style store on top of Kudu; you can find it here
>> <https://github.com/danburkert/kudu-ts>.  It is still quite feature-limited
>> at this point.
>> - Dan
>> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sand.m.stone@gmail.com>
>> wrote:
>>> Thanks. Will read.
>>> Given that I am researching time series data, row locality is crucial
>>> :-)
>>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>>> wrote:
>>>> We do have non-covering range partitions coming in the next few months,
>>>> here's the design (in review):
>>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>> The "Background & Motivation" section should give you a good idea of
>>>> why I'm mentioning this.
>>>> Meanwhile, if you don't need row locality, using hash partitioning
>>>> could be good enough.
>>>> J-D
>>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sand.m.stone@gmail.com>
>>>> wrote:
>>>>> Makes sense.
>>>>> Yeah it would be cool if users could specify/control the split rows
>>>>> after the table is created. Now, I have to "think ahead" to pre-create
>>>>> range buckets.
>>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <
>>>>> jdcryans@apache.org> wrote:
>>>>>> You will only get 1 tablet and no data distribution, which is bad.
>>>>>> That's also how HBase works, but it will split regions as you insert
>>>>>> data and eventually you'll get some data distribution even if it
>>>>>> start in an ideal situation. Tablet splitting will come later for
>>>>>> J-D
>>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>>> wrote:
>>>>>>> One more question: how does the range partition work if I don't
>>>>>>> specify the split rows?
>>>>>>> Thanks!
>>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>>>> wrote:
>>>>>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>>>>> I was just reading the Java API, CreateTableOptions.java, and it is
>>>>>>>> unclear how the range partition column names are associated with the
>>>>>>>> partial rows params in the addSplitRow API.
>>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>>>>> Hi Sand,
>>>>>>>>> Please have a look at
>>>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>>>> and see if it is helpful to you.
>>>>>>>>> Thanks,
>>>>>>>>> Misty
>>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <sand.m.stone@gmail.com
>>>>>>>>> > wrote:
>>>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I know
>>>>>>>>>> from some docs that this is currently for pre-creating the table.
>>>>>>>>>> I am researching how to partition (hash+range) some time series
>>>>>>>>>> test data. Is there an example, or notes somewhere I could read?
>>>>>>>>>> Thanks much.
