kudu-user mailing list archives

From Sand Stone <sand.m.st...@gmail.com>
Subject Re: Partition and Split rows
Date Thu, 12 May 2016 18:05:55 GMT
> Is the requirement to pre-aggregate by time window?
No, I am thinking of creating a column, say "minute". It's basically the
minute field of the timestamp column (or even rounded to a 5-min bucket,
depending on the needs). So it's a computed column filled in at data ingestion.
My goal is for this field to help with data filtering at read/query
time, say selecting a certain projection at minutes 10-15, to speed up the read.
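
A minimal sketch of the computed column described above, in plain Python rather than any Kudu client API. The function name minute_bucket, the width parameter, and the sample row are assumptions made for illustration only; the idea is simply to round the minute-of-hour down to a multiple of the bucket width at ingestion time.

```python
from datetime import datetime, timezone


def minute_bucket(ts: datetime, width: int = 5) -> int:
    """Return the minute-of-hour of ts, rounded down to a multiple of width.

    With width=5 the result is one of 0, 5, 10, ..., 55, which fits in INT8.
    """
    if not 1 <= width <= 60:
        raise ValueError("width must be between 1 and 60 minutes")
    return (ts.minute // width) * width


# Hypothetical row following the <host, metric, time, ...> schema from the thread.
row_ts = datetime(2016, 5, 12, 18, 13, 55, tzinfo=timezone.utc)
row = {
    "host": "host-1",
    "metric": "cpu",
    "time": row_ts,
    "minute": minute_bucket(row_ts),  # computed once, at ingestion
}

# A read for "minutes 10-15" then becomes an equality filter on the bucket.
matches = row["minute"] == 10
```

Dropping to a 1-minute bar would mean recomputing the column with width=1, which is the re-sharding cost mentioned in the quoted message below.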

Thanks for the info; I will follow up on them.

On Thu, May 12, 2016 at 10:50 AM, Dan Burkert <dan@cloudera.com> wrote:

> Hey Sand,
> Sorry for the delayed response.  I'm not quite following your use case.
> Is the requirement to pre-aggregate by time window? I don't think Kudu can
> help you directly with that (nothing built in), but you could always create
> a separate table to store the pre-aggregated values.  As far as applying
> functions to do row splits, that is an interesting idea, but I think once
> Kudu has support for range bounds (the non-covering range partition design
> doc linked above), you can simply create the bounds where the function
> would have put them.  For example, if you want a partition for every five
> minutes, you can create the bounds accordingly.
> Earlier this week I gave a talk on time series in Kudu; I've included some
> slides that may be interesting to you.  Additionally, you may want to check
> out https://github.com/danburkert/kudu-ts. It's a very young (not
> feature complete) metrics layer on top of Kudu, but it may give you some ideas.
> - Dan
> On Sat, May 7, 2016 at 1:28 PM, Sand Stone <sand.m.stone@gmail.com> wrote:
>> Thanks for sharing, Dan. The diagrams explained clearly how the current
>> system works.
>> As for what I have in mind: take the schema <host,metric,time,...>. Say
>> I am interested in data for the past 5 mins, 10 mins, etc., or in aggregating
>> at 5-min intervals for the past 3 days, 7 days, ... It looks like I need to
>> introduce a special 5-min bar column and use that column for range partitioning
>> to spread data across the tablet servers so that I can leverage parallel
>> filtering.
>> The cost of this extra column (INT8) is not ideal but not too bad either
>> (storage-cost-wise, compression should do wonders). So I am wondering
>> whether it would be better to accept "functions" as row splits instead of only
>> constants. Of course, if the business requires dropping down to a 1-min bar, the
>> data has to be re-sharded. So a more cost-effective way of doing this
>> on a production cluster would be good.
>> On Sat, May 7, 2016 at 8:50 AM, Dan Burkert <dan@cloudera.com> wrote:
>>> Hi Sand,
>>> I've been working on some diagrams to help explain some of the more
>>> advanced partitioning types, it's attached.   Still pretty rough at this
>>> point, but the goal is to clean it up and move it into the Kudu
>>> documentation proper.  I'm interested to hear what kind of time series you
>>> are interested in Kudu for.  I'm tasked with improving Kudu for time
>>> series, you can follow progress here
>>> <https://issues.apache.org/jira/browse/KUDU-1306>. If you have any
>>> additional ideas I'd love to hear them.  You may also be interested in a
>>> small project that JD and I have been working on in the past week to
>>> build an OpenTSDB style store on top of Kudu, you can find it here
>>> <https://github.com/danburkert/kudu-ts>.  Still quite feature limited
>>> at this point.
>>> - Dan
>>> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sand.m.stone@gmail.com>
>>> wrote:
>>>> Thanks. Will read.
>>>> Given that I am researching time series data, row locality is crucial
>>>> :-)
>>>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>>> We do have non-covering range partitions coming in the next few
>>>>> months, here's the design (in review):
>>>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>>> The "Background & Motivation" section should give you a good idea
>>>>> why I'm mentioning this.
>>>>> Meanwhile, if you don't need row locality, using hash partitioning
>>>>> could be good enough.
>>>>> J-D
>>>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>> wrote:
>>>>>> Makes sense.
>>>>>> Yeah it would be cool if users could specify/control the split rows
>>>>>> after the table is created. Now, I have to "think ahead" to pre-create
>>>>>> range buckets.
>>>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <
>>>>>> jdcryans@apache.org> wrote:
>>>>>>> You will only get 1 tablet and no data distribution, which is
>>>>>>> That's also how HBase works, but it will split regions as you
>>>>>>> data and eventually you'll get some data distribution even if it doesn't
>>>>>>> start in an ideal situation. Tablet splitting will come later for Kudu.
>>>>>>> J-D
>>>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>>>> wrote:
>>>>>>>> One more question: how does range partitioning work if I don't
>>>>>>>> specify the split rows?
>>>>>>>> Thanks!
>>>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>>>>>> I was just reading the Java API (CreateTableOptions.java), and it is
>>>>>>>>> unclear to me how the range partition column names are associated
>>>>>>>>> with the partial row params in the addSplitRow API.
>>>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>>>>>> Hi Sand,
>>>>>>>>>> Please have a look at
>>>>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>>>>> and see if it is helpful to you.
>>>>>>>>>> Thanks,
>>>>>>>>>> Misty
>>>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <
>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I know
>>>>>>>>>>> from some docs that this is currently for pre-creating the table. I am
>>>>>>>>>>> researching how to partition (hash+range) some time series test data.
>>>>>>>>>>> Is there an example, or notes somewhere I could read up on?
>>>>>>>>>>> Thanks much.
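
The split-row behaviour discussed in the quoted thread can be sketched with a small model in plain Python. This is not Kudu's API: the names five_minute_splits and tablet_index are hypothetical, and the sketch assumes each split value is the inclusive lower bound of the next tablet. It shows a range partition on a minute-of-hour column split every five minutes, as suggested above, and the one-tablet outcome when no split rows are given.

```python
from bisect import bisect_right


def five_minute_splits(width: int = 5) -> list[int]:
    """Split points on a minute-of-hour column: 5, 10, ..., 55 for width=5."""
    return list(range(width, 60, width))


def tablet_index(key: int, splits: list[int]) -> int:
    """Index of the tablet that holds key, given sorted split points.

    Assumption: a split value belongs to the tablet that starts at it,
    so N split points produce N + 1 tablets.
    """
    return bisect_right(splits, key)


splits = five_minute_splits()
tablet_count = len(splits) + 1  # 11 split points give 12 tablets

# Minute 12 falls in the [10, 15) tablet, which is index 2.
idx = tablet_index(12, splits)

# With no split rows, every key lands in the single tablet 0.
only_tablet = {tablet_index(m, []) for m in range(60)}
```

In this model the split points are fixed constants chosen up front, which is why the bucket width has to be decided before the table is created.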
