kudu-user mailing list archives

From Sand Stone <sand.m.st...@gmail.com>
Subject Re: Partition and Split rows
Date Thu, 12 May 2016 18:39:04 GMT
Thanks, Dan.

In your scheme, I assume you suggest range partitioning on the timestamp.
I don't know how Kudu load-balances the data across the tablet servers. For
example, do I need to pre-calculate, every day, a list of timestamps 5
minutes apart at table creation? [assume I have to create a new table every

My hope is that, by adding the 5-min column and using it as the range
partition column, I could spread the data evenly across the
tablet servers. Once partition-level deletion works, I won't need to
re-create a table.
Also, since data in the same 5-min interval is always colocated, read
queries could be efficient too.

P.S: there are some cases I would like to compute aggregations across all
metrics at 5-min intervals.
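
A minimal sketch of the pre-calculation asked about above, in plain Python with no Kudu client calls. It assumes the range column holds microseconds since the Unix epoch in UTC and that N split points produce N+1 range tablets; the function name and the bucket_minutes parameter are hypothetical, for illustration only.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def five_minute_split_points(day, bucket_minutes=5):
    """Return the interior bucket boundaries of one UTC day.

    Values are microseconds since the Unix epoch (an assumption about
    how the range column is encoded). The start of the day is not
    included, so a day with 288 buckets yields 287 split points.
    """
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    step = timedelta(minutes=bucket_minutes)
    buckets_per_day = (24 * 60) // bucket_minutes
    return [
        (start + i * step - EPOCH) // timedelta(microseconds=1)
        for i in range(1, buckets_per_day)
    ]


if __name__ == "__main__":
    points = five_minute_split_points(date(2016, 5, 12))
    print(len(points), points[0], points[-1])
```

Each value would then be supplied as one split row on the range column at table creation. Whether one tablet per 5-minute bucket is a sensible tablet count for a given cluster is a separate question that this sketch does not address.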

On Thu, May 12, 2016 at 11:13 AM, Dan Burkert <dan@cloudera.com> wrote:

> Forgot to add the PK specification to the CREATE TABLE, it should have
> read as follows:
> CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE)
> PRIMARY KEY (metric, time);
> - Dan
> On Thu, May 12, 2016 at 11:12 AM, Dan Burkert <dan@cloudera.com> wrote:
>> On Thu, May 12, 2016 at 11:05 AM, Sand Stone <sand.m.stone@gmail.com>
>> wrote:
>>> > Is the requirement to pre-aggregate by time window?
>>> No, I am thinking to create a column say, "minute". It's basically the
>>> minute field of the timestamp column(even round to 5-min bucket depending
>>> on the needs). So it's a computed column being filled in on data ingestion.
>>> My goal is that this field would help with data filtering at read/query
>>> time, say select certain projection at minute 10-15, to speed up the read
>>> queries.
>> In many cases, Kudu can do this for you without having to add special
>> columns.  The requirements are that the timestamp is part of the primary
>> key, and any columns that come before the timestamp in the primary key (if
>> it's a compound PK), have equality predicates.  So for instance, if you
>> create a table such as:
>> CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE);
>> then queries such as
>> SELECT time, value FROM metrics WHERE metric = "my-metric" AND time >
>> 2016-05-01T00:00 AND time < 2016-05-01T00:05
>> Then only the data for that 5 minute time window will be read from disk.
>> If the query didn't have the equality predicate on the 'metric' column,
>> then it would do a much bigger scan + filter operation.  If you want more
>> background on how this is achieved, check out the partition pruning design
>> doc:
>> https://github.com/apache/incubator-kudu/blob/master/docs/design-docs/scan-optimization-partition-pruning.md
>> .
>> - Dan
>>> Thanks for the info., I will follow them.
>>> On Thu, May 12, 2016 at 10:50 AM, Dan Burkert <dan@cloudera.com> wrote:
>>>> Hey Sand,
>>>> Sorry for the delayed response.  I'm not quite following your use
>>>> case.  Is the requirement to pre-aggregate by time window? I don't think
>>>> Kudu can help you directly with that (nothing built in), but you could
>>>> always create a separate table to store the pre-aggregated values.  As far
>>>> as applying functions to do row splits, that is an interesting idea, but
>>>> I think once Kudu has support for range bounds (the non-covering range
>>>> partition design doc linked above), you can simply create the bounds where
>>>> the function would have put them.  For example, if you want a partition for
>>>> every five minutes, you can create the bounds accordingly.
>>>> Earlier this week I gave a talk on timeseries in Kudu, I've included
>>>> some slides that may be interesting to you.  Additionally, you may want to
>>>> check out https://github.com/danburkert/kudu-ts, it's a very young
>>>>  (not feature complete) metrics layer on top of Kudu, it may give you some
>>>> ideas.
>>>> - Dan
>>>> On Sat, May 7, 2016 at 1:28 PM, Sand Stone <sand.m.stone@gmail.com>
>>>> wrote:
>>>>> Thanks for sharing, Dan. The diagrams explained clearly how the
>>>>> current system works.
>>>>> As for things in my mind. Take the schema of <host,metric,time,...>,
>>>>> say, I am interested in data for the past 5 mins, 10 mins, etc. Or,
>>>>> aggregate at 5 mins interval for the past 3 days, 7 days, ... Looks like
>>>>> I need to introduce a special 5-min bar column, use that column to do range
>>>>> partition to spread data across the tablet servers so that I could leverage
>>>>> parallel filtering.
>>>>> The cost of this extra column (INT8) is not ideal but not too bad
>>>>> either (storage cost wise, compression should do wonders). So I am thinking
>>>>> whether it would be better to take "functions" as row split instead of
>>>>> constants. Of course if business requires to drop down to 1-min bar,
>>>>> data has to be re-sharded again. So a more cost effective way of doing this
>>>>> on a production cluster would be good.
>>>>> On Sat, May 7, 2016 at 8:50 AM, Dan Burkert <dan@cloudera.com> wrote:
>>>>>> Hi Sand,
>>>>>> I've been working on some diagrams to help explain some of the more
>>>>>> advanced partitioning types, it's attached. Still pretty rough at this
>>>>>> point, but the goal is to clean it up and move it into the Kudu
>>>>>> documentation proper.  I'm interested to hear what kind of time series
>>>>>> you are interested in Kudu for. I'm tasked with improving Kudu for time
>>>>>> series, you can follow progress here
>>>>>> <https://issues.apache.org/jira/browse/KUDU-1306>. If you have
>>>>>> additional ideas I'd love to hear them. You may also be interested in a
>>>>>> small project that JD and I have been working on in the past week to
>>>>>> build an OpenTSDB-style store on top of Kudu, you can find it here
>>>>>> <https://github.com/danburkert/kudu-ts>.  Still quite feature
>>>>>> limited at this point.
>>>>>> - Dan
>>>>>> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>>> wrote:
>>>>>>> Thanks. Will read.
>>>>>>> Given that I am researching time series data, row locality is
>>>>>>> crucial :-)
>>>>>>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <
>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>> We do have non-covering range partitions coming in the next
>>>>>>>> months, here's the design (in review):
>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>>>>>> The "Background & Motivation" section should give you a good idea
>>>>>>>> of why I'm mentioning this.
>>>>>>>> Meanwhile, if you don't need row locality, using hash partitioning
>>>>>>>> could be good enough.
>>>>>>>> J-D
>>>>>>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> Makes sense.
>>>>>>>>> Yeah it would be cool if users could specify/control the split
>>>>>>>>> rows after the table is created. Now, I have to "think ahead" to pre-create
>>>>>>>>> the range buckets.
>>>>>>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <
>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>> You will only get 1 tablet and no data distribution, which is bad.
>>>>>>>>>> That's also how HBase works, but it will split regions as you
>>>>>>>>>> insert data and eventually you'll get some data distribution even if it
>>>>>>>>>> doesn't start in an ideal situation. Tablet splitting will come later for
>>>>>>>>>> Kudu.
>>>>>>>>>> J-D
>>>>>>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <
>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>> One more question: how does the range partition work if I don't
>>>>>>>>>>> specify the split rows?
>>>>>>>>>>> Thanks!
>>>>>>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <
>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>> Thanks, Misty. The "advanced" impala example
>>>>>>>>>>>> I was just reading the Java API, CreateTableOptions.java, and it is
>>>>>>>>>>>> unclear how the range partition column names are associated with the
>>>>>>>>>>>> partial row params in the addSplitRow API.
>>>>>>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>>>>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>>>>>>>>> Hi Sand,
>>>>>>>>>>>>> Please have a look at
>>>>>>>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>>>>>>>> and see if it is helpful to you.
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Misty
>>>>>>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <
>>>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I
>>>>>>>>>>>>>> know from some docs that this is currently for pre-creating the table. I am
>>>>>>>>>>>>>> researching how to partition (hash+range) some time series test data.
>>>>>>>>>>>>>> Is there an example, or notes somewhere I could read up on?
>>>>>>>>>>>>>> Thanks much.
