kudu-user mailing list archives

From Dan Burkert <...@cloudera.com>
Subject Re: Partition and Split rows
Date Thu, 12 May 2016 18:12:32 GMT
On Thu, May 12, 2016 at 11:05 AM, Sand Stone <sand.m.stone@gmail.com> wrote:

> > Is the requirement to pre-aggregate by time window?
> No, I am thinking to create a column say, "minute". It's basically the
> minute field of the timestamp column(even round to 5-min bucket depending
> on the needs). So it's a computed column being filled in on data ingestion.
> My goal is that this field would help with data filtering at read/query
> time, say select certain projection at minute 10-15, to speed up the read
> queries.
>

In many cases, Kudu can do this for you without having to add special
columns.  The requirements are that the timestamp is part of the primary
key, and that any columns that come before the timestamp in the primary key
(if it's a compound PK) have equality predicates.  So for instance, if you
create a table such as:

CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE,
                      PRIMARY KEY (metric, time));

then for queries such as

SELECT time, value FROM metrics WHERE metric = 'my-metric' AND time >
'2016-05-01T00:00' AND time < '2016-05-01T00:05'

only the data for that 5 minute time window will be read from disk.
If the query didn't have the equality predicate on the 'metric' column,
then it would do a much bigger scan + filter operation.  If you want more
background on how this is achieved, check out the partition pruning design
doc:
https://github.com/apache/incubator-kudu/blob/master/docs/design-docs/scan-optimization-partition-pruning.md
.
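
As a rough illustration of why the equality predicate on the leading key
column matters, here is a toy Python model. It is not Kudu code and does not
claim to show how Kudu is implemented internally; it only assumes that rows
are kept sorted by the compound primary key (metric, time). The names rows,
keys, scan_with_prefix and scan_without_prefix are made up for this sketch.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta

# Toy data: three metrics, one sample per minute for an hour,
# kept sorted by the compound primary key (metric, time).
start = datetime(2016, 5, 1, 0, 0)
rows = sorted(
    (metric, start + timedelta(minutes=m), float(m))
    for metric in ("cpu", "my-metric", "net")
    for m in range(60)
)
# Stand-in for a primary key index: just the sorted key tuples.
keys = [(metric, time) for metric, time, _ in rows]


def scan_with_prefix(keys, rows, metric, lo, hi):
    """Return rows with the given metric and lo < time < hi.

    Because the keys are sorted, the matching rows form one contiguous
    slice that two binary searches can locate.
    """
    first = bisect_right(keys, (metric, lo))  # first key after (metric, lo)
    last = bisect_left(keys, (metric, hi))    # first key at or after (metric, hi)
    return rows[first:last]


def scan_without_prefix(rows, lo, hi):
    """Without an equality predicate on metric, every row is examined."""
    return [row for row in rows if lo < row[1] < hi]


if __name__ == "__main__":
    lo = datetime(2016, 5, 1, 0, 0)
    hi = datetime(2016, 5, 1, 0, 5)
    print(len(scan_with_prefix(keys, rows, "my-metric", lo, hi)))
    print(len(scan_without_prefix(rows, lo, hi)))
```

In the first function the time range maps to a single slice of the sorted
keys, which is the property the query above relies on. In the second, matching
rows are scattered across every metric, so the whole data set has to be
filtered.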

- Dan



> Thanks for the info, I will follow them.
>
> On Thu, May 12, 2016 at 10:50 AM, Dan Burkert <dan@cloudera.com> wrote:
>
>> Hey Sand,
>>
>> Sorry for the delayed response.  I'm not quite following your use case.
>> Is the requirement to pre-aggregate by time window? I don't think Kudu can
>> help you directly with that (nothing built in), but you could always create
>> a separate table to store the pre-aggregated values.  As far as applying
>> functions to do row splits, that is an interesting idea, but I think once
>> Kudu has support for range bounds (the non-covering range partition design
>> doc linked above), you can simply create the bounds where the function
>> would have put them.  For example, if you want a partition for every five
>> minutes, you can create the bounds accordingly.
>>
>> Earlier this week I gave a talk on timeseries in Kudu; I've included some
>> slides that may be interesting to you.  Additionally, you may want to check
>> out https://github.com/danburkert/kudu-ts. It's a very young (not
>> feature complete) metrics layer on top of Kudu, and it may give you some ideas.
>>
>> - Dan
>>
>> On Sat, May 7, 2016 at 1:28 PM, Sand Stone <sand.m.stone@gmail.com>
>> wrote:
>>
>>> Thanks for sharing, Dan. The diagrams explained clearly how the current
>>> system works.
>>>
>>> As for what I have in mind: take the schema <host,metric,time,...>.
>>> Say I am interested in data for the past 5 mins, 10 mins, etc., or in
>>> aggregates at 5-min intervals for the past 3 days, 7 days, ... It looks like I
>>> need to introduce a special 5-min bar column and use that column to do range
>>> partitioning to spread data across the tablet servers so that I can leverage
>>> parallel filtering.
>>>
>>> The cost of this extra column (INT8) is not ideal but not too bad either
>>> (storage cost wise, compression should do wonders). So I am wondering
>>> whether it would be better to accept "functions" as row splits instead of only
>>> constants. Of course, if the business requires dropping down to a 1-min bar, the
>>> data has to be re-sharded. So a more cost-effective way of doing this
>>> on a production cluster would be good.
>>>
>>>
>>>
>>>
>>> On Sat, May 7, 2016 at 8:50 AM, Dan Burkert <dan@cloudera.com> wrote:
>>>
>>>> Hi Sand,
>>>>
>>>> I've been working on some diagrams to help explain some of the more
>>>> advanced partitioning types, it's attached.   Still pretty rough at this
>>>> point, but the goal is to clean it up and move it into the Kudu
>>>> documentation proper.  I'm interested to hear what kind of time series you
>>>> are interested in Kudu for.  I'm tasked with improving Kudu for time
>>>> series, you can follow progress here
>>>> <https://issues.apache.org/jira/browse/KUDU-1306>. If you have any
>>>> additional ideas I'd love to hear them.  You may also be interested in a
>>>> small project that J-D and I have been working on in the past week to
>>>> build an OpenTSDB style store on top of Kudu, you can find it here
>>>> <https://github.com/danburkert/kudu-ts>.  Still quite feature limited
>>>> at this point.
>>>>
>>>> - Dan
>>>>
>>>> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sand.m.stone@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks. Will read.
>>>>>
>>>>> Given that I am researching time series data, row locality is crucial
>>>>> :-)
>>>>>
>>>>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <
>>>>> jdcryans@apache.org> wrote:
>>>>>
>>>>>> We do have non-covering range partitions coming in the next few
>>>>>> months, here's the design (in review):
>>>>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>>>>
>>>>>> The "Background & Motivation" section should give you a good idea of
>>>>>> why I'm mentioning this.
>>>>>>
>>>>>> Meanwhile, if you don't need row locality, using hash partitioning
>>>>>> could be good enough.
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Makes sense.
>>>>>>>
>>>>>>> Yeah it would be cool if users could specify/control the split rows
>>>>>>> after the table is created. Now, I have to "think ahead" to pre-create the
>>>>>>> range buckets.
>>>>>>>
>>>>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <
>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>
>>>>>>>> You will only get 1 tablet and no data distribution, which is bad.
>>>>>>>>
>>>>>>>> That's also how HBase works, but it will split regions as you
>>>>>>>> insert data and eventually you'll get some data distribution even if it
>>>>>>>> doesn't start in an ideal situation. Tablet splitting will come later for
>>>>>>>> Kudu.
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> One more question: how does the range partition work if I don't
>>>>>>>>> specify the split rows?
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>>>>>>>
>>>>>>>>>> I was just reading the Java API, CreateTableOptions.java; it's
>>>>>>>>>> unclear how the range partition column names are associated with the
>>>>>>>>>> partial rows params in the addSplitRow API.
>>>>>>>>>>
>>>>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Sand,
>>>>>>>>>>>
>>>>>>>>>>> Please have a look at
>>>>>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>>>>>> and see if it is helpful to you.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Misty
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <
>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I know
>>>>>>>>>>>> from some docs that this is currently for pre-creating the table. I am
>>>>>>>>>>>> researching how to partition (hash+range) some time series test data.
>>>>>>>>>>>>
>>>>>>>>>>>> Is there an example, or notes somewhere I could read up on?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks much.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
