Thanks for sharing, Dan. The diagrams explained clearly how the current system works. 

As for what I have in mind: take a schema of <host,metric,time,...>. Say I am interested in data for the past 5 mins, 10 mins, etc., or in aggregates at 5-min intervals over the past 3 days, 7 days, ... It looks like I need to introduce a special 5-min bar column and range partition on that column to spread data across the tablet servers, so that I can leverage parallel filtering. 

The cost of this extra column (INT8) is not ideal but not too bad either (storage cost-wise, compression should do wonders). So I am wondering whether it would be better to accept "functions" as row splits instead of only constants. Of course, if the business requires dropping down to a 1-min bar, the data has to be re-sharded. So a more cost-effective way of doing this on a production cluster would be good. 
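
To make the bar column idea concrete, here is a minimal sketch of how the column value and a set of constant split values could be derived. It is not Kudu API code: `bar_index` and `bar_split_points` are hypothetical helpers, and timestamps as integer epoch seconds are an assumption.

```python
def bar_index(epoch_seconds: int, bar_seconds: int = 300) -> int:
    """Index of the fixed-width bar that contains the given timestamp."""
    return epoch_seconds // bar_seconds


def bar_split_points(start_epoch: int, end_epoch: int, num_tablets: int,
                     bar_seconds: int = 300) -> list:
    """Evenly spaced bar indexes to use as constant range split values."""
    first = bar_index(start_epoch, bar_seconds)
    last = bar_index(end_epoch, bar_seconds)
    step = (last - first) / num_tablets
    # num_tablets tablets need num_tablets - 1 split values.
    return [first + round(step * i) for i in range(1, num_tablets)]


three_days = 3 * 24 * 60 * 60
print(bar_split_points(0, three_days, 4))   # split values for 5-min bars
print(bar_index(900), bar_index(900, 60))   # same timestamp, 5-min vs 1-min bar
```

The same timestamp lands in bar 3 at 5-min width and bar 15 at 1-min width, so changing the width means recomputing the column for every row. That is the re-sharding cost mentioned above.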




On Sat, May 7, 2016 at 8:50 AM, Dan Burkert <dan@cloudera.com> wrote:
Hi Sand,

I've been working on some diagrams to help explain some of the more advanced partitioning types; they're attached. They're still pretty rough at this point, but the goal is to clean them up and move them into the Kudu documentation proper. I'm interested to hear what kind of time series you are interested in Kudu for. I'm tasked with improving Kudu for time series; you can follow progress here. If you have any additional ideas I'd love to hear them. You may also be interested in a small project that JD and I have been working on in the past week to build an OpenTSDB-style store on top of Kudu; you can find it here. It's still quite feature-limited at this point.

- Dan

On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sand.m.stone@gmail.com> wrote:
Thanks. Will read. 

Given that I am researching time series data, row locality is crucial :-)  

On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
We do have non-covering range partitions coming in the next few months, here's the design (in review): http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md

The "Background & Motivation" section should give you a good idea of why I'm mentioning this.

Meanwhile, if you don't need row locality, using hash partitioning could be good enough.
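
As a rough illustration of that trade-off, the sketch below assigns rows to hash buckets. `hash_bucket` is a hypothetical helper, the `host|metric|time` key layout is an assumption, and CRC32 is only a stand-in for whatever hash function Kudu uses internally.

```python
import zlib


def hash_bucket(key: str, num_buckets: int) -> int:
    """Map a key to a bucket; CRC32 stands in for the real hash function."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


# Consecutive timestamps for one host/metric, hashed on the full key.
keys = ["host1|cpu|%d" % t for t in range(8)]
for key in keys:
    print(key, "-> bucket", hash_bucket(key, 4))
```

Neighboring timestamps end up in different buckets, which spreads the write load but means a time-range scan has to touch every bucket.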

J-D

On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sand.m.stone@gmail.com> wrote:
Makes sense. 

Yeah it would be cool if users could specify/control the split rows after the table is created. Now, I have to "think ahead" to pre-create the range buckets. 

On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
You will only get 1 tablet and no data distribution, which is bad.

That's also how HBase works, but it will split regions as you insert data and eventually you'll get some data distribution even if it doesn't start in an ideal situation. Tablet splitting will come later for Kudu.
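
A small simulation of why that happens is below. It is not the Kudu implementation: `tablet_for` is a hypothetical helper, and it assumes each split value is the inclusive lower bound of the next tablet.

```python
from bisect import bisect_right


def tablet_for(key: int, split_points: list) -> int:
    """Index of the tablet whose range contains key.

    split_points must be sorted; N split points define N + 1 tablets.
    """
    return bisect_right(split_points, key)


# No split rows: a single tablet covers the whole key space.
print([tablet_for(k, []) for k in (1, 1000, 10**9)])

# Two split rows: three tablets, and keys spread across them.
print([tablet_for(k, [10, 20]) for k in (9, 10, 25)])
```

With an empty split list every key maps to tablet 0, so all reads and writes go to one place regardless of how much data is inserted.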

J-D

On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sand.m.stone@gmail.com> wrote:
One more question: how does range partitioning work if I don't specify the split rows? 

Thanks! 

On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sand.m.stone@gmail.com> wrote:
Thanks, Misty. The "advanced" impala example helped. 

I was just reading the Java API, CreateTableOptions.java, and it's unclear how the range partition column names are associated with the partial row params in the addSplitRow API.

On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <mstanleyjones@cloudera.com> wrote:
Hi Sand,

Please have a look at http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables and see if it is helpful to you.

Thanks,
Misty

On Fri, May 6, 2016 at 2:00 PM, Sand Stone <sand.m.stone@gmail.com> wrote:
Hi, I am new to Kudu. I wonder how split rows work. I know from some docs that they are currently used only when pre-creating the table. I am researching how to partition (hash+range) some time series test data. 

Is there an example, or notes somewhere I could read up on? 

Thanks much.