kudu-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dan Burkert <...@cloudera.com>
Subject Re: kudu table design question
Date Fri, 24 Feb 2017 21:34:12 GMT
Hi Tenny,

1000 partitions is on the upper end of what I'd recommend - with 3x
replication that's 125 tablet replicas per tablet server (something more
like 20 or 30 would be ideal depending on hardware).  How much data does
each day have?  I would aim for tablet size on the order of 50GiB, so if
it's not that much per day you could try making week or month wide
partitions.  Just bumping the number of partitions and being able to take
advantage of partition pruning should improve the performance tremendously.

In the next release we're adding support for pushdown IN list predicates,
which could help your query even more if you could put company_id as the
first component of your primary key.  That being said, I think improved
range partition will likely give the most dramatic improvements, and
there's no need to wait.

Week wide range partitions can be specified like:

 PARTITION 20170101 <= VALUES < 20170108,
 PARTITION 20170108 <= VALUES < 20170115,
 ...


- Dan

On Fri, Feb 24, 2017 at 1:15 PM, tenny susanto <tennysusanto@gmail.com>
wrote:

> I have 24 tablet servers.
>
> I added an id column because I needed a unique column to be the primary
> key as kudu required primary key to be specified.  My original table
> actually has 20 columns with no single primary key column. I concatenated 5
> of them to build a unique id column which I made it as part of the primary
> key. I have tried specifying 5 columns to be the primary key but I noticed
> the inserts were much slower, so I tried with just 2 columns as primary key
> instead, seems to improve insert speed.
>
> So this is my new schema and will measure query speed with it. If I
> partition by day, is 1000 partitions too many? What is the recommended
> maximum limit in the number of partitions kudu can handle?
>
> CREATE TABLE kudu_fact_table  (
> print_date_id,
> id STRING,
> company_id INT,
> transcount INT)
> PRIMARY KEY(print_date_id,id)
> ) PARTITION BY RANGE (print_date_id)
> (
>   PARTITION VALUE = 20170101,
>   PARTITION VALUE = 20170102 ... (1 partition for each day, and I have 3
> year's worth of data)
>  )
> STORED AS KUDU
> TBLPROPERTIES(
>   'kudu.table_name' = 'kudu_fact_table',
>   'kudu.master_addresses' = 'myserver:7051'
> );
>
>
>
> On Thu, Feb 23, 2017 at 6:29 PM, Todd Lipcon <todd@cloudera.com> wrote:
>
>> I'd add that moving the print_date_id to the beginning of the primary key
>> in the Kudu fact table would allow each server to do a range scan instead
>> of a full scan.
>>
>> -Todd
>>
>> On Thu, Feb 23, 2017 at 5:40 PM, Dan Burkert <dan@cloudera.com> wrote:
>>
>>> Hi Tenny,
>>>
>>> First off, how many tablet servers are in your cluster?  16 partitions
>>> is appropriate for one or maybe two tablet servers, so if your cluster is
>>> bigger you could try bumping the number of partitions.
>>>
>>> Second, the schemas don't look identical, you have an additional 'id'
>>> column in the Kudu table, and crucially, it doesn't have any predicates, so
>>> this query is doing a full table scan.
>>>
>>> Finally, the Parquet table is likely able to take advantage of
>>> significant partition pruning due to the between clause.  An equivalent in
>>> Kudu would be range partitioning on the print_date_id.  You might try doing
>>> the same for Kudu.
>>>
>>> - Dan
>>>
>>> On Thu, Feb 23, 2017 at 5:08 PM, tenny susanto <tennysusanto@gmail.com>
>>> wrote:
>>>
>>>> I have a table (call this fact_table)  that I want to create in kudu.
>>>>
>>>> I have an equivalent table in impala/parquet that is partitioned by
>>>> day_id.
>>>>
>>>> create table impala_fact_table (
>>>> company_id INT,
>>>> transcount INT)
>>>> partitioned by
>>>> (print_date_id INT)
>>>> STORED AS PARQUET;
>>>>
>>>> so a common query would be:
>>>>
>>>> select  sum(transcount)
>>>> from impala_fact_table f
>>>> join with company_dim c on f.company_id = c.company_id
>>>> where c.company_id in (123,456)
>>>> and f.print_date_id between 20170101 and 20170202
>>>>
>>>> I created an equivalent of the fact table in kudu:
>>>>
>>>> CREATE TABLE kudu_fact_table  (
>>>> id STRING,
>>>> print_date_id,
>>>> company_id INT,
>>>> transcount INT)
>>>> PRIMARY KEY(id,print_date_id)
>>>> ) PARTITION BY HASH PARTITIONS 16
>>>> )
>>>> STORED AS KUDU
>>>> TBLPROPERTIES(
>>>>   'kudu.table_name' = 'kudu_fact_table',
>>>>   'kudu.master_addresses' = 'myserver:7051'
>>>> );
>>>>
>>>> But the performance of the join with this kudu table is terrible, 2
>>>> secs with impala table vs 126 secs with kudu table.
>>>>
>>>> select  sum(transcount)
>>>> from kudu_fact_table f
>>>> join with company_dim c on f.company_id = c.company_id
>>>> where c.company_id in (123,456)
>>>> and f.print_date_id between 20170101 and 20170202
>>>>
>>>> How should I design my kudu table so performance is somewhat comparable?
>>>>
>>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
>
> --
> Regards,
>
> Tenny Susanto
>
>

Mime
View raw message