kudu-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dan Burkert <...@cloudera.com>
Subject Re: kudu table design question
Date Fri, 24 Feb 2017 01:40:02 GMT
Hi Tenny,

First off, how many tablet servers are in your cluster?  16 partitions is
appropriate for one or maybe two tablet servers, so if your cluster is
bigger you could try bumping the number of partitions.

Second, the schemas don't look identical, you have an additional 'id'
column in the Kudu table, and crucially, it doesn't have any predicates, so
this query is doing a full table scan.

Finally, the Parquet table is likely able to take advantage of significant
partition pruning due to the between clause.  An equivalent in Kudu would
be range partitioning on the print_date_id.  You might try doing the same
for Kudu.

- Dan

On Thu, Feb 23, 2017 at 5:08 PM, tenny susanto <tennysusanto@gmail.com>
wrote:

> I have a table (call this fact_table)  that I want to create in kudu.
>
> I have an equivalent table in impala/parquet that is partitioned by
> day_id.
>
> create table impala_fact_table (
> company_id INT,
> transcount INT)
> partitioned by
> (print_date_id INT)
> STORED AS PARQUET;
>
> so a common query would be:
>
> select  sum(transcount)
> from impala_fact_table f
> join with company_dim c on f.company_id = c.company_id
> where c.company_id in (123,456)
> and f.print_date_id between 20170101 and 20170202
>
> I created an equivalent of the fact table in kudu:
>
> CREATE TABLE kudu_fact_table  (
> id STRING,
> print_date_id,
> company_id INT,
> transcount INT)
> PRIMARY KEY(id,print_date_id)
> ) PARTITION BY HASH PARTITIONS 16
> )
> STORED AS KUDU
> TBLPROPERTIES(
>   'kudu.table_name' = 'kudu_fact_table',
>   'kudu.master_addresses' = 'myserver:7051'
> );
>
> But the performance of the join with this kudu table is terrible, 2 secs
> with impala table vs 126 secs with kudu table.
>
> select  sum(transcount)
> from kudu_fact_table f
> join with company_dim c on f.company_id = c.company_id
> where c.company_id in (123,456)
> and f.print_date_id between 20170101 and 20170202
>
> How should I design my kudu table so performance is somewhat comparable?
>
>

Mime
View raw message