kudu-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Todd Lipcon <t...@cloudera.com>
Subject Re: kudu table design question
Date Fri, 24 Feb 2017 02:29:18 GMT
I'd add that moving the print_date_id to the beginning of the primary key
in the Kudu fact table would allow each server to do a range scan instead
of a full scan.

-Todd

On Thu, Feb 23, 2017 at 5:40 PM, Dan Burkert <dan@cloudera.com> wrote:

> Hi Tenny,
>
> First off, how many tablet servers are in your cluster?  16 partitions is
> appropriate for one or maybe two tablet servers, so if your cluster is
> bigger you could try bumping the number of partitions.
>
> Second, the schemas don't look identical, you have an additional 'id'
> column in the Kudu table, and crucially, it doesn't have any predicates, so
> this query is doing a full table scan.
>
> Finally, the Parquet table is likely able to take advantage of significant
> partition pruning due to the between clause.  An equivalent in Kudu would
> be range partitioning on the print_date_id.  You might try doing the same
> for Kudu.
>
> - Dan
>
> On Thu, Feb 23, 2017 at 5:08 PM, tenny susanto <tennysusanto@gmail.com>
> wrote:
>
>> I have a table (call this fact_table)  that I want to create in kudu.
>>
>> I have an equivalent table in impala/parquet that is partitioned by
>> day_id.
>>
>> create table impala_fact_table (
>> company_id INT,
>> transcount INT)
>> partitioned by
>> (print_date_id INT)
>> STORED AS PARQUET;
>>
>> so a common query would be:
>>
>> select  sum(transcount)
>> from impala_fact_table f
>> join with company_dim c on f.company_id = c.company_id
>> where c.company_id in (123,456)
>> and f.print_date_id between 20170101 and 20170202
>>
>> I created an equivalent of the fact table in kudu:
>>
>> CREATE TABLE kudu_fact_table  (
>> id STRING,
>> print_date_id,
>> company_id INT,
>> transcount INT)
>> PRIMARY KEY(id,print_date_id)
>> ) PARTITION BY HASH PARTITIONS 16
>> )
>> STORED AS KUDU
>> TBLPROPERTIES(
>>   'kudu.table_name' = 'kudu_fact_table',
>>   'kudu.master_addresses' = 'myserver:7051'
>> );
>>
>> But the performance of the join with this kudu table is terrible, 2 secs
>> with impala table vs 126 secs with kudu table.
>>
>> select  sum(transcount)
>> from kudu_fact_table f
>> join with company_dim c on f.company_id = c.company_id
>> where c.company_id in (123,456)
>> and f.print_date_id between 20170101 and 20170202
>>
>> How should I design my kudu table so performance is somewhat comparable?
>>
>>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Mime
View raw message