impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mostafa Mokhtar <>
Subject Re: performance issue on big table join
Date Fri, 27 Oct 2017 04:02:09 GMT

Looks like you are joining store_sales with catalog_sales on item_sk, this
kind of join condition is a many to many, which means the output number of
rows will be much larger then input number of rows, not sure if this is

Also did you run "compute stats [TABLE_NAME]" on both tables?

For a more comprehensive query try TPCDS Q17

select  i_item_id



       ,count(ss_quantity) as store_sales_quantitycount

       ,avg(ss_quantity) as store_sales_quantityave

       ,stddev_samp(ss_quantity) as store_sales_quantitystdev

       ,stddev_samp(ss_quantity)/avg(ss_quantity) as store_sales_quantitycov

       ,count(sr_return_quantity) as store_returns_quantitycount

       ,avg(sr_return_quantity) as store_returns_quantityave

       ,stddev_samp(sr_return_quantity) as store_returns_quantitystdev

       ,stddev_samp(sr_return_quantity)/avg(sr_return_quantity) as

       ,count(cs_quantity) as catalog_sales_quantitycount
,avg(cs_quantity) as catalog_sales_quantityave

       ,stddev_samp(cs_quantity) as catalog_sales_quantitystdev

       ,stddev_samp(cs_quantity)/avg(cs_quantity) as catalog_sales_quantitycov

 from store_sales



     ,date_dim d1

     ,date_dim d2

     ,date_dim d3



 where d1.d_quarter_name = '2000Q1'

   and d1.d_date_sk = ss_sold_date_sk

   and i_item_sk = ss_item_sk

   and s_store_sk = ss_store_sk

   and ss_customer_sk = sr_customer_sk

   and ss_item_sk = sr_item_sk

   and ss_ticket_number = sr_ticket_number

   and sr_returned_date_sk = d2.d_date_sk

   and d2.d_quarter_name in ('2000Q1','2000Q2','2000Q3')

   and sr_customer_sk = cs_bill_customer_sk

   and sr_item_sk = cs_item_sk

   and cs_sold_date_sk = d3.d_date_sk

   and d3.d_quarter_name in ('2000Q1','2000Q2','2000Q3')

 group by i_item_id



 order by i_item_id



limit 100;

I recommend moving this kind of discussion on

On Thu, Oct 26, 2017 at 7:25 PM, 俊杰陈 <> wrote:

> The profile file is damaged. Here is a screenshot for exec summary
> ​
> 2017-10-27 10:04 GMT+08:00 俊杰陈 <>:
>> Hi Devs
>> I met a performance issue on big table join. The query takes more than 3
>> hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes
>> cluster. when running query,  the left scanner and exchange node are very
>> slow.  Did I miss some key arguments?
>> you can see profile file in attachment.
>> ​
>> --
>> Thanks & Best Regards
> --
> Thanks & Best Regards

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message