pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mingda li <limingda1...@gmail.com>
Subject Re: How to test the efficiency of multiple join
Date Wed, 07 Dec 2016 18:43:50 GMT
So I need to test something like count (*)?  To use count, I need the
following query:

Bad_OrderIn = JOIN inventory BY  inv_item_sk, catalog_sales BY cs_item_sk;
Bad_OrderRes = JOIN Bad_OrderIn  BY   (cs_item_sk, cs_order_number),
catalog    _returns BY (cr_item_sk, cr_order_number);
b = foreach Bad_OrderRes generate cr_returned_date_sk, cr_returned_time_sk,
    cr_item_sk, cr_refunded_customer_sk, cs_sold_date_sk, cs_sold_time_sk,
cs_it    em_sk, cs_order_number, inv_date_sk, inv_item_sk,
inv_warehouse_sk, inv_quan    tity_on_hand;

b_group=GROUP b ALL;
b_count=FOREACH b_group GENERATE COUNT(b);
Dump b_count;

or use

ones = FOREACH b GENERATE 1 AS one:int;
counter_group = GROUP ones ALL;
log_count = FOREACH counter_group GENERATE COUNT(ones);
dump log_count

I think the second method can save more time.



On Wed, Dec 7, 2016 at 9:49 AM, Rohini Palaniswamy <rohini.aditya@gmail.com>
wrote:

> Limit 4 would make processing of join stop after 4 records. It is not a
> good idea to add it if you are testing performance of join.
>
> On Tue, Dec 6, 2016 at 8:13 PM mingda li <limingda1993@gmail.com> wrote:
>
> > Thanks for your quick reply. If so, I can use the limit operator to
> compare
> >
> > good and bad join plan. It takes time to dump all.
> >
> >
> >
> > Bests,
> >
> > Mingda
> >
> >
> >
> >
> >
> > On Tue, Dec 6, 2016 at 5:23 PM, Zhang, Liyun <liyun.zhang@intel.com>
> > wrote:
> >
> >
> >
> > > Hi:
> >
> > >    I think the query time about multiple join part is not related with
> > the
> >
> > > number of limit operator(in your case the number is 4). When the query
> is
> >
> > > executed, limit_data is executed after Bad_OrderRes, after join
> >
> > > (Bad_OrderRes) is finished, limit(limit_data) starts.
> >
> > > If I have missed something, please tell me.
> >
> > >
> >
> > >
> >
> > > Best Regards
> >
> > > Kelly Zhang/Zhang,Liyun
> >
> > >
> >
> > >
> >
> > >
> >
> > > -----Original Message-----
> >
> > > From: mingda li [mailto:limingda1993@gmail.com]
> >
> > > Sent: Wednesday, December 7, 2016 8:18 AM
> >
> > > To: dev@pig.apache.org; user@pig.apache.org
> >
> > > Subject: How to test the efficiency of multiple join
> >
> > >
> >
> > > Dear all,
> >
> > >
> >
> > > I want to test the different multiple join orders' efficiency. However,
> >
> > > since the pig query is executed lazily, I need to use dump or store to
> > let
> >
> > > the query be executed.
> >
> > >
> >
> > > Now, I use the following query to test the efficiency.
> >
> > >
> >
> > > *Bad_OrderIn = JOIN inventory BY  inv_item_sk, catalog_sales BY
> >
> > > cs_item_sk;*
> >
> > > *Bad_OrderRes = JOIN Bad_OrderIn  BY   (cs_item_sk, cs_order_number),
> >
> > > catalog_returns BY (cr_item_sk, cr_order_number);* *limit_data = LIMIT
> >
> > > Bad_OrderRes 4; * *Dump limit_data;*
> >
> > >
> >
> > > Do you think this is OK to just show 4 of results? Could this query
> >
> > > execution time represent the efficiency of multilpe join? I am not sure
> > if
> >
> > > it will just get 4 items and stop without executing other items.
> >
> > >
> >
> > > Bests,
> >
> > > Mingda
> >
> > >
> >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message