Subject: Re: optimize hive query for multitable join where one table is huge
From: Chinna Rao Lalam
To: user@hive.apache.org
Date: Fri, 28 Mar 2014 15:30:39 +0530

Hi,

Hive supports several join strategies: the common (shuffle) join, map join, bucket map join, and others. Please take a look at these pages; they may help you optimize your query:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization
https://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919

Hope it helps,
Chinna
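For the small tables in a query like this, a map join avoids the reduce-side shuffle entirely. A minimal sketch of the usual settings (the filesize threshold below is illustrative, not a recommendation):

    SET hive.auto.convert.join=true;               -- let Hive convert eligible joins to map joins
    SET hive.mapjoin.smalltable.filesize=25000000; -- size limit in bytes for the "small" side

    -- Or force a map join per query with a hint, e.g. for the tiny
    -- PRODUCT_CAT table from the query quoted below:
    SELECT /*+ MAPJOIN(pcat) */ p.PRODUCT_CAT_NO, pcat.STOCK_INFO_ID
    FROM PRODUCT p
    JOIN PRODUCT_CAT pcat ON p.PRODUCT_CAT_NO = pcat.PRODUCT_CAT_NO;

With auto conversion on, Hive builds a hash table from the small side and broadcasts it to the mappers, so no reducer is needed for that join.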
On Thu, Mar 27, 2014 at 4:24 AM, Srinivasan Ramaswamy wrote:

> I have a join query where I am joining huge tables, and I am trying to
> optimize this Hive query.
>
>     INSERT OVERWRITE TABLE result
>     SELECT /*+ STREAMTABLE(product) */
>     i.IMAGE_ID,
>     p.PRODUCT_NO,
>     p.STORE_NO,
>     p.PRODUCT_CAT_NO,
>     p.CAPTION,
>     p.PRODUCT_DESC,
>     p.IMAGE1_ID,
>     p.IMAGE2_ID,
>     s.STORE_ID,
>     s.STORE_NAME,
>     p.CREATE_DATE,
>     CASE WHEN custImg.IMAGE_ID IS NULL THEN 0 ELSE 1 END,
>     CASE WHEN custImg1.IMAGE_ID IS NULL THEN 0 ELSE 1 END,
>     CASE WHEN custImg2.IMAGE_ID IS NULL THEN 0 ELSE 1 END
>     FROM image i
>     JOIN PRODUCT p ON i.IMAGE_ID = p.IMAGE1_ID
>     JOIN PRODUCT_CAT pcat ON p.PRODUCT_CAT_NO = pcat.PRODUCT_CAT_NO
>     JOIN STORE s ON p.STORE_NO = s.STORE_NO
>     JOIN STOCK_INFO si ON si.STOCK_INFO_ID = pcat.STOCK_INFO_ID
>     LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg ON i.IMAGE_ID = custImg.IMAGE_ID
>     LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg1 ON p.IMAGE1_ID = custImg1.IMAGE_ID
>     LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg2 ON p.IMAGE2_ID = custImg2.IMAGE_ID;
>
> Here are some facts about the tables:
> image has 60 million rows
> product has 1 billion rows
> product_cat has 1,000 rows
> store has 1 million rows
> stock_info has 100 rows
> customizable_image has 200k rows
>
> A product can have one or two images (IMAGE1_ID and IMAGE2_ID), and
> product-level information is stored only in the product table. I tried
> moving the join with product to the bottom, but I couldn't, as all the
> following joins require data from the product table.
>
> Here is what I have tried so far:
> 1. I gave Hive the hint to stream the product table, since it is the
>    biggest one.
> 2. I bucketed the image and product tables (at CREATE TABLE time) into
>    256 buckets on image_id and then did the join; it didn't give me any
>    significant performance gain.
> 3. I changed the input format from text files (gzip) to sequence files,
>    so that the input is splittable and Hive can run more mappers if it
>    wants to.
>
> The query is still taking longer than 5 hours in Hive (running on AWS
> with 3 large nodes), whereas in an RDBMS it takes only 5 hrs. I need
> some help optimizing this query so that it executes much faster. What
> else can I try? Does partitioning the tables help improve join
> performance?
>
> This brings me to the question: is Hive even the right choice (compared
> to an RDBMS) for such complex joins?
>
> Thanks
> Srini
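A note on point 2 above: bucketing the tables by itself does not turn the join into a bucket map join; the optimization has to be switched on explicitly, and both sides must be bucketed on the join key with compatible bucket counts. A minimal sketch under those assumptions (the table names, columns, and types here are illustrative, not the actual schema):

    SET hive.enforce.bucketing=true;       -- make INSERTs honor the bucket spec

    -- Both join sides bucketed on the join key, 256 buckets each.
    CREATE TABLE image_b (image_id BIGINT)
        CLUSTERED BY (image_id) INTO 256 BUCKETS
        STORED AS SEQUENCEFILE;

    CREATE TABLE product_b (product_no BIGINT, image1_id BIGINT)
        CLUSTERED BY (image1_id) INTO 256 BUCKETS
        STORED AS SEQUENCEFILE;

    INSERT OVERWRITE TABLE image_b SELECT IMAGE_ID FROM image;
    INSERT OVERWRITE TABLE product_b SELECT PRODUCT_NO, IMAGE1_ID FROM product;

    SET hive.optimize.bucketmapjoin=true;  -- hash only the matching bucket per mapper
    SELECT /*+ MAPJOIN(i) */ p.product_no, i.image_id
    FROM product_b p
    JOIN image_b i ON i.image_id = p.image1_id;

With this in place, each mapper loads a single bucket of image_b (roughly 1/256 of the table) instead of the whole table, which is what makes a map join feasible against a 60-million-row side.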