Subject: Re: optimize hive query for multitable join where one table is huge
From: Chinna Rao Lalam
To: user@hive.apache.org
Date: Fri, 28 Mar 2014 15:30:39 +0530

Hi,

Hive supports several join strategies: the common (shuffle) join, map join, bucket map join, and others. Please take a look at these pages; they may help you optimize your query:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization
https://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919

Hope it helps,
Chinna
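For the small tables in a query like this, a map join avoids the reduce-side shuffle entirely. A minimal sketch of the usual settings (the filesize threshold below is illustrative, not a recommendation):

    SET hive.auto.convert.join=true;               -- let Hive convert eligible joins to map joins
    SET hive.mapjoin.smalltable.filesize=25000000; -- size limit in bytes for the "small" side

    -- Or force a map join per query with a hint, e.g. for the tiny
    -- PRODUCT_CAT table from the query quoted below:
    SELECT /*+ MAPJOIN(pcat) */ p.PRODUCT_CAT_NO, pcat.STOCK_INFO_ID
    FROM PRODUCT p
    JOIN PRODUCT_CAT pcat ON p.PRODUCT_CAT_NO = pcat.PRODUCT_CAT_NO;

With auto conversion on, Hive builds a hash table from the small side and broadcasts it to the mappers, so no reducer is needed for that join.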
On Thu, Mar 27, 2014 at 4:24 AM, Srinivasan Ramaswamy wrote:

> I have a join query where I am joining huge tables, and I am trying to
> optimize this Hive query.
>
>     INSERT OVERWRITE TABLE result
>     SELECT /*+ STREAMTABLE(product) */
>     i.IMAGE_ID,
>     p.PRODUCT_NO,
>     p.STORE_NO,
>     p.PRODUCT_CAT_NO,
>     p.CAPTION,
>     p.PRODUCT_DESC,
>     p.IMAGE1_ID,
>     p.IMAGE2_ID,
>     s.STORE_ID,
>     s.STORE_NAME,
>     p.CREATE_DATE,
>     CASE WHEN custImg.IMAGE_ID IS NULL THEN 0 ELSE 1 END,
>     CASE WHEN custImg1.IMAGE_ID IS NULL THEN 0 ELSE 1 END,
>     CASE WHEN custImg2.IMAGE_ID IS NULL THEN 0 ELSE 1 END
>     FROM image i
>     JOIN PRODUCT p ON i.IMAGE_ID = p.IMAGE1_ID
>     JOIN PRODUCT_CAT pcat ON p.PRODUCT_CAT_NO = pcat.PRODUCT_CAT_NO
>     JOIN STORE s ON p.STORE_NO = s.STORE_NO
>     JOIN STOCK_INFO si ON si.STOCK_INFO_ID = pcat.STOCK_INFO_ID
>     LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg ON i.IMAGE_ID = custImg.IMAGE_ID
>     LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg1 ON p.IMAGE1_ID = custImg1.IMAGE_ID
>     LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg2 ON p.IMAGE2_ID = custImg2.IMAGE_ID;
>
> Here are some facts about the tables:
> image has 60 million rows
> product has 1 billion rows
> product_cat has 1,000 rows
> store has 1 million rows
> stock_info has 100 rows
> customizable_image has 200k rows
>
> A product can have one or two images (IMAGE1_ID and IMAGE2_ID), and
> product-level information is stored only in the product table. I tried
> moving the join with product to the bottom, but I couldn't, as all the
> following joins require data from the product table.
>
> Here is what I have tried so far:
> 1. I gave Hive the hint to stream the product table, since it is the
>    biggest one.
> 2. I bucketed the image and product tables (at CREATE TABLE time) into
>    256 buckets on image_id and then did the join; it didn't give me any
>    significant performance gain.
> 3. I changed the input format from text files (gzip) to sequence files,
>    so that the input is splittable and Hive can run more mappers if it
>    wants to.
>
> The query is still taking longer than 5 hours in Hive (running on AWS
> with 3 large nodes), whereas in an RDBMS it takes only 5 hrs. I need
> some help optimizing this query so that it executes much faster. What
> else can I try? Does partitioning the tables help improve join
> performance?
>
> This brings me to the question: is Hive even the right choice (compared
> to an RDBMS) for such complex joins?
>
> Thanks
> Srini
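A note on point 2 above: bucketing the tables by itself does not turn the join into a bucket map join; the optimization has to be switched on explicitly, and both sides must be bucketed on the join key with compatible bucket counts. A minimal sketch under those assumptions (the table names, columns, and types here are illustrative, not the actual schema):

    SET hive.enforce.bucketing=true;       -- make INSERTs honor the bucket spec

    -- Both join sides bucketed on the join key, 256 buckets each.
    CREATE TABLE image_b (image_id BIGINT)
        CLUSTERED BY (image_id) INTO 256 BUCKETS
        STORED AS SEQUENCEFILE;

    CREATE TABLE product_b (product_no BIGINT, image1_id BIGINT)
        CLUSTERED BY (image1_id) INTO 256 BUCKETS
        STORED AS SEQUENCEFILE;

    INSERT OVERWRITE TABLE image_b SELECT IMAGE_ID FROM image;
    INSERT OVERWRITE TABLE product_b SELECT PRODUCT_NO, IMAGE1_ID FROM product;

    SET hive.optimize.bucketmapjoin=true;  -- hash only the matching bucket per mapper
    SELECT /*+ MAPJOIN(i) */ p.product_no, i.image_id
    FROM product_b p
    JOIN image_b i ON i.image_id = p.image1_id;

With this in place, each mapper loads a single bucket of image_b (roughly 1/256 of the table) instead of the whole table, which is what makes a map join feasible against a 60-million-row side.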