Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 2F488EA43 for ; Thu, 22 Nov 2012 15:07:54 +0000 (UTC) Received: (qmail 37999 invoked by uid 500); 22 Nov 2012 15:07:52 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 37896 invoked by uid 500); 22 Nov 2012 15:07:52 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 37869 invoked by uid 99); 22 Nov 2012 15:07:51 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 22 Nov 2012 15:07:51 +0000 X-ASF-Spam-Status: No, hits=2.2 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of dimad@microsoft.com designates 207.46.100.23 as permitted sender) Received: from [207.46.100.23] (HELO NA01-BY2-obe.outbound.protection.outlook.com) (207.46.100.23) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 22 Nov 2012 15:07:42 +0000 Received: from BY2FFO11FD007.protection.gbl (10.1.15.202) by BY2FFO11HUB035.protection.gbl (10.1.14.119) with Microsoft SMTP Server (TLS) id 15.0.556.9; Thu, 22 Nov 2012 15:07:17 +0000 Received: from TK5EX14MLTC102.redmond.corp.microsoft.com (131.107.125.37) by BY2FFO11FD007.mail.protection.outlook.com (10.1.14.128) with Microsoft SMTP Server (TLS) id 15.0.556.9 via Frontend Transport; Thu, 22 Nov 2012 15:07:17 +0000 Received: from DB3EX14HUBC302.europe.corp.microsoft.com (10.166.18.182) by TK5EX14MLTC102.redmond.corp.microsoft.com (157.54.79.180) with Microsoft SMTP Server (TLS) id 14.2.318.3; Thu, 22 Nov 2012 15:06:47 +0000 Received: from DB3EX14MBXC316.europe.corp.microsoft.com ([169.254.16.34]) by DB3EX14HUBC302.europe.corp.microsoft.com ([10.166.18.182]) with mapi id 14.02.0309.003; Thu, 22 Nov 2012 15:06:45 +0000 From: Dima Datsenko To: Bennie Schut , "user@hive.apache.org" Subject: RE: Effecient partitions usage in join Thread-Topic: Effecient partitions usage in join Thread-Index: Ac3Itp7vKQ0i7EFESLirkoP8vQengQAAoWLgAAC37MAAAa2F0A== Date: Thu, 22 Nov 2012 15:06:45 +0000 Message-ID: <2BBBD93B295A4442A5B731B34A80C5622AA72A8E@DB3EX14MBXC316.europe.corp.microsoft.com> References: <2BBBD93B295A4442A5B731B34A80C5622AA72A09@DB3EX14MBXC316.europe.corp.microsoft.com> In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [10.166.18.94] Content-Type: multipart/alternative; boundary="_000_2BBBD93B295A4442A5B731B34A80C5622AA72A8EDB3EX14MBXC316e_" MIME-Version: 1.0 X-Forefront-Antispam-Report: CIP:131.107.125.37;CTRY:US;IPV:CAL;IPV:NLI;EFV:NLI;SFV:NSPM;SFS:(164054002)(54094002)(377454001)(33656001)(5343635001)(47736001)(47446002)(50986001)(74662001)(74502001)(54356001)(4396001)(5343655001)(53806001)(46102001)(31966008)(44976002)(76482001)(56816002)(56776001)(51856001)(47976001)(16406001)(49866001)(54316002)(55846005);DIR:OUT;SFP:;LANG:en; X-OriginatorOrg: microsoft.onmicrosoft.com X-Forefront-PRVS: 0673F5BE31 X-Virus-Checked: Checked by ClamAV on apache.org --_000_2BBBD93B295A4442A5B731B34A80C5622AA72A8EDB3EX14MBXC316e_ Content-Type: text/plain; charset="windows-1255" Content-Transfer-Encoding: quoted-printable Hi Benny, The udf solution sounds like a plan. Much better than generating hive query= with hardcoded partition out of table B. Can you please provide a sample o= f what you=92re doing there? Thanks, Dima From: Bennie Schut [mailto:bschut@ebuddy.com] Sent: =E9=E5=ED =E4 22 =F0=E5=E1=EE=E1=F8 2012 16:28 To: user@hive.apache.org Cc: Dima Datsenko Subject: RE: Effecient partitions usage in join Unfortunately at the moment partition pruning is a bit limited in hive. Whe= n hive creates the query plan it decides what partitions to use. So if you = put hardcoded list of partition_id items in the where clause it will know w= hat to do. In the case of a join (or a subquery) it would have to run the q= uery before it can know what it can prune. There are obvious solutions to = this but they are simply not implemented at the moment. Generally speaking people try to work around this by not normalizing the da= ta. So if you plan on doing a clean star schema with a calendar table then = do yourself a favor and but the actual date in the fact table and not a mea= ningless key. It=92s also good to realize you can (in some special cases) work around it = by using udf=92s. I=92ve used it once by creating a udf which produced the = current date which I flagged as deterministic (ugly I know). This causes th= e planner to run the udf during planning and use the result as if it=92s a = constant and thus partition pruning works again. It=92s currently the only = way I know to select x days of data with partition pruning working. From: Dima Datsenko [mailto:dimad@microsoft.com] Sent: Thursday, November 22, 2012 2:56 PM To: user@hive.apache.org Subject: Effecient partitions usage in join Hi Guys, I wonder if you could help me. I have a huge Hive table partitioned by some field. It has thousands of par= titions. Now I have another small table containing tens of partitions id. I=92d like= to get the data only from those partitions. However when I run Select * from A join B on (A.partition_id =3D B.partition_id), It reads all data from A, then from B and on reduce stage performs join. I tried /*+ MAPJOIN*/ it ran faster sparing reduce operation, but still rea= d the whole A table. Is there a more efficient way to perform the query w/o reading the whole A = content? Thanks Dima --_000_2BBBD93B295A4442A5B731B34A80C5622AA72A8EDB3EX14MBXC316e_ Content-Type: text/html; charset="windows-1255" Content-Transfer-Encoding: quoted-printable

Hi Benny,

 

The udf solution sound= s like a plan. Much better than generating hive query with hardcoded partit= ion out of table B. Can you please provide a sample of what you=92re doing = there?

 

Thanks,

Dima=

 

From: Bennie S= chut [mailto:bschut@ebuddy.com]
Sent: =E9=E5=ED =E4 22 =F0=E5=E1=EE=E1=F8 2012 16:28
To: user@hive.apache.org
Cc: Dima Datsenko
Subject: RE: Effecient partitions usage in join

 

Unfortunately at the m= oment partition pruning is a bit limited in hive. When hive creates the que= ry plan it decides what partitions to use. So if you put hardcoded list of = partition_id items in the where clause it will know what to do. In the case of a join (or a subquery) it would ha= ve to run the query before it can know what it can prune.  There are o= bvious solutions to this but they are simply not implemented at the moment.=

Generally speaking peo= ple try to work around this by not normalizing the data. So if you plan on = doing a clean star schema with a calendar table then do yourself a favor an= d but the actual date in the fact table and not a meaningless key.

It=92s also good to re= alize you can (in some special cases) work around it by using udf=92s. I=92= ve used it once by creating a udf which produced the current date which I f= lagged as deterministic (ugly I know). This causes the planner to run the udf during planning and use the result as if= it=92s a constant and thus partition pruning works again. It=92s currently= the only way I know to select x days of data with partition pruning workin= g.

 

 

From: Dima Dat= senko [mailto:dimad@microsoft.com]
Sent: Thursday, November 22, 2012 2:56 PM
To:
user@hive.apache.org=
Subject: Effecient partitions usage in join

 

Hi Guys,

 

I wonder if you could help me.

 

I have a huge Hive table partitioned by some field. = It has thousands of partitions.

Now I have another small table containing tens of pa= rtitions id. I=92d like to get the data only from those partitions.

 

However when I run

Select * from A join B on (A.partition_id =3D B.part= ition_id),

It reads all data from A, then from B and on reduce = stage performs join.

 

I tried /*+ MAPJOIN*/ it ran faster sparing reduce operation, but still read th= e whole A table.

 

Is there a more efficient way to perform the query w= /o reading the whole A content?

 

 

Thanks

Dima

--_000_2BBBD93B295A4442A5B731B34A80C5622AA72A8EDB3EX14MBXC316e_--