Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 4969CDFCB for ; Thu, 13 Dec 2012 05:31:10 +0000 (UTC) Received: (qmail 25121 invoked by uid 500); 13 Dec 2012 05:31:08 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 24909 invoked by uid 500); 13 Dec 2012 05:31:07 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 24878 invoked by uid 99); 13 Dec 2012 05:31:06 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 13 Dec 2012 05:31:06 +0000 X-ASF-Spam-Status: No, hits=1.7 required=5.0 tests=FREEMAIL_ENVFROM_END_DIGIT,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of nitinpawar432@gmail.com designates 209.85.215.48 as permitted sender) Received: from [209.85.215.48] (HELO mail-la0-f48.google.com) (209.85.215.48) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 13 Dec 2012 05:30:59 +0000 Received: by mail-la0-f48.google.com with SMTP id m13so1397106lah.35 for ; Wed, 12 Dec 2012 21:30:38 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=v+nfhgaCTuvtthVWSoC8oildEv+Z/2zbTAzR3+I0C+g=; b=imToGGWc8RfwRhXI3cbz3h4lKeXsbCrzz5cn92cL2xA01maX16OKrP32k/wjyv4PfH ft+z65NgSgCRKok7EuMDS5yKXNwp0zKuV7C770/QZ7v/z+f9wrmgCOVDK6IOdckSk+Oq zvJsLQqe5NDD42B/wFstFGRoeYMi8sDnHwhs9Limw96iB0ChBry83CZxGLvJlc+OdNOU WLVay2Dsn6KAW27EAHwT7sXTDGPBxcodhxgIJN5wovsI0ILRN1cvCrkzgIVizUr0ZPBG GGY14oIDH6MxLmYShCmLV1BurZ5+cxvau9+V/s1TYUfsCiqRT032POluCyUU7F2/gIk8 jhjA== MIME-Version: 1.0 Received: by 10.112.47.42 with SMTP id a10mr417521lbn.30.1355376638268; Wed, 12 Dec 2012 21:30:38 -0800 (PST) Received: by 10.112.112.164 with HTTP; Wed, 12 Dec 2012 21:30:38 -0800 (PST) In-Reply-To: References: Date: Thu, 13 Dec 2012 11:00:38 +0530 Message-ID: Subject: Re: map side join with group by From: Nitin Pawar To: user@hive.apache.org Content-Type: multipart/alternative; boundary=bcaec554d7260f973804d0b5375a X-Virus-Checked: Checked by ClamAV on apache.org --bcaec554d7260f973804d0b5375a Content-Type: text/plain; charset=ISO-8859-1 I think Chen wanted to know why this is two phased query if I understood it correctly When you run a mapside join .. it just performs the join query .. after that to execute the group by part it launches the second job. I may be wrong but this is how I saw it whenever I executed group by queries On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover wrote: > Hi Chen, > I think we would need some more information. > > The query is referring to a table called "d" in the MAPJOIN hint but > there is not such table in the query. Moreover, Map joins only make > sense when the right table is the one being "mapped" (in other words, > being kept in memory) in case of a Left Outer Join, similarly if the > left table is the one being "mapped" in case of a Right Outer Join. > Let me know if this is not clear, I'd be happy to offer a better > explanation. > > In your query, the where clause on a column called "hour", at this > point I am unsure if that's a column of table1 or table2. If it's > column on table1, that predicate would get pushed up (if you have > hive.optimize.ppd property set to true), so it could possibly be done > in 1 MR job (I am not sure if that's presently the case, you will have > to check the explain plan). If however, the where clause is on a > column in the right table (table2 in your example), it can't be pushed > up since a column of the right table can have different values before > and after the LEFT OUTER JOIN. Therefore, the where clause would need > to be applied in a separate MR job. > > This is just my understanding, the full proof answer would lie in > checking out the explain plans and the Semantic Analyzer code. > > And for completeness, there is a conditional task (starting Hive 0.7) > that will convert your joins automatically to map joins where > applicable. This can be enabled by enabling hive.auto.convert.join > property. > > Mark > > On Wed, Dec 12, 2012 at 3:32 PM, Chen Song wrote: > > I have a silly question on how Hive interpretes a simple query with both > map > > side join and group by. > > > > Below query will translate into two jobs, with the 1st one as a map only > job > > doing the join and storing the output in a intermediary location, and the > > 2nd one as a map-reduce job taking the output of the 1st job as input and > > doing the group by. > > > > SELECT > > /*+ MAPJOIN(d) */ > > table.a, sum(table2.b) > > from table > > LEFT OUTER JOIN table2 > > ON table.id = table2.id > > where hour = '2012-12-11 11' > > group by table.a > > > > Why can't this be done within a single map reduce job? As what I can see > > from the query plan is that all 2nd job mapper do is taking the 1st job's > > mapper output. > > > > -- > > Chen Song > > > > > -- Nitin Pawar --bcaec554d7260f973804d0b5375a Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable I think Chen wanted to know why this is two phased query if I understood it= correctly=A0

When you run a mapside join .. it just per= forms the join query .. after that to execute the group by part it launches= the second job.=A0
I may be wrong but this is how I saw it whenever I executed group by q= ueries=A0


On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <grover.markgrove= r@gmail.com> wrote:
Hi Chen,
I think we would need some more information.

The query is referring to a table called "d" in the MAPJOIN hint = but
there is not such table in the query. Moreover, Map joins only make
sense when the right table is the one being "mapped" (in other wo= rds,
being kept in memory) in case of a Left Outer Join, similarly if the
left table is the one being "mapped" in case of a Right Outer Joi= n.
Let me know if this is not clear, I'd be happy to offer a better
explanation.

In your query, the where clause on a column called "hour", at thi= s
point I am unsure if that's a column of table1 or table2. If it's column on table1, that predicate would get pushed up (if you have
hive.optimize.ppd property set to true), so it could possibly be done
in 1 MR job (I am not sure if that's presently the case, you will have<= br> to check the explain plan). If however, the where clause is on a
column in the right table (table2 in your example), it can't be pushed<= br> up since a column of the right table can have different values before
and after the LEFT OUTER JOIN. Therefore, the where clause would need
to be applied in a separate MR job.

This is just my understanding, the full proof answer would lie in
checking out the explain plans and the Semantic Analyzer code.

And for completeness, there is a conditional task (starting Hive 0.7)
that will convert your joins automatically to map joins where
applicable. This can be enabled by enabling hive.auto.convert.join
property.

Mark

On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <chen.song.82@gmail.com> wrote:
> I have a silly question on how Hive interpretes a simple query with bo= th map
> side join and group by.
>
> Below query will translate into two jobs, with the 1st one as a map on= ly job
> doing the join and storing the output in a intermediary location, and = the
> 2nd one as a map-reduce job taking the output of the 1st job as input = and
> doing the group by.
>
> SELECT
> /*+ MAPJOIN(d) */
> table.a, sum(table2.b)
> from table
> LEFT OUTER JOIN table2
> ON table.id =3D table2.id
> where hour =3D '2012-12-11 11'
> group by table.a
>
> Why can't this be done within a single map reduce job? As what I c= an see
> from the query plan is that all 2nd job mapper do is taking the 1st jo= b's
> mapper output.
>
> --
> Chen Song
>
>



--
= Nitin Pawar
--bcaec554d7260f973804d0b5375a--