Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
Received-SPF: pass (nike.apache.org: domain of nitinpawar432@gmail.com
 designates 209.85.215.48 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAKm=R7WoAvh3-mipgxkuYr=d9v9i_+8-z-c_+e3cQTuDrDr-hQ@mail.gmail.com>
References: 
 <CAGF+3rYACkBHPFS5+5Ww8wzMniJihUCjxA=wFjzfOcX5y0pPeQ@mail.gmail.com>
	<CAKm=R7WoAvh3-mipgxkuYr=d9v9i_+8-z-c_+e3cQTuDrDr-hQ@mail.gmail.com>
Date: Thu, 13 Dec 2012 11:00:38 +0530
Message-ID: 
 <CAORpBsjV4km0zgAxHHGUatJZCCaMG-rYuDrfTL5UQOJS2TKFBA@mail.gmail.com>
Subject: Re: map side join with group by
From: Nitin Pawar <nitinpawar432@gmail.com>
To: user@hive.apache.org
Content-Type: multipart/alternative; boundary=bcaec554d7260f973804d0b5375a

--bcaec554d7260f973804d0b5375a
Content-Type: text/plain; charset=ISO-8859-1

I think Chen wanted to know why this is a two-phase query, if I understood
it correctly.

When you run a map-side join, the first job performs only the join. To
execute the group by part, Hive then launches a second job.
I may be wrong, but this is what I saw whenever I executed group by
queries.
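
One way to check this is to look at the stages in the explain plan. Below is
a minimal sketch, not Chen's exact query: table1 and table2 are placeholder
names, I am assuming "hour" is a column of table1, and the hint names table2
(the right side of the LEFT OUTER JOIN) instead of "d".

```sql
-- Sketch only: table1/table2 are placeholders for the real tables, and
-- "hour" is assumed to be a column of table1.
-- Optional: let Hive convert eligible joins to map joins automatically.
SET hive.auto.convert.join=true;

-- EXPLAIN prints the stage plan without running the query, so you can
-- see whether the join and the group by end up in separate stages.
EXPLAIN
SELECT /*+ MAPJOIN(table2) */
  table1.a, sum(table2.b)
FROM table1
LEFT OUTER JOIN table2
  ON table1.id = table2.id
WHERE table1.hour = '2012-12-11 11'
GROUP BY table1.a;
```

If the plan shows a map-only stage for the join followed by a map-reduce
stage for the group by, that matches the two jobs Chen is seeing.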


On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <grover.markgrover@gmail.com> wrote:

> Hi Chen,
> I think we would need some more information.
>
> The query refers to a table called "d" in the MAPJOIN hint, but
> there is no such table in the query. Moreover, map joins only make
> sense when the right table is the one being "mapped" (in other words,
> kept in memory) in the case of a Left Outer Join, and when the
> left table is the one being "mapped" in the case of a Right Outer Join.
> Let me know if this is not clear, I'd be happy to offer a better
> explanation.
>
> Your query has a where clause on a column called "hour"; at this
> point I am unsure whether that's a column of table1 or table2. If it's
> a column of table1, that predicate would get pushed down (if you have
> the hive.optimize.ppd property set to true), so it could possibly be done
> in 1 MR job (I am not sure if that's presently the case; you will have
> to check the explain plan). If, however, the where clause is on a
> column in the right table (table2 in your example), it can't be pushed
> down, since a column of the right table can have different values before
> and after the LEFT OUTER JOIN. Therefore, the where clause would need
> to be applied in a separate MR job.
>
> This is just my understanding; the foolproof answer would lie in
> checking the explain plans and the Semantic Analyzer code.
>
> And for completeness, there is a conditional task (starting with Hive 0.7)
> that will convert your joins automatically to map joins where
> applicable. This can be enabled by setting the hive.auto.convert.join
> property to true.
>
> Mark
>
> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <chen.song.82@gmail.com> wrote:
> > I have a silly question on how Hive interprets a simple query with both
> > map-side join and group by.
> >
> > The query below translates into two jobs: the 1st is a map-only job
> > that does the join and stores the output in an intermediary location, and
> > the 2nd is a map-reduce job that takes the output of the 1st job as input
> > and does the group by.
> >
> > SELECT
> > /*+ MAPJOIN(d) */
> > table.a, sum(table2.b)
> > from table
> > LEFT OUTER JOIN table2
> > ON table.id = table2.id
> > where hour = '2012-12-11 11'
> > group by table.a
> >
> > Why can't this be done within a single map-reduce job? From what I can see
> > in the query plan, all the 2nd job's mappers do is read the 1st job's
> > mapper output.
> >
> > --
> > Chen Song
> >
> >
>


-- 
Nitin Pawar

--bcaec554d7260f973804d0b5375a--