Date: Sat, 27 Apr 2013 11:14:18 -0700
Subject: Re: Huge join performance issue
From: Jie Li <jieli@cs.duke.edu>
To: user@hive.apache.org

In order for us to understand the performance and identify the bottlenecks, could you do two things:

1) run the EXPLAIN command and share the output with us
2) share with us the Hadoop job histories generated by the query. They can be collected following http://www.cs.duke.edu/starfish/tutorial/job_history.html
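For reference, EXPLAIN just goes in front of the query; sketched against the query from your first mail it would look like:

EXPLAIN   -- prints the plan (stages, join and group-by operators) without running the job
select t1.b, t1.c, t2.d, t2.e, count(*)
from (select a,b,c from baseTB1 where ... ) t1
join (select a,d,e from baseTB2 where ...) t2
on t1.a = t2.a
group by t1.b, t1.c, t2.d, t2.e;

EXPLAIN EXTENDED gives more detail if the plain output isn't enough.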

Jie



On Mon, Apr 8, 2013 at 11:39 AM, Igor Tatarinov <igor@decide.com> wrote:
Did you verify that all your available mappers are running (and reducers too)? If you have a small number of partitions with huge files, you might be underutilizing mappers (check that the files are being split). Also, it might be optimal to have a single "wave" of reducers by setting the number of reduce tasks appropriately.
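For example (a sketch; the right number depends on how many reduce slots your cluster has):

-- suppose the cluster offers ~100 reduce slots (a made-up number)
set mapred.reduce.tasks=100;   -- one wave: no reducer waits in line behind another
-- or let Hive size reducers by input volume instead:
set hive.exec.reducers.bytes.per.reducer=1000000000;   -- roughly 1GB of input per reducer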

You might also consider optimizing a simpler query first:

select t1.a, count(*)
from (select a from baseTB1 where ... ) t1   -- filter by partition as well
  join
      (select a from baseTB2 where ...) t2   -- filter by partition as well
on t1.a = t2.a
group by t1.a

just to give you an idea how much overhead the extra columns are adding. If the columns are pretty big, they could be causing the slowdown.

igor
decide.com


On Sat, Apr 6, 2013 at 2:30 PM, Gabi D <gabid33@gmail.com> wrote:
Thank you for your answer, Nitin.
Does anyone have additional insight into this? It would be greatly appreciated.


On Thu, Apr 4, 2013 at 3:39 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
You don't really need subqueries to join tables which have common columns; it's additional overhead.
The best way to filter your data and speed up your processing is how you lay out your data.
When you have larger tables, I would use partitioning and bucketing to trim down the data and improve join performance.
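For example, a layout along these lines (just a sketch; the column types, partition column, and bucket count are invented):

create table baseTB1 (a bigint, b string, c string)
partitioned by (dt string)         -- a where clause on dt prunes whole partitions
clustered by (a) into 64 buckets   -- bucket both tables on the join key a
stored as rcfile;

set hive.enforce.bucketing = true;       -- honor the bucket spec when loading
set hive.optimize.bucketmapjoin = true;  -- join matching buckets map-side, avoiding a full shuffle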

distribute by is mainly used when you have custom map-reduce scripts and want to use the transform functionality in Hive. I haven't used it a lot, so I'm not sure about that part. It's also helpful to write where clauses in the join statements to reduce the dataset you want to join.
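The usual shape is something like this (a sketch; the script name is hypothetical):

add file my_reducer.py;   -- your custom reduce-side script (made up here)
from (select a, b from baseTB1 distribute by a sort by a) t
select transform (t.a, t.b) using 'python my_reducer.py' as (a, cnt);
-- distribute by routes all rows with the same a to one reducer;
-- sort by orders the rows within each reducer before the script sees them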



On Thu, Apr 4, 2013 at 5:53 PM, Gabi D <gabid33@gmail.com> wrote:
Hi all,
I have two tables I need to join and then summarize.
They are both huge (about 1B rows each, in the relevant partitions) and the query runs for over 2 hours, creating 5TB of intermediate data.

The current query looks like this:

select t1.b, t1.c, t2.d, t2.e, count(*)
from (select a,b,c from baseTB1 where ... ) t1   -- filter by partition as well
  join
      (select a,d,e from baseTB2 where ...) t2   -- filter by partition as well
on t1.a = t2.a
group by t1.b, t1.c, t2.d, t2.e


two questions:
1. would joining baseTB1 and baseTB2 directly (instead of via subqueries) be better in any way?
          (I know subqueries cause a lot of writes of intermediate data, but we also understand it's best to filter down the data being joined; which is "more" correct?)
2. can I use 'distribute by' and/or 'sort by' in some way that would help this? My understanding at the moment is that the problem lies in the fact
that the reduces are keyed on column a while the group by is on column b ...

Any thoughts would be appreciated.



--
Nitin Pawar


