Subject: Re: Issue joining 21 HUGE Hive tables
From: Mich Talebzadeh
To: user@hive.apache.org
Date: Thu, 24 Mar 2016 07:38:15 +0000

Posting a typical query that you are using will help to clarify the issue.

Also, you may use TEMPORARY TABLEs to hold intermediate results between stages.

On the face of it, you can time each query to find out which components take longest, etc.:

select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS StartTime;

CREATE TEMPORARY TABLE tmp AS
SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
--FROM smallsales s, times t, channels c
FROM sales s, times t, channels c
WHERE s.time_id = t.time_id
AND   s.channel_id = c.channel_id
GROUP BY t.calendar_month_desc, c.channel_desc
;
select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS FirstQuery;
SELECT calendar_month_desc AS MONTH, channel_desc AS CHANNEL, TotalSales
FROM tmp
ORDER BY MONTH, CHANNEL
LIMIT 5
;
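Following the same pattern, a closing timestamp after the last query brackets the whole run (the EndTime alias is illustrative):

-- closing timestamp, same pattern as StartTime/FirstQuery above
select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS EndTime;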


HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 24 March 2016 at 06:36, Jörn Franke <jornfranke@gmail.com> wrote:
Joining so many external tables is always an issue with any component. Your problem is not Hive-specific, but your data model seems to be messed up. First of all, you should have the tables in an appropriate format, such as ORC or Parquet, and they should not be external. Then you should use the right data types for columns, e.g. an INT instead of a VARCHAR if a column holds only numbers. After that, check whether you can pre-join the data, store it in one big flat table, and run your queries on that; a sketch of these steps follows below.
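As a minimal sketch, with hypothetical table and column names (sales_ext standing in for one of the large external tables, id/amount for its columns); the real DDL depends on your schema:

-- 1) Convert a text-format external table into a managed ORC table
--    with proper column types (names here are illustrative).
CREATE TABLE sales_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY')
AS
SELECT CAST(id     AS INT)           AS id,
       CAST(amount AS DECIMAL(10,2)) AS amount,
       time_id,
       channel_id
FROM sales_ext;

-- 2) Pre-join into one big flat table and query that instead of
--    repeating the multi-way join (times_orc/channels_orc assumed
--    converted the same way).
CREATE TABLE sales_flat
STORED AS ORC
AS
SELECT s.id, s.amount, t.calendar_month_desc, c.channel_desc
FROM sales_orc s
JOIN times_orc    t ON s.time_id    = t.time_id
JOIN channels_orc c ON s.channel_id = c.channel_id;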

Then you should look at min/max indexes, bloom filters, statistics, partitions, etc.; see the sketch below.
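A sketch of what that could look like, again with hypothetical names; which columns to partition, bloom-filter, and analyze depends entirely on your queries (ORC keeps min/max indexes per stripe automatically):

-- Partitioned ORC table with a bloom filter on the join key
-- (event_date/id are illustrative choices).
CREATE TABLE sales_part (
  id     INT,
  amount DECIMAL(10,2)
)
PARTITIONED BY (event_date STRING)
STORED AS ORC
TBLPROPERTIES ('orc.bloom.filter.columns'='id');

-- Gather table- and column-level statistics so the optimizer can
-- choose better join orders and plans.
ANALYZE TABLE sales_part PARTITION (event_date) COMPUTE STATISTICS;
ANALYZE TABLE sales_part PARTITION (event_date) COMPUTE STATISTICS FOR COLUMNS;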

Maybe you can post more details about the data model and queries.

On 24 Mar 2016, at 02:49, Sanka, Himabindu <himabindu_sanka@optum.com> wrote:

Hi Team,


I need some inputs from you. I have a requirement for my project where I have to join 21 Hive external tables.


Out of these, 6 tables are HUGE, with 500 million records of data. The other 15 tables are smaller, around 100 to 1000 records each.


When I do inner joins / left outer joins, the query takes hours to run.


Please let me know some optimization techniques, or any other ecosystem components that perform better than Hive.


Regards,

Hima




