Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
Received-SPF: pass (athena.apache.org: domain of Keshav.C.Savant@fisglobal.com
 designates 199.200.24.190 as permitted sender)
Content-Class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01CCB406.5B985953"
Subject: Hive query taking too much time
Date: Tue, 6 Dec 2011 16:30:45 +0530
Message-ID: <651A7A4AE5BD734D885D3A36D8A5538902EF6195@SMBFISBOM01.FNFIS.COM>
Thread-Topic: Hive query taking too much time
Thread-Index: Acy0BJeYHFyp6ZhgStaCHIH8hDRSZQ==
From: "Savant, Keshav" <Keshav.C.Savant@fisglobal.com>
To: <user@hive.apache.org>

------_=_NextPart_001_01CCB406.5B985953
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

Hi All,

My setup is:

hadoop-0.20.203.0

hive-0.7.1

I have a 5-node cluster: 4 datanodes and 1 namenode (which also acts as
the secondary namenode). On the namenode I have set up Hive with
HiveDerbyServerMode to support multiple Hive server connections.

I have loaded plain-text CSV files into HDFS using Hive 'LOAD DATA'
statements. There are 2624 files in total and their combined size is
only 713 MB, which is very small by Hadoop standards, given that Hadoop
can handle TBs of data easily.

The problem is that a simple count query (i.e. select count(*)
from a_table) takes far too long to execute.

For instance, the query takes almost 17 minutes when the table has
950,000 rows. I understand that this is far too long for such a small
amount of data.

This is only a dev environment; in production the number of files and
their combined size will grow into the millions and GBs respectively.
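
To put these figures in perspective, here is a rough back-of-the-envelope
calculation in Python. It uses only the numbers quoted in this message;
the variable names are illustrative, 1 MB is assumed to be 1024 KB, and
the run time is treated as exactly 17 minutes although it was only
"almost" that.

```python
# Figures quoted in the message (nothing here is measured independently).
num_files = 2624
total_size_mb = 713
num_rows = 950_000
query_minutes = 17  # "almost 17 minutes", treated as exact for this estimate

query_seconds = query_minutes * 60

# Average size of one input file, assuming 1 MB = 1024 KB.
avg_file_kb = total_size_mb * 1024 / num_files

# Overall throughput of the count query across the whole cluster.
rows_per_second = num_rows / query_seconds
mb_per_second = total_size_mb / query_seconds

print(f"average file size: {avg_file_kb:.0f} KB")
print(f"throughput: {rows_per_second:.0f} rows/s, {mb_per_second:.2f} MB/s")
```

In other words, each input file averages well under 1 MB, and the query
as a whole processes less than 1 MB of data per second.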

I have analyzed the logs on all the datanodes and on the
namenode/secondary namenode, and I do not find any errors in them.

I have also tried setting mapred.reduce.tasks to a fixed number, but the
number of reduce tasks always remains 1, while the number of map tasks
is determined by Hive alone.

Any suggestions on what I am doing wrong, or on how I can improve the
performance of Hive queries? Any suggestion or pointer is highly
appreciated.

Keshav

_____________
The information contained in this message is proprietary and/or
confidential. If you are not the intended recipient, please: (i) delete
the message and all copies; (ii) do not disclose, distribute or use the
message in any manner; and (iii) notify the sender immediately. In
addition, please be aware that any message addressed to our domain is
subject to archiving and review by persons other than the intended
recipient. Thank you.

------_=_NextPart_001_01CCB406.5B985953--