From: Ana Gillan
To: user@hadoop.apache.org
Date: Sat, 02 Aug 2014 16:24:09 +0100
Subject: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Hi everyone,

I am having an issue with MapReduce jobs running through Hive being killed after 600s timeouts, and with very simple jobs taking over 3 hours (or just failing) for a set of files with a compressed size of only 1-2 GB. I will try to provide as much information as I can here, so if someone can help, that would be really great.
I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:

• Master node:
  – 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
  – 64GB DDR3 SDRAM
  – 8 x 2TB SAS 600 hard drives (arranged as RAID 1 and RAID 5)
• Slave nodes (each):
  – Intel Xeon 4-core E3-1220v3 @ 3.1GHz
  – 32GB DDR3 SDRAM
  – 4 x 2TB SATA-3 hard drives
• Operating system on all nodes: openSUSE Linux 13.1

We have the Apache BigTop package version 0.7, with Hadoop version 2.0.6-alpha and Hive version 0.11. YARN has been configured as per these recommendations: http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/

I also set the following additional settings before running jobs:

set yarn.nodemanager.resource.cpu-vcores=4;
set mapred.tasktracker.map.tasks.maximum=4;
set hive.hadoop.supports.splittable.combineinputformat=true;
set hive.merge.mapredfiles=true;

No one else uses this cluster while I am working.

What I'm trying to do:
I have a bunch of XML files on HDFS, which I am reading into Hive using this SerDe: https://github.com/dvasilen/Hive-XML-SerDe. I then want to create a series of tables from these files and finally run a Python script on one of them to perform some scientific calculations. The files are in .xml.gz format and (uncompressed) are only about 4 MB in size each. hive.input.format is set to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to avoid the "small files problem."

Problems:
My HQL statements work perfectly for up to 1000 of these files.
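For context on scale, here is my own rough arithmetic (not from any Hadoop source) for how the combined-input setup above should behave. The 256 MB max combined split size is a hypothetical value for illustration, not something taken from this cluster's config; gzip files are not splittable, so the combiner can only group whole files:

```python
import math

def estimate_map_tasks(num_files, avg_file_mb, max_split_mb=256):
    """Rough estimate of map tasks when whole small files are packed
    into combined splits of at most max_split_mb (hypothetical value)."""
    # How many whole files fit in one combined split (at least one,
    # since a file larger than the split size still gets its own task).
    files_per_split = max(1, int(max_split_mb // avg_file_mb))
    return math.ceil(num_files / files_per_split)

# ~5000 files at ~0.2 MB compressed each (about 1 GB total compressed)
# should collapse into only a handful of map tasks:
print(estimate_map_tasks(5000, 0.2))  # → 4
```

If combining were not in effect, the same 5000 files would mean 5000 map tasks, which is the "small files problem" the CombineHiveInputFormat setting is meant to avoid.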
Even for much larger numbers, doing select * works fine, which means the files are being read properly. But if I do something as simple as selecting just one column from the whole table for a larger number of files, containers start being killed and jobs fail with this error in the container logs:

2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient: Failed to close file /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0: File does not exist. Holder DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)

Killed jobs show the above and also the following message:

AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600 secs. Container killed by the ApplicationMaster.

Also, in the node logs, I get a lot of pings like this:

INFO [IPC Server handler 17 on 40961] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1403771939632_0362_m_000002_0

For 5000 files (1 GB compressed), the selection of a single column finishes, but takes over 3 hours. For 10,000 files, the job hangs at about 4% map and then errors out.

While the jobs are running, I notice that the containers are not evenly distributed across the cluster. Some nodes lie idle, while the application master node runs 7 containers, maxing out the 28 GB of RAM allocated to Hadoop on each slave node.
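For anyone reading along, my understanding of the "Timed out after 600 secs" message above (please correct me if this is wrong): 600s is the Hadoop default for mapreduce.task.timeout, and an attempt is killed when it goes that long without reporting progress. A minimal sketch of that check, in plain Python rather than actual Hadoop code:

```python
# Sketch only, not Hadoop's implementation: the ApplicationMaster kills
# an attempt that has not reported progress within the task timeout.
TASK_TIMEOUT_SECS = 600  # Hadoop default for mapreduce.task.timeout (600000 ms)

def is_timed_out(last_progress_ts, now, timeout_secs=TASK_TIMEOUT_SECS):
    """True if the attempt should be killed for lack of progress."""
    return (now - last_progress_ts) > timeout_secs

# An attempt that last reported progress 650 s ago would be killed:
print(is_timed_out(0, 650))  # → True
```

So the pings in the node logs look like attempts that are still heartbeating but making no progress on the actual work.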
This is the output of netstat -i while the column selection is running:

Kernel Interface table
Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0 79515196      0 2265807      0 45694758      0      0      0 BMRU
eth1   1500   0 77410508      0       0      0 40815746      0      0      0 BMRU
lo    65536   0 16593808      0       0      0 16593808      0      0      0 LRU

Are there some settings I am missing that mean the cluster isn't processing this data as efficiently as it can?

I am very new to Hadoop, and there are so many logs, etc., that troubleshooting can be a bit overwhelming. Where else should I be looking to try and diagnose what is wrong?

Thanks in advance for any help you can give!

Kind regards,
Ana