hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ana Gillan <ana.gil...@gmail.com>
Subject Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)
Date Sat, 02 Aug 2014 15:36:16 GMT
For my own user? It is as follows:

core file size          (blocks, -c) 0

data seg size           (kbytes, -d) unlimited

scheduling priority             (-e) 0

file size               (blocks, -f) unlimited

pending signals                 (-i) 483941

max locked memory       (kbytes, -l) 64

max memory size         (kbytes, -m) unlimited

open files                      (-n) 1024

pipe size            (512 bytes, -p) 8

POSIX message queues     (bytes, -q) 819200

real-time priority              (-r) 0

stack size              (kbytes, -s) 8192

cpu time               (seconds, -t) unlimited

max user processes              (-u) 800

virtual memory          (kbytes, -v) unlimited

file locks                      (-x) unlimited


From:  hadoop hive <hadoophive@gmail.com>
Reply-To:  <user@hadoop.apache.org>
Date:  Saturday, 2 August 2014 16:34
To:  <user@hadoop.apache.org>
Subject:  Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)


Can you check the ulimit for tour user. Which might be causing this.

On Aug 2, 2014 8:54 PM, "Ana Gillan" <ana.gillan@gmail.com> wrote:
> Hi everyone,
> 
> I am having an issue with MapReduce jobs running through Hive being killed
> after 600s timeouts and with very simple jobs taking over 3 hours (or just
> failing) for a set of files with a compressed size of only 1-2gb. I will try
> and provide as much information as I can here, so if someone can help, that
> would be really great.
> 
> I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:
>> € Master node:
>> 
>> ­ 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>> 
>> ­ 64GB DDR3 SDRAM
>> 
>> ­ 8 x 2TB SAS 600 hard drive (arranged as RAID 1 and RAID 5)
>> 
>> € Slave nodes (each):
>> 
>> ­ Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>> 
>> ­ 32GB DDR3 SDRAM
>> 
>> ­ 4 x 2TB SATA-3 hard drive
>> 
>> € Operating system on all nodes: openSUSE Linux 13.1
> 
> We have the Apache BigTop package version 0.7, with Hadoop version 2.0.6-alpha
> and Hive version 0.11.
> YARN has been configured as per these recommendations:
> http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
> 
> I also set the following additional settings before running jobs:
> set yarn.nodemanager.resource.cpu-vcores=4;
> set mapred.tasktracker.map.tasks.maximum=4;
> set hive.hadoop.supports.splittable.combineinputformat=true;
> set hive.merge.mapredfiles=true;
> 
> No one else uses this cluster while I am working.
> 
> What I¹m trying to do:
> I have a bunch of XML files on HDFS, which I am reading into Hive using this
> SerDe https://github.com/dvasilen/Hive-XML-SerDe. I then want to create a
> series of tables from these files and finally run a Python script on one of
> them to perform some scientific calculations. The files are .xml.gz format and
> (uncompressed) are only about 4mb in size each. hive.input.format is set to
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to avoid the ³small
> files problem.² 
> 
> Problems:
> My HQL statements work perfectly for up to 1000 of these files. Even for much
> larger numbers, doing select * works fine, which means the files are being
> read properly, but if I do something as simple as selecting just one column
> from the whole table for a larger number of files, containers start being
> killed and jobs fail with this error in the container logs:
> 
> 2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient:
> Failed to close file
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-
> ext-10001/_tmp.000000_0
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.L
> easeExpiredException): No lease on
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-
> ext-10001/_tmp.000000_0: File does not exist. Holder
> DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have
> any open files.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.ja
> va:2398)
> 
> Killed jobs show the above and also the following message:
> AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600
> secsContainer killed by the ApplicationMaster.
> 
> Also, in the node logs, I get a lot of pings like this:
> INFO [IPC Server handler 17 on 40961]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
> attempt_1403771939632_0362_m_000002_0
> 
> For 5000 files (1gb compressed), the selection of a single column finishes,
> but takes over 3 hours. For 10,000 files, the job hangs on about 4% map and
> then errors out.
> 
> While the jobs are running, I notice that the containers are not evenly
> distributed across the cluster. Some nodes lie idle, while the application
> master node runs 7 containers, maxing out the 28gb of RAM allocated to Hadoop
> on each slave node.
> 
> This is the output of netstat ­i while the column selection is running:
> Kernel Interface table
> 
> Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR
> Flg
> 
> eth0   1500   0 79515196      0 2265807     0 45694758      0      0      0
> BMRU
> 
> eth1   1500   0 77410508      0      0      0 40815746      0      0      0
> BMRU
> 
> lo    65536   0 16593808      0      0      0 16593808      0      0      0
> LRU
> 
> 
> 
> 
> 
> Are there some settings I am missing that mean the cluster isn¹t processing
> this data as efficiently as it can?
> 
> I am very new to Hadoop and there are so many logs, etc, that troubleshooting
> can be a bit overwhelming. Where else should I be looking to try and diagnose
> what is wrong?
> 
> Thanks in advance for any help you can give!
> 
> Kind regards,
> Ana 
> 



Mime
View raw message