hadoop-user mailing list archives

From hadoop hive <hadooph...@gmail.com>
Subject Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)
Date Sat, 02 Aug 2014 16:36:12 GMT
32k seems fine for the mapred user (I hope that is the account you are using to fetch your data),
but if you have huge data on your system you can try 64k.
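
For reference, the usual mechanism for raising it is /etc/security/limits.conf (or a drop-in under /etc/security/limits.d/), then restarting the daemons so they pick up the new limit. A minimal sketch, assuming the BigTop services run as the hdfs, yarn and mapred accounts (check the actual account names on your nodes; some init scripts set their own ulimit, in which case the value has to be changed there instead):

# /etc/security/limits.conf  -- illustrative values, not a tuning recommendation
hdfs    -    nofile    65536
yarn    -    nofile    65536
mapred  -    nofile    65536

Your own login would need a similar nofile line too if you submit jobs from it.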

Did you try increasing your timeout from 600 seconds to something like 20 minutes?
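
If you do, the 600 second figure is normally the MapReduce task timeout, mapreduce.task.timeout, given in milliseconds (the older name mapred.task.timeout is still accepted as an alias). For a single Hive session, 20 minutes would be roughly:

set mapreduce.task.timeout=1200000;

Bear in mind the timeout only fires when a task stops reporting progress, so raising it mainly helps tasks that are slow rather than genuinely stuck.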

Can you also check at which stage the job is getting hung or killed?
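
The ResourceManager web UI shows per-task progress; from the command line, something along these lines also works, using the job/application ids from your own run rather than the ones in your pasted error:

mapred job -status job_1403771939632_0402
yarn application -status application_1403771939632_0402

That should show whether it dies in the map phase or later.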

Thanks
 On Aug 2, 2014 9:38 PM, "Ana Gillan" <ana.gillan@gmail.com> wrote:

> Filemax across the cluster is set to over 6 million. I’ve checked the
> open file limits for the accounts used by the Hadoop daemons
>  and they have an open file limit of 32K. This is confirmed by the various
> .out files, e.g.
>
> /var/log/hadoop-hdfs/hadoop-hdfs-datanode-slave1.out
>
> This contains “open files (-n) 32768”. Is this too low? What is the recommended
> value for open files on all nodes? Also does my own user need to have the
> same value?
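
A direct way to confirm what a running daemon actually got, independent of the .out file, is to read its limits from /proc. A sketch, assuming a single DataNode process on the host (the same check works for the NodeManager and the NameNode):

cat /proc/$(pgrep -f DataNode)/limits | grep -i 'open files'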
>
> I’ve also tried running the same column selection on files crushed by the
> filecrush program (https://github.com/edwardcapriolo/filecrush/).
> This created 5 large files out of the 10,000 small files (still totalling
> 2gb compressed), but this job won’t progress past 0% map.
>
> From: Ana Gillan <ana.gillan@gmail.com>
> Date: Saturday, 2 August 2014 16:36
> To: <user@hadoop.apache.org>
> Subject: Re:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)
>
> For my own user? It is as follows:
>
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 483941
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 800
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> From: hadoop hive <hadoophive@gmail.com>
> Reply-To: <user@hadoop.apache.org>
> Date: Saturday, 2 August 2014 16:34
> To: <user@hadoop.apache.org>
> Subject: Re:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)
>
> Can you check the ulimit for your user? That might be causing this.
> On Aug 2, 2014 8:54 PM, "Ana Gillan" <ana.gillan@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I am having an issue with MapReduce jobs running through Hive being
>> killed after 600s timeouts and with very simple jobs taking over 3 hours
>> (or just failing) for a set of files with a compressed size of only 1-2gb.
>> I will try and provide as much information as I can here, so if someone can
>> help, that would be really great.
>>
>> I have a cluster of 7 nodes (1 master, 6 slaves) with the following
>> config:
>>
>> • Master node:
>>
>> – 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>>
>> – 64GB DDR3 SDRAM
>>
>> – 8 x 2TB SAS 600 hard drive (arranged as RAID 1 and RAID 5)
>>
>> • Slave nodes (each):
>>
>> – Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>>
>> – 32GB DDR3 SDRAM
>>
>> – 4 x 2TB SATA-3 hard drive
>>
>> • Operating system on all nodes: openSUSE Linux 13.1
>>
>>  We have the Apache BigTop package version 0.7, with Hadoop version
>> 2.0.6-alpha and Hive version 0.11.
>> YARN has been configured as per these recommendations:
>> http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
>>
>> I also set the following additional settings before running jobs:
>> set yarn.nodemanager.resource.cpu-vcores=4;
>> set mapred.tasktracker.map.tasks.maximum=4;
>> set hive.hadoop.supports.splittable.combineinputformat=true;
>> set hive.merge.mapredfiles=true;
>>
>> No one else uses this cluster while I am working.
>>
>> What I’m trying to do:
>> I have a bunch of XML files on HDFS, which I am reading into Hive using
>> this SerDe https://github.com/dvasilen/Hive-XML-SerDe. I then want to
>> create a series of tables from these files and finally run a Python script
>> on one of them to perform some scientific calculations. The files are
>> in .xml.gz format and (uncompressed) are only about 4mb in size each. hive.input.format
>> is set to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to
>> avoid the “small files problem.”
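
As a reference point for others reading this thread, a table over files like that, using the linked SerDe, tends to look roughly like the DDL below. The table name, columns, XPath expressions and location here are invented for illustration, and the SerDe class/property names should be double-checked against that project's README:

CREATE EXTERNAL TABLE xml_records (id STRING, payload STRING)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "column.xpath.id"="/record/@id",
  "column.xpath.payload"="/record/payload/text()"
)
STORED AS
  INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/example/xml'
TBLPROPERTIES ("xmlinput.start"="<record", "xmlinput.end"="</record>");

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;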
>>
>> Problems:
>> My HQL statements work perfectly for up to 1000 of these files. Even for
>> much larger numbers, doing select * works fine, which means the files are
>> being read properly, but if I do something as simple as selecting just one
>> column from the whole table for a larger number of files, containers start
>> being killed and jobs fail with this error in the container logs:
>>
>> 2014-08-02 14:51:45,137 ERROR [Thread-3]
>> org.apache.hadoop.hdfs.DFSClient: Failed to close file
>> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>> No lease on
>> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0:
>> File does not exist. Holder
>> DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have
>> any open files.
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
>>
>> Killed jobs show the above and also the following message:
>> AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600
>> secsContainer killed by the ApplicationMaster.
>>
>> Also, in the node logs, I get a lot of pings like this:
>> INFO [IPC Server handler 17 on 40961]
>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
>> attempt_1403771939632_0362_m_000002_0
>>
>> For 5000 files (1gb compressed), the selection of a single column
>> finishes, but takes over 3 hours. For 10,000 files, the job hangs on about
>> 4% map and then errors out.
>>
>> While the jobs are running, I notice that the containers are not evenly
>> distributed across the cluster. Some nodes lie idle, while the application
>> master node runs 7 containers, maxing out the 28gb of RAM allocated
>> to Hadoop on each slave node.
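
A quick way to confirm that skew while a job is running is to ask the ResourceManager for its per-node view, either in its web UI (port 8088 by default) or from the shell; on the versions I have used, the listing includes a running-container count per node:

yarn node -list

If some slaves consistently show zero running containers, that is worth chasing separately from the query itself.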
>>
>> This is the output of netstat -i while the column selection is running:
>>
>> Kernel Interface table
>>
>> Iface   MTU Met     RX-OK RX-ERR  RX-DRP RX-OVR     TX-OK TX-ERR TX-DRP TX-OVR Flg
>> eth0   1500   0  79515196      0 2265807      0  45694758      0      0      0 BMRU
>> eth1   1500   0  77410508      0       0      0  40815746      0      0      0 BMRU
>> lo    65536   0  16593808      0       0      0  16593808      0      0      0 LRU
>>
>> Are there some settings I am missing that mean the cluster isn’t
>> processing this data as efficiently as it can?
>>
>> I am very new to Hadoop and there are so many logs, etc, that
>> troubleshooting can be a bit overwhelming. Where else should I be looking
>> to try and diagnose what is wrong?
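
One generally useful starting point: if log aggregation is enabled, the full container logs for a run can be pulled into one stream and searched, e.g. (the application id here is the one visible in your pasted error; substitute the id of the failing run):

yarn logs -applicationId application_1403771939632_0402 | less

Beyond that, the ResourceManager web UI (port 8088 by default) and the NameNode log on the master node are the usual places to look for the LeaseExpiredException side of this.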
>>
>> Thanks in advance for any help you can give!
>>
>> Kind regards,
>> Ana
>>
>>
