From: Ana Gillan
To: user@hadoop.apache.org
Date: Sat, 02 Aug 2014 16:24:09 +0100
Subject: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Hi everyone,

I am having an issue with MapReduce jobs running through Hive being killed after 600s timeouts, and with very simple jobs taking over 3 hours (or just failing) for a set of files with a compressed size of only 1-2 GB. I will try to provide as much information as I can here, so if someone can help, that would be really great.
I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:

• Master node:
  – 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
  – 64GB DDR3 SDRAM
  – 8 x 2TB SAS 600 hard drives (arranged as RAID 1 and RAID 5)
• Slave nodes (each):
  – Intel Xeon 4-core E3-1220v3 @ 3.1GHz
  – 32GB DDR3 SDRAM
  – 4 x 2TB SATA-3 hard drives
• Operating system on all nodes: openSUSE Linux 13.1

We have the Apache BigTop package version 0.7, with Hadoop version 2.0.6-alpha and Hive version 0.11. YARN has been configured as per these recommendations: http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/

I also set the following additional settings before running jobs:

set yarn.nodemanager.resource.cpu-vcores=4;
set mapred.tasktracker.map.tasks.maximum=4;
set hive.hadoop.supports.splittable.combineinputformat=true;
set hive.merge.mapredfiles=true;

No one else uses this cluster while I am working.

What I'm trying to do:
I have a bunch of XML files on HDFS, which I am reading into Hive using this SerDe: https://github.com/dvasilen/Hive-XML-SerDe. I then want to create a series of tables from these files and finally run a Python script on one of them to perform some scientific calculations. The files are in .xml.gz format and (uncompressed) are only about 4 MB in size each. hive.input.format is set to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to avoid the "small files problem."

Problems:
My HQL statements work perfectly for up to 1000 of these files.
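For context on scale, here is my own rough arithmetic (not from any Hadoop source) for how the combined-input setup above should behave. The 256 MB max combined split size is a hypothetical value for illustration, not something taken from this cluster's config; gzip files are not splittable, so the combiner can only group whole files:

```python
import math

def estimate_map_tasks(num_files, avg_file_mb, max_split_mb=256):
    """Rough estimate of map tasks when whole small files are packed
    into combined splits of at most max_split_mb (hypothetical value)."""
    # How many whole files fit in one combined split (at least one,
    # since a file larger than the split size still gets its own task).
    files_per_split = max(1, int(max_split_mb // avg_file_mb))
    return math.ceil(num_files / files_per_split)

# ~5000 files at ~0.2 MB compressed each (about 1 GB total compressed)
# should collapse into only a handful of map tasks:
print(estimate_map_tasks(5000, 0.2))  # → 4
```

If combining were not in effect, the same 5000 files would mean 5000 map tasks, which is the "small files problem" the CombineHiveInputFormat setting is meant to avoid.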
Even for much larger numbers, doing select * works fine, which means the files are being read properly. But if I do something as simple as selecting just one column from the whole table for a larger number of files, containers start being killed and jobs fail with this error in the container logs:

2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient: Failed to close file /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0: File does not exist. Holder DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)

Killed jobs show the above and also the following message:

AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600 secs. Container killed by the ApplicationMaster.

Also, in the node logs, I get a lot of pings like this:

INFO [IPC Server handler 17 on 40961] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1403771939632_0362_m_000002_0

For 5000 files (1 GB compressed), the selection of a single column finishes, but takes over 3 hours. For 10,000 files, the job hangs at about 4% map and then errors out.

While the jobs are running, I notice that the containers are not evenly distributed across the cluster. Some nodes lie idle, while the application master node runs 7 containers, maxing out the 28 GB of RAM allocated to Hadoop on each slave node.
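For anyone reading along, my understanding of the "Timed out after 600 secs" message above (please correct me if this is wrong): 600s is the Hadoop default for mapreduce.task.timeout, and an attempt is killed when it goes that long without reporting progress. A minimal sketch of that check, in plain Python rather than actual Hadoop code:

```python
# Sketch only, not Hadoop's implementation: the ApplicationMaster kills
# an attempt that has not reported progress within the task timeout.
TASK_TIMEOUT_SECS = 600  # Hadoop default for mapreduce.task.timeout (600000 ms)

def is_timed_out(last_progress_ts, now, timeout_secs=TASK_TIMEOUT_SECS):
    """True if the attempt should be killed for lack of progress."""
    return (now - last_progress_ts) > timeout_secs

# An attempt that last reported progress 650 s ago would be killed:
print(is_timed_out(0, 650))  # → True
```

So the pings in the node logs look like attempts that are still heartbeating but making no progress on the actual work.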
This is the output of netstat -i while the column selection is running:

Kernel Interface table
Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0 79515196      0 2265807      0 45694758      0      0      0 BMRU
eth1   1500   0 77410508      0       0      0 40815746      0      0      0 BMRU
lo    65536   0 16593808      0       0      0 16593808      0      0      0 LRU

Are there some settings I am missing that mean the cluster isn't processing this data as efficiently as it can?

I am very new to Hadoop, and there are so many logs, etc., that troubleshooting can be a bit overwhelming. Where else should I be looking to try and diagnose what is wrong?

Thanks in advance for any help you can give!

Kind regards,
Ana