I’m not sure which user is fetching the data, but I’m assuming no one changed that from the default. The data isn’t huge in size, just in number, so I suppose the open files limit is not the issue?

I’m running the job again with mapred.task.timeout=1200000, but containers are still being killed in the same way… Just without the timeout message. And it somehow massively slowed down the machine as well, so even typing commands took a long time (???)

I’m not sure what you mean by which stage it’s getting killed on. If you mean in the command line progress counters, it's always on Stage-1.
Also, this is the end of the container log for the killed container. Failed and killed jobs always start fine with lots of these “processing file” and “processing alias” statements, but then suddenly warn about a DataStreamer Exception and then are killed with an error, which is the same as the warning. Not sure if this exception is the actual issue or if it’s just a knock-on effect of something else.

2014-08-02 17:47:38,618 INFO [main] org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w63.xml.gz
2014-08-02 17:47:38,641 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:38,932 INFO [main] org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w67.xml.gz
2014-08-02 17:47:38,989 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:42,675 INFO [main] org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w6i.xml.gz
2014-08-02 17:47:42,888 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:45,416 WARN [Thread-8] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.-ext-10001/_tmp.000006_0: File does not exist. Holder DFSClient_attempt_1403771939632_0409_m_000006_0_303479000_1 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2217)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2137)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)

at org.apache.hadoop.ipc.Client.call(Client.java:1240)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:311)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1156)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1009)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:464)
2014-08-02 17:47:45,417 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient: Failed to close file /tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.-ext-10001/_tmp.000006_0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.-ext-10001/_tmp.000006_0: File does not exist. Holder DFSClient_attempt_1403771939632_0409_m_000006_0_303479000_1 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2217)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2137)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)

at org.apache.hadoop.ipc.Client.call(Client.java:1240)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:311)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1156)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1009)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:464)


Thanks a lot for your attention!	

From: hadoop hive <hadoophive@gmail.com>
Reply-To: <user@hadoop.apache.org>
Date: Saturday, 2 August 2014 17:36
To: <user@hadoop.apache.org>
Subject: Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

32k seems fine for mapred user(hope you using this for fetching you data) but if you have huge data on your system you can try 64k.

Did you try increasing you time from 600 sec to like 20 mins.

Can you also check on which stage its getting hanged or killed.

Thanks