From: Ana Gillan <ana.gillan@gmail.com>
To: <user@hadoop.apache.org>
Date: Monday, 4 August 2014 12:52
Subject: Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Hi,

Unfortunately, after I set my user's ulimit -n to 65536, I still get the same bad performance, killed containers and errors as before.
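For reference, the limit was raised along these lines; a minimal sketch, assuming the standard /etc/security/limits.conf mechanism (the username zslf023 is the one that appears in the logs below; the file path is the usual Linux default):

# Persistently raise the open-files limit for the job-submitting user.
# Assumes PAM reads /etc/security/limits.conf (the common default).
echo 'zslf023  soft  nofile  65536' | sudo tee -a /etc/security/limits.conf
echo 'zslf023  hard  nofile  65536' | sudo tee -a /etc/security/limits.conf

# The new limit applies to fresh login sessions only; verify after re-login:
ulimit -n    # expect 65536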
I collected together a bunch of logs from around the moment when the containers are being killed (application master log, killed container log, hadoop-hdfs logs, hadoop-yarn logs), and this is what's happening, in order. Is the PrivilegedActionException significant? Is my user missing some required group?

Thanks in advance!
Ana

2014-08-04 12:21:47,868 INFO [IPC Server handler 4 on 49784] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1403771939632_0413_m_000006_0
2014-08-04 12:21:47,979 INFO [IPC Server handler 5 on 49784] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1403771939632_0413_m_000000_0
2014-08-04 12:21:48,102 INFO [IPC Server handler 6 on 49784] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1403771939632_0413_m_000002_0 is : 0.061734322
2014-08-04 12:21:48,102 INFO [IPC Server handler 6 on 49784] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from attempt_1403771939632_0413_m_000002_0
2014-08-04 12:21:48,110 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp.-ext-10001/_tmp.000006_1. BP-1198908146-10.0.0.11-1399888065184 blk_-300655159790361685_67008{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.0.0.44:50010|RBW], ReplicaUnderConstruction[10.0.0.43:50010|RBW], ReplicaUnderConstruction[10.0.0.40:50010|RBW]]}
2014-08-04 12:21:48,145 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.0.0.40:50010 is added to blk_-300655159790361685_67008{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.0.0.44:50010|RBW], ReplicaUnderConstruction[10.0.0.43:50010|RBW], ReplicaUnderConstruction[10.0.0.40:50010|RBW]]} size 0
2014-08-04 12:21:48,145 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.0.0.43:50010 is added to blk_-300655159790361685_67008{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.0.0.44:50010|RBW], ReplicaUnderConstruction[10.0.0.43:50010|RBW], ReplicaUnderConstruction[10.0.0.40:50010|RBW]]} size 0
2014-08-04 12:21:48,146 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.0.0.44:50010 is added to blk_-300655159790361685_67008{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.0.0.44:50010|RBW], ReplicaUnderConstruction[10.0.0.43:50010|RBW], ReplicaUnderConstruction[10.0.0.40:50010|RBW]]} size 0
2014-08-04 12:21:48,147 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp.-ext-10001/_tmp.000006_1 is closed by DFSClient_attempt_1403771939632_0413_m_000006_1_-1142346950_1
2014-08-04 12:21:48,147 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 422 Total time for transactions(ms): 5 Number of transactions batched in Syncs: 0 Number of syncs: 186 SyncTimes(ms): 204
2014-08-04 12:21:48,155 INFO [IPC Server handler 7 on 49784] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1403771939632_0413_m_000006_1 is : 0.93497133
2014-08-04 12:21:48,155 INFO [IPC Server handler 7 on 49784] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,176 INFO [IPC Server handler 8 on 49784] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1403771939632_0413_m_000006_1 is : 1.0
2014-08-04 12:21:48,176 INFO [IPC Server handler 8 on 49784] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,177 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1403771939632_0413_m_000006_1 TaskAttempt Transitioned from RUNNING to SUCCESS_CONTAINER_CLEANUP
2014-08-04 12:21:48,177 INFO [IPC Server handler 9 on 49784] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement from attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,178 INFO [ContainerLauncher #4] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,178 INFO [ContainerLauncher #4] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1403771939632_0413_01_000016 taskAttempt attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1403771939632_0413_m_000006_1 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2014-08-04 12:21:48,196 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2
2014-08-04 12:21:48,196 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1403771939632_0413_m_000006_0 TaskAttempt Transitioned from RUNNING to KILL_CONTAINER_CLEANUP
2014-08-04 12:21:48,196 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Issuing kill to other attempt attempt_1403771939632_0413_m_000006_0
2014-08-04 12:21:48,196 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,196 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1403771939632_0413_m_000006 Task Transitioned from RUNNING to SUCCEEDED
2014-08-04 12:21:48,197 INFO [ContainerLauncher #7] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1403771939632_0413_m_000006_0
2014-08-04 12:21:48,197 INFO [ContainerLauncher #7] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1403771939632_0413_01_000009 taskAttempt attempt_1403771939632_0413_m_000006_0
2014-08-04 12:21:48,219 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1403771939632_0413_m_000006_0 TaskAttempt Transitioned from KILL_CONTAINER_CLEANUP to KILL_TASK_CLEANUP
2014-08-04 12:21:48,230 INFO [CommitterEvent Processor #1] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: TASK_ABORT
2014-08-04 12:21:48,231 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1403771939632_0413_m_000006_0 TaskAttempt Transitioned from KILL_TASK_CLEANUP to KILLED
2014-08-04 12:21:48,318 INFO [IPC Server handler 10 on 49784] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1403771939632_0413_m_000002_1 is : 0.047582902
2014-08-04 12:21:48,318 INFO [IPC Server handler 10 on 49784] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from attempt_1403771939632_0413_m_000002_1
2014-08-04 12:21:48,370 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:zslf023 (auth:SIMPLE) cause:org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp.-ext-10001/_tmp.000006_0: File does not exist. Holder DFSClient_attempt_1403771939632_0413_m_000006_0_-820182675_1 does not have any open files.
2014-08-04 12:21:48,370 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.0.0.42:41375: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp.-ext-10001/_tmp.000006_0: File does not exist. Holder DFSClient_attempt_1403771939632_0413_m_000006_0_-820182675_1 does not have any open files.
2014-08-04 12:21:48,378 WARN [Thread-8] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp.-ext-10001/_tmp.000006_0: File does not exist. Holder DFSClient_attempt_1403771939632_0413_m_000006_0_-820182675_1 does not have any open files.
2014-08-04 12:21:48,379 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient: Failed to close file /tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp.-ext-10001/_tmp.000006_0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp.-ext-10001/_tmp.000006_0: File does not exist. Holder DFSClient_attempt_1403771939632_0413_m_000006_0_-820182675_1 does not have any open files.
2014-08-04 12:21:48,383 INFO [IPC Server handler 11 on 49784] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1403771939632_0413_m_000000_1
2014-08-04 12:21:48,620 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:14 AssignedReds:0 CompletedMaps:2 CompletedReds:0 ContAlloc:15 ContRel:0 HostLocal:5 RackLocal:10
2014-08-04 12:21:49,198 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1403771939632_0413_01_000016 Container Transitioned from RUNNING to COMPLETED
2014-08-04 12:21:49,199 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=zslf023 OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1403771939632_0413 CONTAINERID=container_1403771939632_0413_01_000016
2014-08-04 12:21:49,199 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1403771939632_0413_000001 released container container_1403771939632_0413_01_000016 on node: host: slave4.hadoop.local:57340 #containers=1 available=24576 used=4096 with event: FINISHED
2014-08-04 12:21:49,199 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1403771939632_0413_01_000016, NodeId: slave4.hadoop.local:57340, NodeHttpAddress: slave4.hadoop.local:8042, Resource: <memory:4096, vCores:1>, Priority: 20, State: NEW, Token: null, Status: container_id {, app_attempt_id {, application_id {, id: 413, cluster_timestamp: 1403771939632, }, attemptId: 1, }, id: 16, }, state: C_COMPLETE, diagnostics: "Container killed by the ApplicationMaster.\n\n", exit_status: 143, ] resource=<memory:4096, vCores:1> queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:57344, vCores:14>usedCapacity=0.2857143, absoluteUsedCapacity=0.2857143, numApps=1, numContainers=14 usedCapacity=0.2857143 absoluteUsedCapacity=0.2857143 used=<memory:57344, vCores:14> cluster=<memory:200704, vCores:112>
2014-08-04 12:21:49,199 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:57344, vCores:14> numContainers=14 user=zslf023 user-resources=<memory:57344, vCores:14>
2014-08-04 12:21:49,199 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=0.2857143 absoluteUsedCapacity=0.2857143 used=<memory:57344, vCores:14> cluster=<memory:200704, vCores:112>
2014-08-04 12:21:49,199 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1403771939632_0413_01_000016 in state: COMPLETED event:FINISHED
2014-08-04 12:21:49,199 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode: Released container container_1403771939632_0413_01_000016 of capacity <memory:4096, vCores:1> on host slave4.hadoop.local:57340, which currently has 1 containers, <memory:4096, vCores:1> used and <memory:24576, vCores:15> available, release resources=true
2014-08-04 12:21:49,215 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=zslf023 OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1403771939632_0413 CONTAINERID=container_1403771939632_0413_01_000009
2014-08-04 12:21:49,215 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1403771939632_0413_01_000009 Container Transitioned from RUNNING to COMPLETED
2014-08-04 12:21:49,215 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:53248, vCores:13> numContainers=13 user=zslf023 user-resources=<memory:53248, vCores:13>
2014-08-04 12:21:49,215 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1403771939632_0413_01_000009 in state: COMPLETED event:FINISHED
2014-08-04 12:21:49,215 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode: Released container container_1403771939632_0413_01_000009 of capacity <memory:4096, vCores:1> on host slave2.hadoop.local:51269, which currently has 6 containers, <memory:24576, vCores:6> used and <memory:4096, vCores:10> available, release resources=true
2014-08-04 12:21:49,216 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1403771939632_0413_000001 released container container_1403771939632_0413_01_000009 on node: host: slave2.hadoop.local:51269 #containers=6 available=4096 used=24576 with event: FINISHED
2014-08-04 12:21:49,216 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1403771939632_0413_01_000009, NodeId: slave2.hadoop.local:51269, NodeHttpAddress: slave2.hadoop.local:8042, Resource: <memory:4096, vCores:1>, Priority: 20, State: NEW, Token: null, Status: container_id {, app_attempt_id {, application_id {, id: 413, cluster_timestamp: 1403771939632, }, attemptId: 1, }, id: 9, }, state: C_COMPLETE, diagnostics: "Container killed by the ApplicationMaster.\n\n", exit_status: 143, ] resource=<memory:4096, vCores:1> queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:53248, vCores:13>usedCapacity=0.26530612, absoluteUsedCapacity=0.26530612, numApps=1, numContainers=13 usedCapacity=0.26530612 absoluteUsedCapacity=0.26530612 used=<memory:53248, vCores:13> cluster=<memory:200704, vCores:112>
2014-08-04 12:21:49,216 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=0.26530612 absoluteUsedCapacity=0.26530612 used=<memory:53248, vCores:13> cluster=<memory:200704, vCores:112>
2014-08-04 12:21:49,363 INFO [IPC Server handler 12 on 49784] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1403771939632_0413_m_000001_1
2014-08-04 12:21:49,502 INFO [IPC Server handler 13 on 49784] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1403771939632_0413_m_000003_1
2014-08-04 12:21:49,623 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1403771939632_0413_01_000016
2014-08-04 12:21:49,624 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1403771939632_0413_m_000006_0: Container killed by the ApplicationMaster.
2014-08-04 12:21:49,624 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1403771939632_0413_m_000006_1: Container killed by the ApplicationMaster.
2014-08-04 12:21:49,624 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:12 AssignedReds:0 CompletedMaps:2 CompletedReds:0 ContAlloc:15 ContRel:0 HostLocal:5 RackLocal:10
2014-08-04 12:21:49,624 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1403771939632_0413_01_000009
2014-08-04 12:21:49,702 INFO [IPC Server handler 14 on 49784] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1403771939632_0413_m_000004_1

From: Ana Gillan <ana.gillan@gmail.com>
Date: Saturday, 2 August 2014 20:02
To: <user@hadoop.apache.org>
Subject: Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Ok, I will request this to be done, as I'm not an admin, and then get back to this thread on Monday. Thank you!

From: hadoop hive <hadoophive@gmail.com>
Reply-To: <user@hadoop.apache.org>
Date: Saturday, 2 August 2014 18:50
To: <user@hadoop.apache.org>
Subject: Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Hey, try changing the ulimit to 64k for the user running the query, and change the timeout in the scheduler, which is likely set to 600 sec. Check the JT logs for further issues as well.

Thanks

From: Ana Gillan <ana.gillan@gmail.com>
Date: Saturday, 2 August 2014 18:38
To: <user@hadoop.apache.org>
Subject: Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

I'm not sure which user is fetching the data, but I'm assuming no one changed that from the default. The data isn't huge in size, just in number, so I suppose the open files limit is not the issue?

I'm running the job again with mapred.task.timeout=1200000, but containers are still being killed in the same way, just without the timeout message. And it somehow massively slowed down the machine as well, so even typing commands took a long time (???)
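For reference, setting that timeout for a single Hive session looks roughly like this; a sketch only (mapred.task.timeout is the older property name used in this thread; Hadoop 2 also accepts its newer alias, mapreduce.task.timeout):

# Raise the task timeout from the 600s default to 20 minutes
# (the value is in milliseconds) for this session, then re-run the query.
hive -e "
set mapred.task.timeout=1200000;
-- the failing column-selection query goes here
"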
I'm not sure what you mean by which stage it's getting killed on. If you mean the command-line progress counters, it's always on Stage-1.

Also, this is the end of the container log for the killed container. Failed and killed jobs always start fine, with lots of these "processing file" and "processing alias" statements, but then suddenly warn about a DataStreamer exception and are then killed with an error that is the same as the warning. I'm not sure if this exception is the actual issue or just a knock-on effect of something else.

2014-08-02 17:47:38,618 INFO [main] org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w63.xml.gz
2014-08-02 17:47:38,641 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:38,932 INFO [main] org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w67.xml.gz
2014-08-02 17:47:38,989 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:42,675 INFO [main] org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w6i.xml.gz
2014-08-02 17:47:42,888 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:45,416 WARN [Thread-8] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.-ext-10001/_tmp.000006_0: File does not exist. Holder DFSClient_attempt_1403771939632_0409_m_000006_0_303479000_1 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2217)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2137)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
    at org.apache.hadoop.ipc.Client.call(Client.java:1240)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
    at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
    at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:311)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1156)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1009)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:464)
2014-08-02 17:47:45,417 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient: Failed to close file /tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.-ext-10001/_tmp.000006_0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.-ext-10001/_tmp.000006_0: File does not exist. Holder DFSClient_attempt_1403771939632_0409_m_000006_0_303479000_1 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2217)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2137)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
    at org.apache.hadoop.ipc.Client.call(Client.java:1240)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
    at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
    at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:311)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1156)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1009)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:464)

Thanks a lot for your attention!
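A note on gathering traces like the ones above: on a YARN 2.x cluster with log aggregation enabled (yarn.log-aggregation-enable=true), the logs of every container for one application can be pulled in a single command; a sketch, with the application id reconstructed from the attempt names in the trace:

# Dump the logs of all containers that ran under the failing application,
# then search for the killed attempts and the DataStreamer exceptions.
yarn logs -applicationId application_1403771939632_0409 > app_0409.log
grep -n -e 'LeaseExpiredException' -e 'Container killed' app_0409.log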
From: hadoop hive <hadoophive@gmail.com>
Reply-To: <user@hadoop.apache.org>
Date: Saturday, 2 August 2014 17:36
To: <user@hadoop.apache.org>
Subject: Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

32k seems fine for the mapred user (I hope you are using this user to fetch your data), but if you have huge data on your system you can try 64k. Did you try increasing the timeout from 600 sec to something like 20 mins? Can you also check at which stage it gets hung or killed?

Thanks

On Aug 2, 2014 9:38 PM, "Ana Gillan" <ana.gillan@gmail.com> wrote:
> Filemax across the cluster is set to over 6 million. I've checked the open file limits for the accounts used by the Hadoop daemons, and they have an open file limit of 32K. This is confirmed by the various .out files, e.g.
>
> /var/log/hadoop-hdfs/hadoop-hdfs-datanode-slave1.out
>
> contains "open files (-n) 32768". Is this too low? What is the recommended value for open files on all nodes? And does my own user need to have the same value?
>
> I've also tried running the same column selection on files crushed by the filecrush program (https://github.com/edwardcapriolo/filecrush/). This created 5 large files out of the 10,000 small files (still totalling 2gb compressed), but this job won't progress past 0% map.
>
> From: Ana Gillan <ana.gillan@gmail.com>
> Date: Saturday, 2 August 2014 16:36
> To: <user@hadoop.apache.org>
> Subject: Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)
>
> For my own user? It is as follows:
>
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 483941
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 800
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> From: hadoop hive <hadoophive@gmail.com>
> Reply-To: <user@hadoop.apache.org>
> Date: Saturday, 2 August 2014 16:34
> To: <user@hadoop.apache.org>
> Subject: Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)
>
> Can you check the ulimit for your user? That might be causing this.
>
> On Aug 2, 2014 8:54 PM, "Ana Gillan" <ana.gillan@gmail.com> wrote:
>> Hi everyone,
>>
>> I am having an issue with MapReduce jobs running through Hive being killed after 600s timeouts, and with very simple jobs taking over 3 hours (or just failing) for a set of files with a compressed size of only 1-2gb. I will try and provide as much information as I can here, so if someone can help, that would be really great.
>>
>> I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:
>>
>> Master node:
>> - 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>> - 64GB DDR3 SDRAM
>> - 8 x 2TB SAS 600 hard drive (arranged as RAID 1 and RAID 5)
>>
>> Slave nodes (each):
>> - Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>> - 32GB DDR3 SDRAM
>> - 4 x 2TB SATA-3 hard drive
>>
>> Operating system on all nodes: openSUSE Linux 13.1
>>
>> We have the Apache BigTop package version 0.7, with Hadoop version 2.0.6-alpha and Hive version 0.11. YARN has been configured as per these recommendations: http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
>>
>> I also set the following additional settings before running jobs:
>> set yarn.nodemanager.resource.cpu-vcores=4;
>> set mapred.tasktracker.map.tasks.maximum=4;
>> set hive.hadoop.supports.splittable.combineinputformat=true;
>> set hive.merge.mapredfiles=true;
>>
>> No one else uses this cluster while I am working.
>>
>> What I'm trying to do:
>> I have a bunch of XML files on HDFS, which I am reading into Hive using this SerDe: https://github.com/dvasilen/Hive-XML-SerDe. I then want to create a series of tables from these files and finally run a Python script on one of them to perform some scientific calculations. The files are in .xml.gz format and (uncompressed) are only about 4mb in size each. hive.input.format is set to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to avoid the "small files problem."
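As a rough sketch, a table over such gzipped XML files with that SerDe looks like the following. The SerDe and input-format class names are the ones documented in the linked repository's README; the column name and XPath are made up for illustration, while the table name and HDFS location are the ones appearing in the container log earlier in the thread:

# Hypothetical DDL for the XML-backed table; only the class names are
# taken from the dvasilen/Hive-XML-SerDe documentation.
hive <<'EOF'
CREATE EXTERNAL TABLE foldernm_xml_load (
  record_id STRING
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "column.xpath.record_id" = "/record/@id"
)
STORED AS
  INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/usnm123/foldernm/fivek'
TBLPROPERTIES (
  "xmlinput.start" = "<record",
  "xmlinput.end"   = "</record>"
);
EOF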
>>
>> Problems:
>> My HQL statements work perfectly for up to 1000 of these files. Even for much larger numbers, doing select * works fine, which means the files are being read properly. But if I do something as simple as selecting just one column from the whole table for a larger number of files, containers start being killed and jobs fail with this error in the container logs:
>>
>> 2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient: Failed to close file /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0: File does not exist. Holder DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have any open files.
>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
>>
>> Killed jobs show the above and also the following message:
>> AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600 secs. Container killed by the ApplicationMaster.
>>
>> Also, in the node logs, I get a lot of pings like this:
>> INFO [IPC Server handler 17 on 40961] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1403771939632_0362_m_000002_0
>>
>> For 5000 files (1gb compressed), the selection of a single column finishes, but takes over 3 hours. For 10,000 files, the job hangs at about 4% map and then errors out.
>>
>> While the jobs are running, I notice that the containers are not evenly distributed across the cluster. Some nodes lie idle, while the application master node runs 7 containers, maxing out the 28gb of RAM allocated to Hadoop on each slave node.
>>
>> This is the output of netstat -i while the column selection is running:
>>
>> Kernel Interface table
>> Iface   MTU Met    RX-OK RX-ERR  RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
>> eth0   1500   0 79515196      0 2265807      0 45694758      0      0      0 BMRU
>> eth1   1500   0 77410508      0       0      0 40815746      0      0      0 BMRU
>> lo    65536   0 16593808      0       0      0 16593808      0      0      0 LRU
>>
>> Are there some settings I am missing that mean the cluster isn't processing this data as efficiently as it can?
>>
>> I am very new to Hadoop, and there are so many logs, etc. that troubleshooting can be a bit overwhelming. Where else should I be looking to try and diagnose what is wrong?
>>
>> Thanks in advance for any help you can give!
>>
>> Kind regards,
>> Ana
Hi,
Unfortunately, after I set my us= er’s ulimit –n to  65536, I still get the same = bad performance, killed containers and errors as before.

<= /div>
I collected together a bunch of logs around the moment when th= e containers are being killed (application master log, killed container log,= hadoop-hdfs logs, hadoop-yarn logs) and this is what’s happening in o= rder. The PrivilegedActionException, is that something? Is my user not in so= me sort of correct group?

Thanks in advance= !
Ana

2014-08-04 12:21:4= 7,868 INFO [IPC Server handler 4 on 49784] org.apache.hadoop.mapred.TaskAtte= mptListenerImpl: Ping from attempt_1403771939632_0413_m_000006_0
2= 014-08-04 12:21:47,979 INFO [IPC Server handler 5 on 49784] org.apache.hadoo= p.mapred.TaskAttemptListenerImpl: Ping from attempt_1403771939632_0413_m_000= 000_0
2014-08-04 12:21:48,102 INFO [IPC Server handler 6 on 49784]= org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt a= ttempt_1403771939632_0413_m_000002_0 is : 0.061734322
2014-08-04 1= 2:21:48,102 INFO [IPC Server handler 6 on 49784] org.apache.hadoop.mapred.Ta= skAttemptListenerImpl: Status update from attempt_1403771939632_0413_m_00000= 2_0
2014-08-04 12:21:48,110 INFO org.apache.hadoop.hdfs.StateChang= e: BLOCK* allocateBlock: /tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240= 516609553931504/_task_tmp.-ext-10001/_tmp.000006_1. BP-1198908146-10.0.0.11-= 1399888065184 blk_-300655159790361685_67008{blockUCState=3DUNDER_CONSTRUCTION,= primaryNodeIndex=3D-1, replicas=3D[ReplicaUnderConstruction[10.0.0.44:50010|RBW= ], ReplicaUnderConstruction[10.0.0.43:50010|RBW], ReplicaUnderConstruction[1= 0.0.0.40:50010|RBW]]}
2014-08-04 12:21:48,145 INFO BlockStateChang= e: BLOCK* addStoredBlock: blockMap updated: 10.0.0.40:50010 is added to blk_= -300655159790361685_67008{blockUCState=3DUNDER_CONSTRUCTION, primaryNodeIndex=3D= -1, replicas=3D[ReplicaUnderConstruction[10.0.0.44:50010|RBW], ReplicaUnderCon= struction[10.0.0.43:50010|RBW], ReplicaUnderConstruction[10.0.0.40:50010|RBW= ]]} size 0
2014-08-04 12:21:48,145 INFO BlockStateChange: BLOCK* a= ddStoredBlock: blockMap updated: 10.0.0.43:50010 is added to blk_-3006551597= 90361685_67008{blockUCState=3DUNDER_CONSTRUCTION, primaryNodeIndex=3D-1, replica= s=3D[ReplicaUnderConstruction[10.0.0.44:50010|RBW], ReplicaUnderConstruction[1= 0.0.0.43:50010|RBW], ReplicaUnderConstruction[10.0.0.40:50010|RBW]]} size 0<= /div>
2014-08-04 12:21:48,146 INFO BlockStateChange: BLOCK* addStoredBlo= ck: blockMap updated: 10.0.0.44:50010 is added to blk_-300655159790361685_67= 008{blockUCState=3DUNDER_CONSTRUCTION, primaryNodeIndex=3D-1, replicas=3D[ReplicaU= nderConstruction[10.0.0.44:50010|RBW], ReplicaUnderConstruction[10.0.0.43:50= 010|RBW], ReplicaUnderConstruction[10.0.0.40:50010|RBW]]} size 0
2= 014-08-04 12:21:48,147 INFO org.apache.hadoop.hdfs.StateChange: DIR* complet= eFile: /tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_t= ask_tmp.-ext-10001/_tmp.000006_1 is closed by DFSClient_attempt_140377193963= 2_0413_m_000006_1_-1142346950_1
2014-08-04 12:21:48,147 INFO org.a= pache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 422 Tot= al time for transactions(ms): 5 Number of transactions batched in Syncs: 0 N= umber of syncs: 186 SyncTimes(ms): 204 
2014-08-04 12:21:48,1= 55 INFO [IPC Server handler 7 on 49784] org.apache.hadoop.mapred.TaskAttempt= ListenerImpl: Progress of TaskAttempt attempt_1403771939632_0413_m_000006_1 = is : 0.93497133
2014-08-04 12:21:48,155 INFO [IPC Server handler 7= on 49784] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update f= rom attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,176 = INFO [IPC Server handler 8 on 49784] org.apache.hadoop.mapred.TaskAttemptLis= tenerImpl: Progress of TaskAttempt attempt_1403771939632_0413_m_000006_1 is = : 1.0
2014-08-04 12:21:48,176 INFO [IPC Server handler 8 on 49784]= org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from attemp= t_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,177 INFO [Asyn= cDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskA= ttemptImpl: attempt_1403771939632_0413_m_000006_1 TaskAttempt Transitioned f= rom RUNNING to SUCCESS_CONTAINER_CLEANUP
2014-08-04 12:21:48,177 I= NFO [IPC Server handler 9 on 49784] org.apache.hadoop.mapred.TaskAttemptList= enerImpl: Done acknowledgement from attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,178 INFO [ContainerLauncher #4] org.apache.hadoo= p.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_140377193= 9632_0413_m_000006_1
2014-08-04 12:21:48,178 INFO [ContainerLaunch= er #4] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Pr= ocessing the event EventType: CONTAINER_REMOTE_CLEANUP for container contain= er_1403771939632_0413_01_000016 taskAttempt attempt_1403771939632_0413_m_000= 006_1
2014-08-04 12:21:48,195 INFO [AsyncDispatcher event handler]= org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_140377= 1939632_0413_m_000006_1 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEA= NUP to SUCCEEDED
2014-08-04 12:21:48,196 INFO [AsyncDispatcher eve= nt handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num complet= ed Tasks: 2
2014-08-04 12:21:48,196 INFO [AsyncDispatcher event ha= ndler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_= 1403771939632_0413_m_000006_0 TaskAttempt Transitioned from RUNNING to KILL_= CONTAINER_CLEANUP
2014-08-04 12:21:48,196 INFO [AsyncDispatcher ev= ent handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Issuing k= ill to other attempt attempt_1403771939632_0413_m_000006_0
2014-08= -04 12:21:48,196 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapr= educe.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_14037719= 39632_0413_m_000006_1
2014-08-04 12:21:48,196 INFO [AsyncDispatche= r event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_= 1403771939632_0413_m_000006 Task Transitioned from RUNNING to SUCCEEDED
2014-08-04 12:21:48,197 INFO [ContainerLauncher #7] org.apache.hadoop.= mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_14037719396= 32_0413_m_000006_0
2014-08-04 12:21:48,197 INFO [ContainerLauncher= #7] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Proc= essing the event EventType: CONTAINER_REMOTE_CLEANUP for container container= _1403771939632_0413_01_000009 taskAttempt attempt_1403771939632_0413_m_00000= 6_0
2014-08-04 12:21:48,219 INFO [AsyncDispatcher event handler] o= rg.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_14037719= 39632_0413_m_000006_0 TaskAttempt Transitioned from KILL_CONTAINER_CLEANUP t= o KILL_TASK_CLEANUP
2014-08-04 12:21:48,230 INFO [CommitterEvent P= rocessor #1] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler= : Processing the event EventType: TASK_ABORT
2014-08-04 12:21:48,2= 31 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.j= ob.impl.TaskAttemptImpl: attempt_1403771939632_0413_m_000006_0 TaskAttempt T= ransitioned from KILL_TASK_CLEANUP to KILLED
2014-08-04 12:21:48,3= 18 INFO [IPC Server handler 10 on 49784] org.apache.hadoop.mapred.TaskAttemp= tListenerImpl: Progress of TaskAttempt attempt_1403771939632_0413_m_000002_1= is : 0.047582902
2014-08-04 12:21:48,318 INFO [IPC Server handler= 10 on 49784] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status updat= e from attempt_1403771939632_0413_m_000002_1
2014-08-04 12:21:48,370 ERROR org.apache.had= oop.security.UserGroupInformation: PriviledgedActionException as:zslf023 (au= th:SIMPLE) cause:org.apache.hadoop.hdfs.server.namenode.LeaseExpiredExceptio= n: No lease on /tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_82405166095539= 31504/_task_tmp.-ext-10001/_tmp.000006_0: File does not exist. Holder DFSCli= ent_attempt_1403771939632_0413_m_000006_0_-820182675_1 does not have any ope= n files.
2014-08-04 12:21:48,370 INFO org.apache.hadoop.ipc.Server: IPC Server= handler 9 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addB= lock from 10.0.0.42:41375: error: org.apache.hadoop.hdfs.server.namenode.Lea= seExpiredException: No lease on /tmp/hive-zslf023/hive_2014-08-04_12-16-12_4= 75_8240516609553931504/_task_tmp.-ext-10001/_tmp.000006_0: File does not exi= st. Holder DFSClient_attempt_1403771939632_0413_m_000006_0_-820182675_1 does= not have any open files.
2014-08-04 12:21:48,378 WARN [Thread-8] org.apache.h= adoop.hdfs.DFSClient: DataStreamer Exception 
org.apache.hadoop.ipc.Remo= teException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): N= o lease on /tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_824051660955393150= 4/_task_tmp.-ext-10001/_tmp.000006_0: File does not exist. Holder DFSClient_= attempt_1403771939632_0413_m_000006_0_-820182675_1 does not have any open fi= les.
= 2014-08-04 12:21:48,379 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient= : Failed to close file /tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_824051= 6609553931504/_task_tmp.-ext-10001/_tmp.000006_0
<= b style=3D"background-color: rgb(255, 255, 10);">org.apache.hadoop.ipc.RemoteE= xception(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No l= ease on /tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_= task_tmp.-ext-10001/_tmp.000006_0: File does not exist. Holder DFSClient_att= empt_1403771939632_0413_m_000006_0_-820182675_1 does not have any open files= .
2014-08-04 12:21:48,383 INFO [IPC Server handler 11 on= 49784] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_= 1403771939632_0413_m_000000_1
2014-08-04 12:21:48,620 INFO [RMComm= unicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocat= or: Before Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 Assigne= dMaps:14 AssignedReds:0 CompletedMaps:2 CompletedReds:0 ContAlloc:15 ContRel= :0 HostLocal:5 RackLocal:10
2014-08-04 12:21:49,198 INFO org.apach= e.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_= 1403771939632_0413_01_000016 Container Transitioned from RUNNING to COMPLETE= D
2014-08-04 12:21:49,199 INFO org.apache.hadoop.yarn.server.resou= rcemanager.RMAuditLogger: USER=3Dzslf023 OPERATION=3DAM Released Container TARGET=3DSchedulerApp RESULT=3DSUCCESS APPID=3Dapplication_1403771939632_04= 13 CONTAINERID=3Dc= ontainer_1403771939632_0413_01_000016
2014-08-04 12:21:49,199 INFO= org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySc= heduler: Application appattempt_1403771939632_0413_000001 released container= container_1403771939632_0413_01_000016 on node: host: slave4.hadoop.local:5= 7340 #containers=3D1 available=3D24576 used=3D4096 with event: FINISHED
= 2014-08-04 12:21:49,199 INFO org.apache.hadoop.yarn.server.resourcemanager.s= cheduler.capacity.LeafQueue: completedContainer container=3DContainer: [Contai= nerId: container_1403771939632_0413_01_000016, NodeId: slave4.hadoop.local:5= 7340, NodeHttpAddress: slave4.hadoop.local:8042, Resource: <memory:4096, = vCores:1>, Priority: 20, State: NEW, Token: null, Status: container_id {,= app_attempt_id {, application_id {, id: 413, cluster_timestamp: 14037719396= 32, }, attemptId: 1, }, id: 16, }, state: C_COMPLETE, diagnostics: "Containe= r killed by the ApplicationMaster.\n\n", exit_status: 143, ] resource=3D<me= mory:4096, vCores:1> queue=3Ddefault: capacity=3D1.0, absoluteCapacity=3D1.0, u= sedResources=3D<memory:57344, vCores:14>usedCapacity=3D0.2857143, absolute= UsedCapacity=3D0.2857143, numApps=3D1, numContainers=3D14 usedCapacity=3D0.2857143 a= bsoluteUsedCapacity=3D0.2857143 used=3D<memory:57344, vCores:14> cluster=3D&= lt;memory:200704, vCores:112>
2014-08-04 12:21:49,199 INFO org.= apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: defa= ult used=3D<memory:57344, vCores:14> numContainers=3D14 user=3Dzslf023 user-= resources=3D<memory:57344, vCores:14>
2014-08-04 12:21:49,199 = INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.Parent= Queue: completedContainer queue=3Droot usedCapacity=3D0.2857143 absoluteUsedCapa= city=3D0.2857143 used=3D<memory:57344, vCores:14> cluster=3D<memory:20070= 4, vCores:112>
2014-08-04 12:21:49,199 INFO org.apache.hadoop.y= arn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed= container: container_1403771939632_0413_01_000016 in state: COMPLETED event= :FINISHED
2014-08-04 12:21:49,199 INFO org.apache.hadoop.yarn.serv= er.resourcemanager.scheduler.common.fica.FiCaSchedulerNode: Released contain= er container_1403771939632_0413_01_000016 of capacity <memory:4096, vCore= s:1> on host slave4.hadoop.local:57340, which currently has 1 containers,= <memory:4096, vCores:1> used and <memory:24576, vCores:15> avai= lable, release resources=3Dtrue
2014-08-04 12:21:49,215 INFO org.apa= che.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=3Dzslf023 OPERATION=3DAM Released Con= tainer TARGET=3DSc= hedulerApp RESUL= T=3DSUCCESS APPID=3D= application_1403771939632_0413 CONTAINERID=3Dcontainer_1403771939632_0413_01_000009
= 2014-08-04 12:21:49,215 INFO org.apache.hadoop.yarn.server.resourcemanager.r= mcontainer.RMContainerImpl: container_1403771939632_0413_01_000009 Container= Transitioned from RUNNING to COMPLETED
2014-08-04 12:21:49,215 IN= FO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueu= e: default used=3D<memory:53248, vCores:13> numContainers=3D13 user=3Dzslf02= 3 user-resources=3D<memory:53248, vCores:13>
2014-08-04 12:21:= 49,215 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.f= ica.FiCaSchedulerApp: Completed container: container_1403771939632_0413_01_0= 00009 in state: COMPLETED event:FINISHED
2014-08-04 12:21:49,215 I= NFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCa= SchedulerNode: Released container container_1403771939632_0413_01_000009 of = capacity <memory:4096, vCores:1> on host slave2.hadoop.local:51269, wh= ich currently has 6 containers, <memory:24576, vCores:6> used and <= memory:4096, vCores:10> available, release resources=3Dtrue
2014-= 08-04 12:21:49,216 INFO org.apache.hadoop.yarn.server.resourcemanager.schedu= ler.capacity.CapacityScheduler: Application appattempt_1403771939632_0413_00= 0001 released container container_1403771939632_0413_01_000009 on node: host= : slave2.hadoop.local:51269 #containers=3D6 available=3D4096 used=3D24576 with eve= nt: FINISHED
2014-08-04 12:21:49,216 INFO org.apache.hadoop.yarn.s= erver.resourcemanager.scheduler.capacity.LeafQueue: completedContainer conta= iner=3DContainer: [ContainerId: container_1403771939632_0413_01_000009, NodeId= : slave2.hadoop.local:51269, NodeHttpAddress: slave2.hadoop.local:8042, Reso= urce: <memory:4096, vCores:1>, Priority: 20, State: NEW, Token: null, = Status: container_id {, app_attempt_id {, application_id {, id: 413, cluster= _timestamp: 1403771939632, }, attemptId: 1, }, id: 9, }, state: C_COMPLETE, = diagnostics: "Container killed by the ApplicationMaster.\n\n", exit_status: = 143, ] resource=3D<memory:4096, vCores:1> queue=3Ddefault: capacity=3D1.0, a= bsoluteCapacity=3D1.0, usedResources=3D<memory:53248, vCores:13>usedCapaci= ty=3D0.26530612, absoluteUsedCapacity=3D0.26530612, numApps=3D1, numContainers=3D13 = usedCapacity=3D0.26530612 absoluteUsedCapacity=3D0.26530612 used=3D<memory:5324= 8, vCores:13> cluster=3D<memory:200704, vCores:112>
2014-08= -04 12:21:49,216 INFO org.apache.hadoop.yarn.server.resourcemanager.schedule= r.capacity.ParentQueue: completedContainer queue=3Droot usedCapacity=3D0.2653061= 2 absoluteUsedCapacity=3D0.26530612 used=3D<memory:53248, vCores:13> clust= er=3D<memory:200704, vCores:112>
2014-08-04 12:21:49,363 INFO = [IPC Server handler 12 on 49784] org.apache.hadoop.mapred.TaskAttemptListene= rImpl: Ping from attempt_1403771939632_0413_m_000001_1
2014-08-04 = 12:21:49,502 INFO [IPC Server handler 13 on 49784] org.apache.hadoop.mapred.= TaskAttemptListenerImpl: Ping from attempt_1403771939632_0413_m_000003_1
2014-08-04 12:21:49,623 INFO [RMCommunicator Allocator] org.apache.ha= doop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container = container_1403771939632_0413_01_000016
2014-08-04 12:21:49,624 INF= O [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.imp= l.TaskAttemptImpl: Diagnostics report from attempt_1403771939632_0413_m_0000= 06_0: Container killed by the ApplicationMaster.
2014-08-04 12:21:= 49,624 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.a= pp.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1403771939632_0= 413_m_000006_1: Container killed by the ApplicationMaster.
2014-08= -04 12:21:49,624 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce= .v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMa= ps:0 ScheduledReds:0 AssignedMaps:12 AssignedReds:0 CompletedMaps:2 Complete= dReds:0 ContAlloc:15 ContRel:0 HostLocal:5 RackLocal:10
2014-08-04= 12:21:49,624 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2= .app.rm.RMContainerAllocator: Received completed container container_1403771= 939632_0413_01_000009
2014-08-04 12:21:49,702 INFO [IPC Server han= dler 14 on 49784] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping fro= m attempt_1403771939632_0413_m_000004_1



F= rom: Ana Gillan <ana.gillan= @gmail.com>
Date: Saturday,= 2 August 2014 20:02
To: <user@hadoop.apache.org>
Subject: Re: org.apache.hadoop.ipc.RemoteExc= eption(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Ok, I will request this to b= e done, as I’m not an admin, and then get back to this thread on Monda= y. Thank you! 

From: hadoop hive <hadoophive@gmail= .com>
Reply-To: <user@hadoop.apache.org>
Date: Saturday, 2 August 2014 18:50
To: <user@hadoop.apache.org>
Subject= : Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.s= erver.namenode.LeaseExpiredException)

H= ey try change ulimit to 64k for user which running query and change time fro= m scheduler which should be set to 600sec.

Check the jt logs= also for further issues.

Thanks


From: Ana Gillan <ana.gillan@gmail.com>
Date:=  Saturday, 2 August 2014 18:38
To: <user@hadoop.a= pache.org>
Subject: R= e: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namen= ode.LeaseExpiredException)

I’m not sure which user is fetching the data, but I’m = assuming no one changed that from the default. The data isn’t huge in = size, just in number, so I suppose the open files limit is not the issue?

I’m running the job again with mapred.task.time= out=3D1200000, but containers are still being killed in the same way… Ju= st without the timeout message. And it somehow massively slowed down the mac= hine as well, so even typing commands took a long time (???)

<= /div>
I’m not sure what you mean by which stage it’s getting= killed on. If you mean in the command line progress counters, it's always o= n Stage-1.
Also, this is the end of the container log for the kill= ed container. Failed and killed jobs always start fine with lots of these &#= 8220;processing file” and “processing alias” statements, b= ut then suddenly warn about a DataStreamer Exception and then are killed wit= h an error, which is the same as the warning. Not sure if this exception is = the actual issue or if it’s just a knock-on effect of something else.<= /div>

2014-08-02 17:47:38,618 INFO [main] org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w63.xml.gz
2014-08-02 17:47:38,641 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:38,932 INFO [main] org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w67.xml.gz
2014-08-02 17:47:38,989 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:42,675 INFO [main] org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w6i.xml.gz
2014-08-02 17:47:42,888 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:45,416 WARN [Thread-8] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.-ext-10001/_tmp.000006_0: File does not exist. Holder DFSClient_attempt_1403771939632_0409_m_000006_0_303479000_1 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2217)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2137)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)

    at org.apache.hadoop.ipc.Client.call(Client.java:1240)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
    at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:311)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1156)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1009)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:464)
2014-08-02 17:47:45,417 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient: Failed to close file /tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.-ext-10001/_tmp.000006_0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.-ext-10001/_tmp.000006_0: File does not exist. Holder DFSClient_attempt_1403771939632_0409_m_000006_0_303479000_1 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2217)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2137)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)

    at org.apache.hadoop.ipc.Client.call(Client.java:1240)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
    at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
    at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:311)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1156)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1009)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:464)

Thanks a lot for your attention!

From: hadoop hive <hadoophive@gmail.com>
Reply-To: <user@hadoop.apache.org>
Date: Saturday, 2 August 2014 17:36
To: <user@hadoop.apache.org>
Subject: Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

32k seems fine for the mapred user (hoping you are using this user to fetch your data), but if you have huge data on your system you can try 64k.

Did you try increasing your timeout from 600 sec to something like 20 mins?

Can you also check at which stage it's getting hung or killed?

Thanks
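As a concrete example of raising that timeout per job from the Hive session, 20 minutes expressed in milliseconds would be:

    set mapred.task.timeout=1200000;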

On Aug 2, 2014 9:38 PM, "Ana Gillan" <ana.gillan@gmail.com> wrote:
Filemax across the cluster is set to over 6 million. I’ve checked the open file limits for the accounts used by the Hadoop daemons and they have an open file limit of 32K. This is confirmed by the various .out files, e.g.

/var/log/hadoop-hdfs/hadoop-hdfs-datanode-slave1.out

Contains open files (-n) 32768. Is this too low? What is the recommended value for open files on all nodes? Also, does my own user need to have the same value?
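A quick way to compare the cluster-wide handle cap with current usage, and to read a daemon's effective limit, is sketched below (the pgrep pattern is an assumption about the DataNode's process name):

    cat /proc/sys/fs/file-max   # system-wide cap on open file handles
    cat /proc/sys/fs/file-nr    # allocated, free, cap
    cat /proc/$(pgrep -f DataNode | head -1)/limits | grep 'open files'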

I’ve also tried running the same column selection on files crushed by the filecrush program: https://github.com/edwardcapriolo/filecrush/
This created 5 large files out of the 10,000 small files (still totalling 2gb compressed), but this job won’t progress past 0% map.

From: Ana Gillan <ana.gillan@gmail.com>
Date: Saturday, 2 August 2014 16:36
To: <user@hadoop.apache.org>
Subject: Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

For my own user? It is as follows:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 483941
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 800
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


From: hadoop hive <hadoophive@gmail.com>
Reply-To: <user@hadoop.apache.org>
Date: Saturday, 2 August 2014 16:34
To: <user@hadoop.apache.org>
Subject: Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Can you check the ulimit for your user? That might be causing this.

On Aug 2, 2014 8:54 PM, "Ana Gillan" <ana.gillan@gmail.com> wrote:
Hi everyone,

I am having an issue with MapReduce jobs running through Hive being killed after 600s timeouts and with very simple jobs taking over 3 hours (or just failing) for a set of files with a compressed size of only 1-2gb. I will try and provide as much information as I can here, so if someone can help, that would be really great.

I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:

• Master node:
– 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
– 64GB DDR3 SDRAM
– 8 x 2TB SAS 600 hard drive (arranged as RAID 1 and RAID 5)

• Slave nodes (each):
– Intel Xeon 4-core E3-1220v3 @ 3.1GHz
– 32GB DDR3 SDRAM
– 4 x 2TB SATA-3 hard drive

• Operating system on all nodes: openSUSE Linux 13.1

We have the Apache BigTop package version 0.7, with Hadoop version 2.0.6-alpha and Hive version 0.11.
YARN has been configured as per these recommendations: http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/

I also set the following additional settings before running jobs:
set yarn.nodemanager.resource.cpu-vcores=4;
set mapred.tasktracker.map.tasks.maximum=4;
set hive.hadoop.supports.splittable.combineinputformat=true;
set hive.merge.mapredfiles=true;
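One more session setting that can matter with CombineHiveInputFormat is the maximum combined split size; this is only an illustration (the value is an example, not something set in these runs):

    set mapred.max.split.size=268435456;  -- cap each combined split at ~256MB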

No one else uses this cluster while I am working.

What I’m trying to do:
I have a bunch of XML files on HDFS, which I am reading into Hive using this SerDe: https://github.com/dvasilen/Hive-XML-SerDe. I then want to create a series of tables from these files and finally run a Python script on one of them to perform some scientific calculations. The files are .xml.gz format and (uncompressed) are only about 4mb in size each. hive.input.format is set to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to avoid the “small files problem.”
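For context, a table over such files would be declared roughly as follows with this SerDe (a sketch based on the SerDe's documented usage; the element name, column, and XPath below are placeholders, not the real schema):

    -- requires the SerDe jar on the classpath, e.g. via ADD JAR
    CREATE EXTERNAL TABLE xml_docs (body STRING)
    ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
    WITH SERDEPROPERTIES ("column.xpath.body"="/doc/body/text()")
    STORED AS
      INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION '/user/usnm123/foldernm/fivek'
    TBLPROPERTIES ("xmlinput.start"="<doc", "xmlinput.end"="</doc>");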

Problems:
My HQL statements work perfectly for up to 1000 of these files. Even for much larger numbers, doing select * works fine, which means the files are being read properly, but if I do something as simple as selecting just one column from the whole table for a larger number of files, containers start being killed and jobs fail with this error in the container logs:

2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient: Failed to close file /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0: File does not exist. Holder DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)

Killed jobs show the above and also the following message:
AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600 secs
Container killed by the ApplicationMaster.

Also, in the node logs, I get a lot of pings like this:
INFO [IPC Server handler 17 on 40961] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1403771939632_0362_m_000002_0

For 5000 files (1gb compressed), the selection of a single column finishes, but takes over 3 hours. For 10,000 files, the job hangs on about 4% map and then errors out.

While the jobs are running, I notice that the containers are not evenly distributed across the cluster. Some nodes lie idle, while the application master node runs 7 containers, maxing out the 28gb of RAM allocated to Hadoop on each slave node.

This is the output of netstat -i while the column selection is running:
Kernel Interface table
Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0 79515196      0 2265807      0 45694758      0      0      0 BMRU
eth1   1500   0 77410508      0      0      0 40815746      0      0      0 BMRU
lo    65536   0 16593808      0      0      0 16593808      0      0      0 LRU

Are there some settings I am missing that mean the cluster isn’t processing this data as efficiently as it can?

I am very new to Hadoop and there are so many logs, etc., that troubleshooting can be a bit overwhelming. Where else should I be looking to try and diagnose what is wrong?

Thanks in advance for any help you can give!

Kind regards,
Ana

