hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Segel <michael_se...@hotmail.com>
Subject Re: Reduce Hangs at 66%
Date Fri, 04 May 2012 15:22:45 GMT
Well 
That was one of the things I had asked. 
ulimit -a says it all. 

But you have to do this for the users... hdfs, mapred, and hadoop

(Which is why I asked about which flavor.)

On May 3, 2012, at 7:03 PM, Raj Vishwanathan wrote:

> Keith
> 
> What is the the output for ulimit -n? Your value for number of open files is probably
too low.
> 
> Raj
> 
> 
> 
> 
>> ________________________________
>> From: Keith Thompson <kthomps6@binghamton.edu>
>> To: common-user@hadoop.apache.org 
>> Sent: Thursday, May 3, 2012 4:33 PM
>> Subject: Re: Reduce Hangs at 66%
>> 
>> I am not sure about ulimits, but I can answer the rest. It's a Cloudera
>> distribution of Hadoop 0.20.2. The HDFS has 9 TB free. In the reduce step,
>> I am taking keys in the form of (gridID, date), each with a value of 1. The
>> reduce step just sums the 1's as the final output value for the key (It's
>> counting how many people were in the gridID on a certain day).
>> 
>> I have been running other more complicated jobs with no problem, so I'm not
>> sure why this one is being peculiar. This is the code I used to execute the
>> program from the command line (the source is a file on the hdfs):
>> 
>> hadoop jar <jarfile> <driver> <source> /thompson/outputDensity/density1
>> 
>> The program then executes the map and gets to 66% of the reduce, then stops
>> responding. The cause of the error seems to be:
>> 
>> Error from attempt_201202240659_6432_r_000000_1: java.io.IOException:
>>> The temporary job-output directory
>>> hdfs://analytix1:9000/thompson/outputDensity/density1/_temporary
>>> doesn't exist!
>> 
>> I don't understand what the _temporary is. I am assuming it's something
>> Hadoop creates automatically.
>> 
>> 
>> 
>> On Thu, May 3, 2012 at 5:02 AM, Michel Segel <michael_segel@hotmail.com>wrote:
>> 
>>> Well...
>>> Lots of information but still missing some of the basics...
>>> 
>>> Which release and version?
>>> What are your ulimits set to?
>>> How much free disk space do you have?
>>> What are you attempting to do?
>>> 
>>> Stuff like that.
>>> 
>>> 
>>> 
>>> Sent from a remote device. Please excuse any typos...
>>> 
>>> Mike Segel
>>> 
>>> On May 2, 2012, at 4:49 PM, Keith Thompson <kthomps6@binghamton.edu>
>>> wrote:
>>> 
>>>> I am running a task which gets to 66% of the Reduce step and then hangs
>>>> indefinitely. Here is the log file (I apologize if I am putting too much
>>>> here but I am not exactly sure what is relevant):
>>>> 
>>>> 2012-05-02 16:42:52,975 INFO org.apache.hadoop.mapred.JobTracker:
>>>> Adding task (REDUCE) 'attempt_201202240659_6433_r_000000_0' to tip
>>>> task_201202240659_6433_r_000000, for tracker
>>>> 'tracker_analytix7:localhost.localdomain/127.0.0.1:56515'
>>>> 2012-05-02 16:42:53,584 INFO org.apache.hadoop.mapred.JobInProgress:
>>>> Task 'attempt_201202240659_6433_m_000001_0' has completed
>>>> task_201202240659_6433_m_000001 successfully.
>>>> 2012-05-02 17:00:47,546 INFO org.apache.hadoop.mapred.TaskInProgress:
>>>> Error from attempt_201202240659_6432_r_000000_0: Task
>>>> attempt_201202240659_6432_r_000000_0 failed to report status for 1800
>>>> seconds. Killing!
>>>> 2012-05-02 17:00:47,546 INFO org.apache.hadoop.mapred.JobTracker:
>>>> Removing task 'attempt_201202240659_6432_r_000000_0'
>>>> 2012-05-02 17:00:47,546 INFO org.apache.hadoop.mapred.JobTracker:
>>>> Adding task (TASK_CLEANUP) 'attempt_201202240659_6432_r_000000_0' to
>>>> tip task_201202240659_6432_r_000000, for tracker
>>>> 'tracker_analytix4:localhost.localdomain/127.0.0.1:44204'
>>>> 2012-05-02 17:00:48,763 INFO org.apache.hadoop.mapred.JobTracker:
>>>> Removing task 'attempt_201202240659_6432_r_000000_0'
>>>> 2012-05-02 17:00:48,957 INFO org.apache.hadoop.mapred.JobTracker:
>>>> Adding task (REDUCE) 'attempt_201202240659_6432_r_000000_1' to tip
>>>> task_201202240659_6432_r_000000, for tracker
>>>> 'tracker_analytix5:localhost.localdomain/127.0.0.1:59117'
>>>> 2012-05-02 17:00:56,559 INFO org.apache.hadoop.mapred.TaskInProgress:
>>>> Error from attempt_201202240659_6432_r_000000_1: java.io.IOException:
>>>> The temporary job-output directory
>>>> hdfs://analytix1:9000/thompson/outputDensity/density1/_temporary
>>>> doesn't exist!
>>>>     at
>>> org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
>>>>     at
>>> org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:240)
>>>>     at
>>> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
>>>>     at
>>> org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:438)
>>>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:416)
>>>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>     at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>>>     at org.apache.hadoop.mapred.Child.main(Child.java:262)
>>>> 
>>>> 2012-05-02 17:00:59,903 INFO org.apache.hadoop.mapred.JobTracker:
>>>> Removing task 'attempt_201202240659_6432_r_000000_1'
>>>> 2012-05-02 17:00:59,906 INFO org.apache.hadoop.mapred.JobTracker:
>>>> Adding task (REDUCE) 'attempt_201202240659_6432_r_000000_2' to tip
>>>> task_201202240659_6432_r_000000, for tracker
>>>> 'tracker_analytix3:localhost.localdomain/127.0.0.1:39980'
>>>> 2012-05-02 17:01:07,200 INFO org.apache.hadoop.mapred.TaskInProgress:
>>>> Error from attempt_201202240659_6432_r_000000_2: java.io.IOException:
>>>> The temporary job-output directory
>>>> hdfs://analytix1:9000/thompson/outputDensity/density1/_temporary
>>>> doesn't exist!
>>>>     at
>>> org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
>>>>     at
>>> org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:240)
>>>>     at
>>> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
>>>>     at
>>> org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:438)
>>>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:416)
>>>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>     at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>>>     at org.apache.hadoop.mapred.Child.main(Child.java:262)
>>>> 
>>>> 2012-05-02 17:01:10,239 INFO org.apache.hadoop.mapred.JobTracker:
>>>> Removing task 'attempt_201202240659_6432_r_000000_2'
>>>> 2012-05-02 17:01:10,283 INFO org.apache.hadoop.mapred.JobTracker:
>>>> Adding task (REDUCE) 'attempt_201202240659_6432_r_000000_3' to tip
>>>> task_201202240659_6432_r_000000, for tracker
>>>> 'tracker_analytix2:localhost.localdomain/127.0.0.1:33297'
>>>> 2012-05-02 17:01:18,188 INFO org.apache.hadoop.mapred.TaskInProgress:
>>>> Error from attempt_201202240659_6432_r_000000_3: java.io.IOException:
>>>> The temporary job-output directory
>>>> hdfs://analytix1:9000/thompson/outputDensity/density1/_temporary
>>>> doesn't exist!
>>>>     at
>>> org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
>>>>     at
>>> org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:240)
>>>>     at
>>> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
>>>>     at
>>> org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:438)
>>>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:416)
>>>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>     at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>>>     at org.apache.hadoop.mapred.Child.main(Child.java:262)
>>>> 
>>>> 2012-05-02 17:01:21,228 INFO org.apache.hadoop.mapred.TaskInProgress:
>>>> TaskInProgress task_201202240659_6432_r_000000 has failed 4 times.
>>>> 2012-05-02 17:01:21,228 INFO org.apache.hadoop.mapred.JobInProgress:
>>>> Aborting job job_201202240659_6432
>>>> 2012-05-02 17:01:21,228 INFO org.apache.hadoop.mapred.JobInProgress:
>>>> Killing job 'job_201202240659_6432'
>>>> 2012-05-02 17:01:21,228 INFO org.apache.hadoop.mapred.JobTracker:
>>>> Adding task (JOB_CLEANUP) 'attempt_201202240659_6432_m_000002_0' to
>>>> tip task_201202240659_6432_m_000002, for tracker
>>>> 'tracker_analytix2:localhost.localdomain/127.0.0.1:33297'
>>>> 2012-05-02 17:01:21,228 INFO org.apache.hadoop.mapred.JobTracker:
>>>> Removing task 'attempt_201202240659_6432_r_000000_3'
>>>> 2012-05-02 17:01:22,443 INFO org.apache.hadoop.mapred.JobInProgress:
>>>> Task 'attempt_201202240659_6432_m_000002_0' has completed
>>>> task_201202240659_6432_m_000002 successfully.
>>>> 2012-05-02 17:01:22,443 INFO
>>>> org.apache.hadoop.mapred.JobInProgress$JobSummary:
>>>> 
>>> jobId=job_201202240659_6432,submitTime=1335990629637,launchTime=1335990629702,firstMapTaskLaunchTime=1335990631252,firstReduceTaskLaunchTime=1335990633988,firstJobSetupTaskLaunchTime=1335990629727,firstJobCleanupTaskLaunchTime=1335992481228,finishTime=1335992482443,numMaps=2,numSlotsPerMap=1,numReduces=1,numSlotsPerReduce=1,user=thompson,queue=default,status=FAILED,mapSlotSeconds=7,reduceSlotsSeconds=32,clusterMapCapacity=138,clusterReduceCapacity=138
>>>> 2012-05-02 17:01:22,446 WARN org.apache.hadoop.hdfs.DFSClient:
>>>> DataStreamer Exception: org.apache.hadoop.ipc.RemoteException:
>>>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease
>>>> on
>>> /thompson/outputDensity/density1/_logs/history/analytix1_1330084785363_job_201202240659_6432_thompson_TwitterDensity
>>>> File does not exist. Holder DFSClient_1931891231 does not have any
>>>> open files.
>>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1542)
>>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1533)
>>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1449)
>>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:649)
>>>>     at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source)
>>>>     at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>>>>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1415)
>>>>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1411)
>>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>     at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>>>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1409)
>>>> 
>>>>     at org.apache.hadoop.ipc.Client.call(Client.java:1104)
>>>>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>>>>     at $Proxy4.addBlock(Unknown Source)
>>>>     at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
>>>>     at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>     at
>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>>>>     at
>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>>>>     at $Proxy4.addBlock(Unknown Source)
>>>>     at
>>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3185)
>>>>     at
>>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3055)
>>>>     at
>>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1900(DFSClient.java:2305)
>>>>     at
>>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2500)
>>>> 
>>>> 2012-05-02 17:01:22,446 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>>> Recovery for block null bad datanode[0] nodes == null
>>>> 2012-05-02 17:01:22,446 WARN org.apache.hadoop.hdfs.DFSClient: Could
>>>> not get block locations. Source file
>>>> 
>>> "/thompson/outputDensity/density1/_logs/history/analytix1_1330084785363_job_201202240659_6432_thompson_TwitterDensity"
>>>> - Aborting...
>>>> 2012-05-02 17:01:22,446 INFO org.apache.hadoop.mapred.JobHistory:
>>>> Moving
>>> file:/var/log/hadoop/history/analytix1_1330084785363_job_201202240659_6432_thompson_TwitterDensity
>>>> to file:/var/log/hadoop/history/done
>>>> 2012-05-02 17:01:22,447 ERROR org.apache.hadoop.hdfs.DFSClient:
>>>> Exception closing file
>>>> 
>>> /thompson/outputDensity/density1/_logs/history/analytix1_1330084785363_job_201202240659_6432_thompson_TwitterDensity
>>>> : org.apache.hadoop.ipc.RemoteException:
>>>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease
>>>> on
>>> /thompson/outputDensity/density1/_logs/history/analytix1_1330084785363_job_201202240659_6432_thompson_TwitterDensity
>>>> File does not exist. Holder DFSClient_1931891231 does not have any
>>>> open files.
>>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1542)
>>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1533)
>>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1449)
>>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:649)
>>>>     at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source)
>>>>     at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>>>>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1415)
>>>>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1411)
>>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>     at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>>>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1409)
>>>> 
>>>> 
>>>> Can anyone explain to me what is causing the problem?
>>>> 
>>>> --
>>>> *Keith Thompson*
>>> 
>> 
>> 
>> 
>> -- 
>> *Keith Thompson*
>> Graduate Research Associate, Xerox Corporation
>> SUNY Research Foundation
>> Dept. of Systems Science and Industrial Engineering
>> Binghamton University
>> Binghamton, NY 13902
>> 
>> 


Mime
View raw message