hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-147) MapTask removed mapout files before the reduce tasks copy them
Date Wed, 19 Apr 2006 16:11:35 GMT
     [ http://issues.apache.org/jira/browse/HADOOP-147?page=all ]

Owen O'Malley updated HADOOP-147:
---------------------------------

    Description: 
I was running a job on a cluster of 138 nodes. The job had 1050 map tasks and 128 reduce tasks.
It got stuck at the reduce stage.
All the reduce tasks were trying to copy files from a map task, with the following status shown
on the web interface:

reduce > copy > task_m_ehz5q1@node1262.foo.com:60040

However, the log on the machine node1262 (where the map task task_m_ehz5q1 ran) showed that
the map task finished even before the 
reduce tasks copied the map output files:

060417 103554 Server connection on port 60050 from 72.30.117.220: starting
060417 103554 task_m_ehz5q1  Client connection to 0.0.0.0:60050: starting
060417 103554 task_m_ehz5q1 1.0% /user/runping/runping/proj/part-00039:0+71
060417 103554 Task task_m_ehz5q1 is done.
060417 103554 parsing file:/local/hadoop/conf2/hadoop-default.xml

......................

060417 103613 parsing file:/local/hadoop/conf2/hadoop-site.xml
060417 103623 task_m_ehz5q1 done; removing files.
060417 103633 parsing file:/local/hadoop/conf2/hadoop-default.xml
060417 103633 parsing file:/local/hadoop/conf2/mapred-default.xml
060417 103633 parsing file:/local/hadoop/conf2/hadoop-site.xml

...........................................

060417 190241 SEVERE Can't open map output:/local/hadoop/mapred/local/task_m_ehz5q1/part-32.out
java.io.FileNotFoundException: /local/hadoop/mapred/local/task_m_ehz5q1/part-32.out
        at org.apache.hadoop.fs.LocalFileSystem.openRaw(LocalFileSystem.java:115)
        at org.apache.hadoop.fs.FSDataInputStream$Checker.<init>(FSDataInputStream.java:46)
        at org.apache.hadoop.fs.FSDataInputStream.<init>(FSDataInputStream.java:228)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:154)
        at org.apache.hadoop.mapred.MapOutputFile.write(MapOutputFile.java:116)
        at org.apache.hadoop.mapred.MapOutputFile.write(MapOutputFile.java:116)
        at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:117)
        at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:231)
060417 190241 Unknown child with bad map output: task_m_ehz5q1. Ignored.
060417 190241 Server handler 2 on 60040 caught: java.io.FileNotFoundException: /local/hadoop/mapred/local/task_m_ehz5q1/part-32.out
java.io.FileNotFoundException: /local/hadoop/mapred/local/task_m_ehz5q1/part-32.out
        at org.apache.hadoop.fs.LocalFileSystem.openRaw(LocalFileSystem.java:115)
        at org.apache.hadoop.fs.FSDataInputStream$Checker.<init>(FSDataInputStream.java:46)
        at org.apache.hadoop.fs.FSDataInputStream.<init>(FSDataInputStream.java:228)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:154)
        at org.apache.hadoop.mapred.MapOutputFile.write(MapOutputFile.java:116)
        at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:117)
        at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:231)
060417 190241 parsing file:/local/hadoop/conf2/hadoop-default.xml
060417 190241 parsing file:/local/hadoop/conf2/mapred-default.xml

The above exceptions were repeated for many other reduce tasks (I am not sure whether this
is true for all of them).

Another strange thing I noticed in the logs:

On another machine's log, I saw:

060417 190528 parsing file:/local/hadoop/conf2/hadoop-site.xml
060417 190528 task_r_24d8k4 copy failed: task_m_ehz5q1 from node1262.foo.com/72.30.117.220:60040
java.io.IOException: timed out waiting for response
        at org.apache.hadoop.ipc.Client.call(Client.java:305)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
        at org.apache.hadoop.mapred.$Proxy2.getFile(Unknown Source)
        at org.apache.hadoop.mapred.ReduceTaskRunner.prepare(ReduceTaskRunner.java:110)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:66)
060417 190528 task_r_24d8k4 0.11523809% reduce > copy > task_m_ehz5q1@node1262.foo.com:60040
060417 190528 task_r_24d8k4 Copying task_m_epatk8 output from node1387.foo.com.
                               
which is expected. However, before this line,
I saw another copy activity in the log:

060417 103608 parsing file:/local/hadoop/conf2/hadoop-site.xml
060417 103608 task_r_a4yl3t Copying task_m_ehz5q1 output from node1262.foo.com.
060417 103608 parsing file:/local/hadoop/conf2/hadoop-default.xml

According to the web interface, the task task_r_a4yl3t does not belong to the job in
question. That is strange.

I checked a few other machines where reduce tasks ran, and I saw the same thing.

I suspect there was a conflict in job IDs. If two jobs had the same ID, then when one
closes it may also mark the other as "closed", triggering the other job's map tasks
to clean up their output prematurely.

A simple way to avoid potential job ID conflicts is to use sequential numbers.
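The suggestion above could be sketched roughly as follows. This is a hypothetical illustration, not the actual Hadoop JobTracker code; the class and method names (JobIdGenerator, nextJobId) are made up for the example. The point is that a per-tracker monotonically increasing counter can never hand out the same ID twice, whereas short random alphanumeric IDs (like task_m_ehz5q1) can collide:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: collision-free job IDs via a sequential counter.
// Not Hadoop API; names are illustrative only.
public class JobIdGenerator {
    // Monotonically increasing counter, shared by all submissions
    // to this tracker instance. AtomicInteger makes it thread-safe.
    private final AtomicInteger counter = new AtomicInteger(0);

    public String nextJobId() {
        // Produces "job_0001", "job_0002", ... IDs are never reused,
        // so one job closing can never be mistaken for another job
        // closing, and map outputs are not cleaned up prematurely.
        return String.format("job_%04d", counter.incrementAndGet());
    }
}
```

With random IDs, uniqueness is only probabilistic; with a sequential counter, it is guaranteed as long as a single process assigns the IDs.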






> MapTask removed mapout files before the reduce tasks copy them
> --------------------------------------------------------------
>
>          Key: HADOOP-147
>          URL: http://issues.apache.org/jira/browse/HADOOP-147
>      Project: Hadoop
>         Type: Bug

>     Reporter: Runping Qi

>

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

