hadoop-common-dev mailing list archives

From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2391) Speculative Execution race condition with output paths
Date Wed, 06 Feb 2008 16:29:08 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566183#action_12566183 ]

Arun C Murthy commented on HADOOP-2391:
---------------------------------------

We had a hallway discussion that covered the following options:

1. Use ${mapred.output.dir}/_temp/_${taskid} as each task's working directory, as illustrated by Amareshwari's comment (a rough sketch of the commit flow follows below).

_Pros_ : 
a) Easy to implement
b) Keeps the job's output in ${mapred.output.dir}, so there are no junk files elsewhere on HDFS
that go unnoticed (we assume the user will notice anything left in the output directory *smile*).

_Cons_: This still doesn't solve the problem... the issue is that tasks might get launched as
the job is completing and go ahead and create the _${taskid} directory (see HADOOP-2759, i.e.
HDFS *create* automatically creates parent directories). The problem is further aggravated by
tasks creating side-files in the _${taskid} directory; another point to remember is that the
OutputFormat is user code...
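
Below is a rough, illustrative sketch of the commit flow option 1 implies: each attempt writes under ${mapred.output.dir}/_temp/_${taskid}, the winning attempt's files are renamed up into ${mapred.output.dir}, and killed attempts are simply deleted. The class and method names here are hypothetical, not existing framework hooks:

{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TempDirPromotion {

  /** Directory a task attempt writes into while it is still running. */
  public static Path taskTempDir(Path outputDir, String taskId) {
    return new Path(new Path(outputDir, "_temp"), "_" + taskId);
  }

  /** Called only for the attempt that wins the race. */
  public static void promoteTaskOutput(FileSystem fs, Path outputDir, String taskId)
      throws IOException {
    Path tmp = taskTempDir(outputDir, taskId);
    if (!fs.exists(tmp)) {
      return;                                  // the task produced no output
    }
    for (FileStatus f : fs.listStatus(tmp)) {
      // rename is a metadata-only operation on HDFS, no data is copied
      fs.rename(f.getPath(), new Path(outputDir, f.getPath().getName()));
    }
    fs.delete(tmp, true);                      // drop the now-empty _${taskid} dir
  }

  /** Called for killed/failed attempts, e.g. the losing speculative attempt. */
  public static void discardTaskOutput(FileSystem fs, Path outputDir, String taskId)
      throws IOException {
    fs.delete(taskTempDir(outputDir, taskId), true);
  }
}
{code}

Note that the discard step is exactly what the cons above say we cannot rely on: a straggler attempt can recreate _${taskid} after the job has already been declared complete.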

2. Put task outputs in a job-specific temporary system directory outside ${mapred.output.dir}
and then move them into ${mapred.output.dir}.

The problem with this approach is that although it is simple and solves the problems at hand,
we might be left with random files on HDFS that will never be noticed by anyone, leading to
space-creep; at the very least it requires a _tmpclean_ tool (a rough sketch of one follows below).
We also need to study how this would interact with permissions and quotas.
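
The _tmpclean_ tool option 2 would force us to ship could be as simple as the sketch below: walk a (hypothetical) system temp root and delete job directories that haven't been modified for some threshold, say a day. The path and the threshold are assumptions for illustration only:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TmpClean {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // hypothetical job-temp root; the real location would be a cluster setting
    Path tempRoot = new Path(args.length > 0 ? args[0] : "/tmp/hadoop/mapred/system");
    long cutoff = System.currentTimeMillis() - 24L * 60 * 60 * 1000;   // one day

    for (FileStatus dir : fs.listStatus(tempRoot)) {
      if (dir.isDir() && dir.getModificationTime() < cutoff) {
        System.out.println("deleting orphaned temp dir " + dir.getPath());
        fs.delete(dir.getPath(), true);
      }
    }
  }
}
{code}

Even with such a tool, orphaned directories can sit around until the threshold expires, which is part of why this option feels less clean than keeping everything under ${mapred.output.dir}.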

3. Do not declare a job as complete until all its TIPs have succeeded and all speculative
tasks are killed.

_Pros_:
a) It's probably the most _correct_ solution of the lot.
b) This will mostly work (see the _Cons_).

_Cons_:
a) Implementation is a little more involved... (we probably need to mark the job as "done,
but cleaning up"; see the sketch after this list)
b) There are corner cases: think of a job which is complete, but whose speculative tasks are
running on a TaskTracker which is _lost_ before the tasks are killed... we need to wait at least
10 minutes (the current timeout) before declaring the TaskTracker as _lost_ and the job as SUCCESS.
Even this doesn't guarantee that the _task_ is actually dead, since it could still be running
on the TaskTracker node... and creating side-files etc. (again HADOOP-2759).
c) The _lost tasktracker_ problem described above potentially adds a finite lag before jobs are
declared a success. This doesn't play well with short-running jobs which need SLAs on completion/failure
times (of course they can set the TaskTracker timeout to less than 10 minutes on their clusters;
just something to consider).
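
For concreteness, the "done, but cleaning up" logic from point (a) above could look like the sketch below. TaskInProgress here is a stand-in interface, not the real TIP class; the point is only the ordering of the two checks:

{code}
import java.util.List;

public class JobCompletionCheck {

  interface TaskInProgress {
    boolean isComplete();          // some attempt of this TIP has succeeded
    boolean hasRunningAttempts();  // e.g. a speculative attempt is still alive
  }

  enum JobState { RUNNING, CLEANING_UP, SUCCEEDED }

  public static JobState state(List<TaskInProgress> tips) {
    for (TaskInProgress tip : tips) {
      if (!tip.isComplete()) {
        return JobState.RUNNING;          // still waiting on real work
      }
    }
    for (TaskInProgress tip : tips) {
      if (tip.hasRunningAttempts()) {
        return JobState.CLEANING_UP;      // "done, but cleaning up"
      }
    }
    return JobState.SUCCEEDED;            // now it is safe to declare success
  }
}
{code}

The lost-TaskTracker corner case in (b) shows up here as hasRunningAttempts() staying true until the tracker expires, which is exactly where the 10-minute lag comes from.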

----

Overall, a combination of 1 & 3, i.e. having a single ${mapred.output.dir}/_tmp as the parent
of all tasks' temporary directories and also waiting for all speculative tasks to be killed,
might work well in most cases. For this to work we still need to fix HADOOP-2759, or at least
add a *create* api which doesn't automatically create parent directories (a client-side
approximation is sketched below).
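
On the *create* front, a client-side approximation is sketched below. It is only an approximation: the exists()/create() pair is not atomic, so a late task could still slip a directory in between the two calls; the real fix has to be an API in HDFS itself (HADOOP-2759):

{code}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StrictCreate {
  /** Create a file, but refuse to implicitly create missing parent directories. */
  public static FSDataOutputStream createNoParents(FileSystem fs, Path file)
      throws IOException {
    Path parent = file.getParent();
    if (parent != null && !fs.exists(parent)) {
      throw new IOException("Parent " + parent + " does not exist; refusing to create " + file);
    }
    return fs.create(file);                // note: not atomic with the check above
  }
}
{code}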

Thoughts?

Note: Adding a new *create* api which doesn't automatically create parent dirs is only part of
the solution; the other part is to educate users not to use the _old_ create api in their own
OutputFormats.
  

> Speculative Execution race condition with output paths
> ------------------------------------------------------
>
>                 Key: HADOOP-2391
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2391
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>         Environment: all
>            Reporter: Dennis Kubes
>            Assignee: Devaraj Das
>             Fix For: 0.16.1
>
>         Attachments: HADOOP-2391-1-20071211.patch
>
>
> I am tracking a problem where, when speculative execution is enabled, there is a race
condition when trying to read output paths from a previously completed job.  More specifically,
when reduce tasks run, their output is put into a working directory under the task name until
the task is completed.  The directory name is something like workdir/_taskid.  Upon completion
the output gets moved into workdir.  Regular tasks are checked for this move and not considered
completed until the move is made.  I have not verified it, but all indications point to speculative
tasks NOT having this same check for completion and, more importantly, removal when killed.
 So what we end up with when trying to read the output of previous tasks with speculative
execution enabled is the possibility that a previous workdir/_taskid will be present when the
output directory is read by a chained job.  Here is an error which supports my theory:
> Generator: org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open filename
/u01/hadoop/mapred/temp/generate-temp-1197104928603/_task_200712080949_0005_r_000014_1
>         at org.apache.hadoop.dfs.NameNode.open(NameNode.java:234)
>         at sun.reflect.GeneratedMethodAccessor64.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:389)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:644)
>         at org.apache.hadoop.ipc.Client.call(Client.java:507)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:186)
>         at org.apache.hadoop.dfs.$Proxy0.open(Unknown Source)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>         at org.apache.hadoop.dfs.$Proxy0.open(Unknown Source)
>         at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:839)
>         at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:831)
>         at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:263)
>         at org.apache.hadoop.dfs.DistributedFileSystem.open(DistributedFileSystem.java:114)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1356)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1349)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1344)
>         at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:87)
>         at org.apache.nutch.crawl.Generator.generate(Generator.java:429)
>         at org.apache.nutch.crawl.Generator.run(Generator.java:563)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:54)
>         at org.apache.nutch.crawl.Generator.main(Generator.java:526)
> I will continue to research this and post as I make progress on tracking down this bug.
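
Until the race is fixed, a defensive workaround on the reading side is to skip leftover underscore-prefixed working directories (such as _taskid) when listing a previous job's output, e.g. with a filter like the sketch below passed to FileSystem.listStatus(outputDir, filter) before opening the readers yourself:

{code}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Skip working/hidden entries (e.g. leftover _${taskid} dirs) when
// reading a previous job's output directory.
public class HiddenPathFilter implements PathFilter {
  public boolean accept(Path path) {
    String name = path.getName();
    return !name.startsWith("_") && !name.startsWith(".");
  }
}
{code}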

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

