hadoop-mapreduce-issues mailing list archives

From "Peter Bacsko (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded
Date Mon, 27 Nov 2017 15:08:00 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated MAPREDUCE-7015:
------------------------------------
    Description: 
There appears to be a race condition inside the JobHistoryServer (JHS). In our build environment, {{TestMRJobClient.testJobClient()}} failed with this exception:

{noformat}
java.io.FileNotFoundException: File does not exist: hdfs://localhost:32836/tmp/hadoop-yarn/staging/history/done_intermediate/jenkins/job_1509975084722_0001_conf.xml
	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:340)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2123)
	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2092)
	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2068)
	at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:460)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.mapreduce.TestMRJobClient.runTool(TestMRJobClient.java:94)
	at org.apache.hadoop.mapreduce.TestMRJobClient.testConfig(TestMRJobClient.java:551)
	at org.apache.hadoop.mapreduce.TestMRJobClient.testJobClient(TestMRJobClient.java:167)
{noformat}

Root cause:
1. MapReduce job completes
2. CLI calls {{cluster.getJob(jobid)}}
3. The job is finished and the client side gets redirected to JHS
4. The job data is missing from {{CachedHistoryStorage}} so JHS tries to find the job
5. First it scans the intermediate directory and finds the job
6. The call to {{moveToDone()}} is scheduled for execution on a separate thread inside {{moveToDoneExecutor}}, but does not get the chance to run immediately
7. The RPC invocation returns with the path pointing to {{/tmp/hadoop-yarn/staging/history/done_intermediate}}
8. The call to {{moveToDone()}} completes, which moves the contents of {{done_intermediate}} to {{done}}
9. Hadoop CLI tries to download the config file from {{done_intermediate}}, but it is no longer there

Usually the {{moveToDone()}} call scheduled in step #6 is fast enough to complete before the RPC returns in step #7, but sometimes it lags behind, which causes this race condition.
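
The interleaving in steps 6 to 9 can be sketched outside Hadoop as a small standalone Java program. This is only an illustration under stated assumptions, not the JHS code: it uses a local temp directory instead of HDFS, the names {{MoveToDoneRaceSketch}}, {{reproduceRace()}} and {{job_0001_conf.xml}} are made up for the example, and a latch forces the delayed move that in the real server happens only occasionally.

```java
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the JHS behaviour described above; not Hadoop code.
final class MoveToDoneRaceSketch {

    /** Returns true if the client found nothing at the path it was handed. */
    static boolean reproduceRace() throws Exception {
        // Local directories standing in for the HDFS history directories.
        Path root = Files.createTempDirectory("jhs-race");
        Path intermediate = Files.createDirectory(root.resolve("done_intermediate"));
        Path done = Files.createDirectory(root.resolve("done"));
        Path conf = Files.createFile(intermediate.resolve("job_0001_conf.xml"));

        ExecutorService moveToDoneExecutor = Executors.newSingleThreadExecutor();
        CountDownLatch rpcReturned = new CountDownLatch(1);

        // Step 6: the move is scheduled on a separate thread. The latch keeps it
        // from running until the "RPC" has returned (assumption: forced delay).
        Future<Void> move = moveToDoneExecutor.submit(() -> {
            rpcReturned.await();
            Files.move(conf, done.resolve(conf.getFileName()));
            return null;
        });

        // Step 7: the response carries a path inside done_intermediate.
        Path pathReturnedToClient = conf;
        rpcReturned.countDown();

        // Step 8: the move completes before the client uses the path.
        move.get();
        moveToDoneExecutor.shutdown();

        // Step 9: the client reads from the path it was given, which is now stale.
        try {
            Files.readAllBytes(pathReturnedToClient);
            return false;
        } catch (NoSuchFileException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("stale path observed: " + reproduceRace());
    }
}
```

Because the latch pins the ordering, {{reproduceRace()}} should return true on every run. In the real JHS the ordering depends on thread scheduling, which would explain why the test fails only occasionally.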


> Possible race condition in JHS if the job is not loaded
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-7015
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7015
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org

