hadoop-yarn-issues mailing list archives
From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3832) Resource Localization fails on a cluster due to existing cache directories
Date Tue, 23 Jun 2015 16:11:01 GMT

    [ https://issues.apache.org/jira/browse/YARN-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597873#comment-14597873 ]

Jason Lowe commented on YARN-3832:
----------------------------------

Ah, I think that might be the clue as to what went wrong.  If the NM recreated the state store
on startup, then ResourceLocalizationService will try to clean up the localized resources to
prevent them from getting out of sync with the state store.  Unfortunately the code does this:
{code}
  private void cleanUpLocalDirs(FileContext lfs, DeletionService del) {
    for (String localDir : dirsHandler.getLocalDirs()) {
      cleanUpLocalDir(lfs, del, localDir);
    }
  }
{code}

It should be calling dirsHandler.getLocalDirsForCleanup, since getLocalDirs will not include
any disks that are full.  Since the disk was too full, it probably wasn't in the list of local
dirs, and therefore we avoided cleaning up the localized resources on that disk.  Later, when
the disk became good again, the NM tried to use it, but by that point the state store and the
localized resources on that disk were out of sync, and new localizations could collide with old ones.
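
For illustration, a minimal sketch of the change being suggested, assuming getLocalDirsForCleanup on the dirs handler returns the local dirs including disks currently marked full (as described above); this is just a sketch, not a committed patch:
{code}
  // Sketch only: iterate over the cleanup list, which also covers disks
  // currently marked full, so stale localized resources on a full disk
  // are still removed when the state store is recreated on startup.
  private void cleanUpLocalDirs(FileContext lfs, DeletionService del) {
    for (String localDir : dirsHandler.getLocalDirsForCleanup()) {
      cleanUpLocalDir(lfs, del, localDir);
    }
  }
{code}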

> Resource Localization fails on a cluster due to existing cache directories
> --------------------------------------------------------------------------
>
>                 Key: YARN-3832
>                 URL: https://issues.apache.org/jira/browse/YARN-3832
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.0
>            Reporter: Ranga Swamy
>            Assignee: Brahma Reddy Battula
>
>  *We have found that resource localization fails on a cluster with the following error.* 
>  
> We hit this error in the hadoop-2.7.0 release, even though it was reportedly fixed in 2.6.0 (YARN-2624).
> {noformat}
> Application application_1434703279149_0057 failed 2 times due to AM Container for appattempt_1434703279149_0057_000002 exited with exitCode: -1000
> For more detailed output, check application tracking page:http://S0559LDPag68:45020/cluster/app/application_1434703279149_0057Then, click on links to logs of each attempt.
> Diagnostics: Rename cannot overwrite non empty destination directory /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
> java.io.IOException: Rename cannot overwrite non empty destination directory /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
> at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:735)
> at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:244)
> at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678)
> at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Failing this attempt. Failing the application.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
