hadoop-hdfs-issues mailing list archives

From "Ming Ma (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
Date Thu, 06 Nov 2014 20:59:34 GMT

     [ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated HDFS-7314:
--------------------------
    Attachment: HDFS-7314-3.patch

Thanks, Colin. Here is the updated patch.

1. It turns out {{closeClient}} isn't necessary: when {{LeaseRenewer}} has {{DFSClient}}
close all open files, the last file's call into {{LeaseRenewer}}'s {{closeFile}} removes
the {{DFSClient}} object. I have added verification of this to the unit tests.
2. The logging message is somewhat misleading. elapsed is measured from the start time of the
renewLease RPC call, so the log will say "the lease couldn't be renewed for 30 seconds" even
though the RPC retries could take several minutes. We can leave that for another jira.
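
To illustrate point 1, here is a minimal sketch of the bookkeeping, not the actual {{LeaseRenewer}} code. The class {{RenewerSketch}} and its methods {{addFile}}, {{abort}} and {{isTracking}} are hypothetical names made up for this example; only the idea that the last {{closeFile}} call removes the client comes from the patch discussion.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the renewer's bookkeeping; not the Hadoop class.
class RenewerSketch {
    // client name -> ids of the files that client currently has open for write
    private final Map<String, Set<Long>> openFilesByClient = new HashMap<>();

    void addFile(String client, long fileId) {
        openFilesByClient.computeIfAbsent(client, k -> new HashSet<>()).add(fileId);
    }

    // Closing a file drops it; closing the last file also drops the client,
    // so no separate closeClient step is needed.
    void closeFile(String client, long fileId) {
        Set<Long> files = openFilesByClient.get(client);
        if (files == null) {
            return;
        }
        files.remove(fileId);
        if (files.isEmpty()) {
            openFilesByClient.remove(client);
        }
    }

    // Abort path: close every open file of the client through closeFile.
    // Iterates over a copy because closeFile mutates the underlying set.
    void abort(String client) {
        Set<Long> files = openFilesByClient.get(client);
        if (files == null) {
            return;
        }
        for (Long id : new HashSet<>(files)) {
            closeFile(client, id);
        }
    }

    boolean isTracking(String client) {
        return openFilesByClient.containsKey(client);
    }
}
```

In this sketch, a unit test would open a few files, call {{abort}}, and then check that {{isTracking}} returns false for that client.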

> Aborted DFSClient's impact on long running service like YARN
> ------------------------------------------------------------
>
>                 Key: HDFS-7314
>                 URL: https://issues.apache.org/jira/browse/HDFS-7314
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: HDFS-7314-2.patch, HDFS-7314-3.patch, HDFS-7314.patch
>
>
> This happened in a YARN nodemanager scenario, but it could happen to any long running service
that uses a cached instance of DistributedFileSystem.
> 1. The active NN was under heavy load and became unavailable for 10 minutes; any DFSClient
request got ConnectTimeoutException.
> 2. The YARN nodemanager uses DFSClient for certain write operations such as the log aggregator
or the shared cache in YARN-1492. The renewLease RPC of the DFSClient used by the YARN NM got ConnectTimeoutException.
> {noformat}
> 2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to renew lease
for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds.  Aborting ...
> {noformat}
> 3. After DFSClient is in Aborted state, YARN NM can't use that cached instance of DistributedFileSystem.
> {noformat}
> 2014-10-29 20:26:23,991 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Failed to download rsrc...
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
>         at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can make YARN or DFSClient more tolerant of temporary NN unavailability. Given that the
call stack is YARN -> DistributedFileSystem -> DFSClient, this can be addressed at different
layers.
> * YARN closes the DistributedFileSystem object when it receives some well-defined exception.
The next HDFS call will then create a new instance of DistributedFileSystem. We would have to fix
all the places in YARN, and other HDFS applications would need to address this as well.
> * DistributedFileSystem detects an Aborted DFSClient and creates a new instance of DFSClient.
We would need to fix all the places where DistributedFileSystem calls DFSClient.
> * After DFSClient gets into the Aborted state, it doesn't have to reject all requests; instead
it can retry. If the NN is available again, it can transition back to a healthy state.
> Comments?
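
As a rough illustration of the second option in the description (the file system layer detects an Aborted client and creates a new one), here is a self-contained sketch. It does not use the Hadoop API: {{ClientSketch}}, {{AbortableClient}} and {{SelfHealingFs}} are hypothetical names invented for this example, and the real DFSClient and DistributedFileSystem are far larger. The only detail taken from the report is the "Filesystem closed" IOException.

```java
import java.io.IOException;
import java.util.function.Supplier;

// Hypothetical minimal client interface; not the real DFSClient API.
interface ClientSketch {
    boolean isOpen();
    String getFileInfo(String path) throws IOException;
}

// Hypothetical client that rejects every request once aborted,
// mirroring the "Filesystem closed" failure in the stack trace.
class AbortableClient implements ClientSketch {
    private volatile boolean open = true;

    void abort() {
        open = false;
    }

    @Override
    public boolean isOpen() {
        return open;
    }

    @Override
    public String getFileInfo(String path) throws IOException {
        if (!open) {
            throw new IOException("Filesystem closed");
        }
        return "status:" + path;
    }
}

// Hypothetical wrapper playing the role of the file system layer.
class SelfHealingFs {
    private final Supplier<ClientSketch> factory;
    private ClientSketch client;

    SelfHealingFs(Supplier<ClientSketch> factory) {
        this.factory = factory;
        this.client = factory.get();
    }

    // Every call site goes through this accessor, which replaces an
    // aborted client with a fresh instance before the call is made.
    private synchronized ClientSketch client() {
        if (!client.isOpen()) {
            client = factory.get();
        }
        return client;
    }

    String getFileStatus(String path) throws IOException {
        return client().getFileInfo(path);
    }
}
```

Routing every call through one accessor is what keeps the number of places to fix small in this sketch. It does not cover a client that is aborted between the check and the call, which a real change would have to consider.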



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
