hadoop-hdfs-issues mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
Date Fri, 07 Nov 2014 08:01:35 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201763#comment-14201763 ]

Hadoop QA commented on HDFS-7314:
---------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12680087/HDFS-7314-4.patch
  against trunk revision ba0a42c.

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 1 new or modified test files.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

    {color:red}-1 core tests{color}.  The following test timeouts occurred in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.server.namenode.TestFsck
org.apache.hadoop.hdfs.server.namenode.TestDeleteRace

    {color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8686//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8686//console

This message is automatically generated.

> Aborted DFSClient's impact on long running service like YARN
> ------------------------------------------------------------
>
>                 Key: HDFS-7314
>                 URL: https://issues.apache.org/jira/browse/HDFS-7314
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: HDFS-7314-2.patch, HDFS-7314-3.patch, HDFS-7314-4.patch, HDFS-7314.patch
>
>
> It happened in a YARN NodeManager scenario, but it could happen to any long-running service that uses a cached instance of DistributedFileSystem.
> 1. The active NN was under heavy load and became unavailable for 10 minutes; any DFSClient request got a ConnectTimeoutException.
> 2. The YARN NodeManager uses DFSClient for certain write operations, such as the log aggregator or the shared cache in YARN-1492. The renewLease RPC of the DFSClient used by the YARN NM got a ConnectTimeoutException.
> {noformat}
> 2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds.  Aborting ...
> {noformat}
> 3. After DFSClient is in Aborted state, YARN NM can't use that cached instance of DistributedFileSystem.
> {noformat}
> 2014-10-29 20:26:23,991 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc...
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
>         at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can make YARN or DFSClient more tolerant of temporary NN unavailability. Given that the call stack is YARN -> DistributedFileSystem -> DFSClient, this can be addressed at different layers.
> * YARN closes the DistributedFileSystem object when it receives some well-defined exception. The next HDFS call then creates a new instance of DistributedFileSystem. We would have to fix all the places in YARN, and other HDFS applications would need to address this as well.
> * DistributedFileSystem detects the Aborted DFSClient and creates a new instance of DFSClient. We would need to fix all the places where DistributedFileSystem calls DFSClient.
> * After DFSClient gets into the Aborted state, it doesn't have to reject all requests; instead, it can retry. If the NN is available again, it can transition back to a healthy state.
> Comments?
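
For illustration only, the second option in the description (detect the aborted client and create a new one) could look roughly like the sketch below. It does not use any Hadoop classes and is not taken from the attached patches: `Client`, `RecreatingClientHolder`, and the `Supplier` factory are hypothetical stand-ins. It assumes the aborted client surfaces as an IOException with the message "Filesystem closed", as in the stack trace quoted above.

```java
import java.io.IOException;
import java.util.function.Supplier;

/** Hypothetical stand-in for a client that can end up permanently closed. */
interface Client {
    String getFileInfo(String path) throws IOException;
}

/**
 * Hypothetical holder that swaps in a fresh client when the cached one
 * reports that it has been closed, then retries the call once.
 */
final class RecreatingClientHolder {
    private final Supplier<Client> factory;
    private Client client;
    private int recreations = 0;

    RecreatingClientHolder(Supplier<Client> factory) {
        this.factory = factory;
        this.client = factory.get();
    }

    synchronized String getFileInfo(String path) throws IOException {
        try {
            return client.getFileInfo(path);
        } catch (IOException e) {
            // Assumption: the closed state is recognized by its message.
            // Matching on a message is fragile; a dedicated exception type
            // would be a more robust signal.
            if (!"Filesystem closed".equals(e.getMessage())) {
                throw e;
            }
            client = factory.get();   // replace the aborted client
            recreations++;
            return client.getFileInfo(path);   // single retry, errors propagate
        }
    }

    synchronized int recreations() {
        return recreations;
    }
}
```

The retry is deliberately limited to one attempt, so a NN that is still unavailable fails the call instead of looping. Every call site that goes through the holder picks up the replacement client, which is the same "fix all the places" cost the description mentions for this layer.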



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
