Date: Wed, 6 May 2015 03:26:47 +0000 (UTC)
From: "Allen Wittenauer (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

     [ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer updated HDFS-7314:
-----------------------------------
    Labels: BB2015-05-TBR  (was: )

> Aborted DFSClient's impact on long running service like YARN
> ------------------------------------------------------------
>
>                 Key: HDFS-7314
>                 URL: https://issues.apache.org/jira/browse/HDFS-7314
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>              Labels: BB2015-05-TBR
>         Attachments: HDFS-7314-2.patch, HDFS-7314-3.patch, HDFS-7314-4.patch, HDFS-7314-5.patch,
>                      HDFS-7314-6.patch, HDFS-7314-7.patch, HDFS-7314.patch
>
>
> This happened in a YARN NodeManager scenario, but it could happen to any long-running service that uses a cached instance of DistributedFileSystem.
> 1. The active NN was under heavy load and became unavailable for 10 minutes; any DFSClient request got a ConnectTimeoutException.
> 2. The YARN NodeManager uses DFSClient for certain write operations, such as log aggregation or the shared cache in YARN-1492. The renewLease RPC of the DFSClient used by the YARN NM got a ConnectTimeoutException.
> {noformat}
> 2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds. Aborting ...
> {noformat}
> 3. After the DFSClient is in the Aborted state, the YARN NM can't use that cached instance of DistributedFileSystem.
> {noformat}
> 2014-10-29 20:26:23,991 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc...
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
>         at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can make YARN or DFSClient more tolerant of temporary NN unavailability. Given that the call stack is YARN -> DistributedFileSystem -> DFSClient, this can be addressed at different layers.
> * YARN closes the DistributedFileSystem object when it receives some well-defined exception. The next HDFS call will then create a new instance of DistributedFileSystem. We would have to fix all the places in YARN, and other HDFS applications would need to address this as well.
> * DistributedFileSystem detects an Aborted DFSClient and creates a new instance of DFSClient. We would need to fix all the places where DistributedFileSystem calls DFSClient.
> * After DFSClient gets into the Aborted state, it doesn't have to reject all requests; instead, it can retry. If the NN is available again, it can transition back to a healthy state.
> Comments?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
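
For illustration only, here is a minimal, self-contained sketch of the second option in the quoted description (the file system layer detects an aborted client and creates a new one). It does not use the Hadoop API and is not taken from any of the attached patches: HdfsClient, RecoveringFileSystem, AbortableClient and RecoveryDemo are hypothetical names, and detecting the aborted state by matching the "Filesystem closed" message from the stack trace is an assumption made for the sketch.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

/** Hypothetical stand-in for DFSClient: only the one call this sketch needs. */
interface HdfsClient {
    String getFileInfo(String path) throws IOException;
}

/**
 * Hypothetical stand-in for DistributedFileSystem. When the current client
 * reports that it is closed (aborted), it is replaced once and the call is
 * repeated on the new client.
 */
final class RecoveringFileSystem {
    private final Supplier<HdfsClient> clientFactory;
    private HdfsClient client;
    private int clientsCreated = 0;

    RecoveringFileSystem(Supplier<HdfsClient> clientFactory) {
        this.clientFactory = clientFactory;
        this.client = newClient();
    }

    private HdfsClient newClient() {
        clientsCreated++;
        return clientFactory.get();
    }

    synchronized int clientsCreated() {
        return clientsCreated;
    }

    synchronized String getFileStatus(String path) throws IOException {
        try {
            return client.getFileInfo(path);
        } catch (IOException e) {
            // Assumption: an aborted client is recognized by this message.
            if (!"Filesystem closed".equals(e.getMessage())) {
                throw e;
            }
            client = newClient();
            // Single retry; a second failure propagates to the caller.
            return client.getFileInfo(path);
        }
    }
}

/** Test double: rejects every call with "Filesystem closed" once aborted. */
final class AbortableClient implements HdfsClient {
    private volatile boolean aborted = false;

    void abort() {
        aborted = true;
    }

    @Override
    public String getFileInfo(String path) throws IOException {
        if (aborted) {
            throw new IOException("Filesystem closed");
        }
        return "status:" + path;
    }
}

final class RecoveryDemo {
    public static void main(String[] args) throws IOException {
        AbortableClient first = new AbortableClient();
        Deque<HdfsClient> pool = new ArrayDeque<>();
        pool.add(first);
        pool.add(new AbortableClient());

        RecoveringFileSystem fs = new RecoveringFileSystem(pool::poll);
        System.out.println(fs.getFileStatus("/tmp/a"));
        first.abort(); // simulate the lease renewer aborting the client
        System.out.println(fs.getFileStatus("/tmp/b"));
        System.out.println("clients created: " + fs.clientsCreated());
    }
}
```

The sketch retries only once so that a NameNode that is still unavailable surfaces as an error to the caller rather than looping; a real change would also have to cover every call path from the file system object into the client, as the description notes.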