Return-Path: X-Original-To: apmail-hadoop-hdfs-issues-archive@minotaur.apache.org Delivered-To: apmail-hadoop-hdfs-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 2284417E3E for ; Thu, 2 Oct 2014 03:07:35 +0000 (UTC) Received: (qmail 16599 invoked by uid 500); 2 Oct 2014 03:07:34 -0000 Delivered-To: apmail-hadoop-hdfs-issues-archive@hadoop.apache.org Received: (qmail 16471 invoked by uid 500); 2 Oct 2014 03:07:33 -0000 Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: hdfs-issues@hadoop.apache.org Delivered-To: mailing list hdfs-issues@hadoop.apache.org Received: (qmail 16458 invoked by uid 99); 2 Oct 2014 03:07:33 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 02 Oct 2014 03:07:33 +0000 Date: Thu, 2 Oct 2014 03:07:33 +0000 (UTC) From: "Eric Zhiqiang Ma (JIRA)" To: hdfs-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Created] (HDFS-7180) NFSv3 gateway frequently gets stuck MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 Eric Zhiqiang Ma created HDFS-7180: -------------------------------------- Summary: NFSv3 gateway frequently gets stuck Key: HDFS-7180 URL: https://issues.apache.org/jira/browse/HDFS-7180 Project: Hadoop HDFS Issue Type: Bug Components: nfs Affects Versions: 2.5.0 Environment: Linux, Fedora 19 x86-64 Reporter: Eric Zhiqiang Ma Priority: Critical We are using Hadoop 2.5.0 (HDFS only) and start and mount the NFSv3 gateway on one node in the cluster to let users upload data with rsync. However, we find the NFSv3 daemon seems frequently get stuck while the HDFS seems working well. (hdfds dfs -ls and etc. works just well). The last stuck we found is after around 1 day running and several hundreds GBs of data uploaded. The NFSv3 daemon is started on one node and on the same node the NFS is mounted. >From the node where the NFS is mounted: dmsg shows like this: [1859245.368108] nfs: server localhost not responding, still trying [1859245.368111] nfs: server localhost not responding, still trying [1859245.368115] nfs: server localhost not responding, still trying [1859245.368119] nfs: server localhost not responding, still trying [1859245.368123] nfs: server localhost not responding, still trying [1859245.368127] nfs: server localhost not responding, still trying [1859245.368131] nfs: server localhost not responding, still trying [1859245.368135] nfs: server localhost not responding, still trying [1859245.368138] nfs: server localhost not responding, still trying [1859245.368142] nfs: server localhost not responding, still trying [1859245.368146] nfs: server localhost not responding, still trying [1859245.368150] nfs: server localhost not responding, still trying [1859245.368153] nfs: server localhost not responding, still trying The mounted directory can not be `ls` and `df -hT` gets stuck too. The latest lines from the nfs3 log in the hadoop logs directory: 2014-10-02 05:43:20,452 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated user map size: 35 2014-10-02 05:43:20,461 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated group map size: 54 2014-10-02 05:44:40,374 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC 2014-10-02 05:44:40,732 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC 2014-10-02 05:46:06,535 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC 2014-10-02 05:46:26,075 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC 2014-10-02 05:47:56,420 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC 2014-10-02 05:48:56,477 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC 2014-10-02 05:51:46,750 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC 2014-10-02 05:53:23,809 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC 2014-10-02 05:53:24,508 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC 2014-10-02 05:55:57,334 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC 2014-10-02 05:57:07,428 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC 2014-10-02 05:58:32,609 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Update cache now 2014-10-02 05:58:32,610 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Not doing static UID/GID mapping because '/etc/nfs.map' does not exist. 2014-10-02 05:58:32,620 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated user map size: 35 2014-10-02 05:58:32,628 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated group map size: 54 2014-10-02 06:01:32,098 WARN org.apache.hadoop.hdfs.DFSClient: Slow ReadProcessor read fields took 60062ms (threshold=30000ms); ack: seqno: -2 status: SUCCESS status: ERROR downstreamAckTimeNanos: 0, targets: [10.0.3.172:50010, 10.0.3.176:50010] 2014-10-02 06:01:32,099 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 java.io.IOException: Bad response ERROR for block BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 from datanode 10.0.3.176:50010 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:828) 2014-10-02 06:07:00,368 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 in pipeline 10.0.3.172:50010, 10.0.3.176:50010: bad datanode 10.0.3.176:50010 Any help will be appreciated. Thanks in advance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)