hadoop-hdfs-issues mailing list archives

From "Eric Zhiqiang Ma (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-7180) NFSv3 gateway frequently gets stuck
Date Thu, 02 Oct 2014 03:13:33 GMT

     [ https://issues.apache.org/jira/browse/HDFS-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Zhiqiang Ma updated HDFS-7180:
-----------------------------------
    Description: 
We are using Hadoop 2.5.0 (HDFS only) and start and mount the NFSv3 gateway on one node in
the cluster so that users can upload data with rsync.

However, we find that the NFSv3 daemon frequently gets stuck while HDFS itself keeps working
well (`hdfs dfs -ls` and similar commands respond normally). The latest hang happened after
around one day of running and several hundred GB of data uploaded.

The NFSv3 daemon is started on one node, and the NFS share is mounted on that same node.
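
For reference, the gateway is started and mounted roughly as in the HDFS NFS gateway documentation; the mount point and rsync paths below are only illustrative, not our exact layout:

hadoop-daemon.sh start portmap   # or use the system rpcbind/portmap service instead
hadoop-daemon.sh start nfs3      # start the NFSv3 gateway daemon on the gateway node
mount -t nfs -o vers=3,proto=tcp,nolock localhost:/ /mnt/hdfs
rsync -av /local/data/ /mnt/hdfs/uploads/   # users copy data in through the mount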

From the node where the NFS is mounted:

`dmesg` shows lines like this:

[1859245.368108] nfs: server localhost not responding, still trying
[1859245.368111] nfs: server localhost not responding, still trying
[1859245.368115] nfs: server localhost not responding, still trying
[1859245.368119] nfs: server localhost not responding, still trying
[1859245.368123] nfs: server localhost not responding, still trying
[1859245.368127] nfs: server localhost not responding, still trying
[1859245.368131] nfs: server localhost not responding, still trying
[1859245.368135] nfs: server localhost not responding, still trying
[1859245.368138] nfs: server localhost not responding, still trying
[1859245.368142] nfs: server localhost not responding, still trying
[1859245.368146] nfs: server localhost not responding, still trying
[1859245.368150] nfs: server localhost not responding, still trying
[1859245.368153] nfs: server localhost not responding, still trying

The mounted directory cannot be listed with `ls`, and `df -hT` gets stuck too.
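
For what it is worth, the NFS-related services on that node can still be queried; these are the standard sanity checks from the NFS gateway documentation, not output captured during the hang:

rpcinfo -p localhost     # should list portmapper, mountd and nfs registrations
showmount -e localhost   # should show the exported HDFS root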

The latest lines from the nfs3 log in the hadoop logs directory:

2014-10-02 05:43:20,452 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated user map size: 35
2014-10-02 05:43:20,461 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated group map size: 54
2014-10-02 05:44:40,374 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:44:40,732 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:46:06,535 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:46:26,075 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:47:56,420 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:48:56,477 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:51:46,750 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:53:23,809 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:53:24,508 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:55:57,334 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:57:07,428 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:58:32,609 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Update cache now
2014-10-02 05:58:32,610 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Not doing static UID/GID mapping because '/etc/nfs.map' does not exist.
2014-10-02 05:58:32,620 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated user map size: 35
2014-10-02 05:58:32,628 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated group map size: 54
2014-10-02 06:01:32,098 WARN org.apache.hadoop.hdfs.DFSClient: Slow ReadProcessor read fields took 60062ms (threshold=30000ms); ack: seqno: -2 status: SUCCESS status: ERROR downstreamAckTimeNanos: 0, targets: [10.0.3.172:50010, 10.0.3.176:50010]
2014-10-02 06:01:32,099 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643
java.io.IOException: Bad response ERROR for block BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 from datanode 10.0.3.176:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:828)
2014-10-02 06:07:00,368 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 in pipeline 10.0.3.172:50010, 10.0.3.176:50010: bad datanode 10.0.3.176:50010

The logs seem to suggest that 10.0.3.176 is bad. However, according to `hdfs dfsadmin -report`,
all nodes in the cluster appear to be working.
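
To double-check the suspect datanode beyond `hdfs dfsadmin -report`, something along these lines could be run (the datanode log path is a guess; the block ID is the one from the log above):

hdfs dfsadmin -report | grep -A 10 '10.0.3.176'               # per-datanode details for the suspect node
hdfs fsck / -files -blocks -locations | grep blk_1074363564   # locate replicas of the affected block (can be slow on a large namespace)
ssh 10.0.3.176 "grep blk_1074363564 /var/log/hadoop/*datanode*.log"   # look for errors on the datanode itself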

Any help will be appreciated. Thanks in advance.

> NFSv3 gateway frequently gets stuck
> -----------------------------------
>
>                 Key: HDFS-7180
>                 URL: https://issues.apache.org/jira/browse/HDFS-7180
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: nfs
>    Affects Versions: 2.5.0
>         Environment: Linux, Fedora 19 x86-64
>            Reporter: Eric Zhiqiang Ma
>            Priority: Critical
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
