hadoop-common-dev mailing list archives

From "Raghu Angadi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2647) dfs -put hangs
Date Fri, 01 Feb 2008 04:17:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12564624#action_12564624 ]

Raghu Angadi commented on HADOOP-2647:

On 0.16 (and trunk), the same error does not make 'dfs -put' hang. The user would see:
08/02/01 04:06:54 WARN fs.DFSClient: DataStreamer Exception: SocketTimeoutException [...]
08/02/01 04:06:54 WARN fs.DFSClient: Error Recovery for block null bad datanode[0]
put: Could not get block locations. Aborting...

This is probably OK. Let me know if we want to change the error message, or to make the
exception thrown be the one that actually caused the problem. I am still finding my way
around the new DFSOutputStream.
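
As a generic illustration of the "exception thrown to be the one that caused the problem"
idea, here is a minimal sketch. This is not the DFSClient code, and the class and method
names are hypothetical: a background worker records its last failure, and close() rethrows
that root cause instead of a generic message.

import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

public class StreamerErrorSketch {
  // Last exception seen by the hypothetical background data streamer.
  private final AtomicReference<IOException> lastException =
      new AtomicReference<>();

  // Called from the streamer thread when a send fails; keeps the
  // first failure so later errors do not overwrite the root cause.
  void recordStreamerFailure(IOException cause) {
    lastException.compareAndSet(null, cause);
  }

  // Called from the user thread; rethrows the recorded root cause
  // rather than a generic "Could not get block locations" error.
  public void close() throws IOException {
    IOException cause = lastException.get();
    if (cause != null) {
      throw cause; // surface the original failure to the caller
    }
    // ... normal close path would go here ...
  }
}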

> dfs -put hangs
> --------------
>                 Key: HADOOP-2647
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2647
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.15.1
>         Environment: LINUX
>            Reporter: lohit vijayarenu
>            Assignee: Raghu Angadi
>             Fix For: 0.16.1
>         Attachments: HADOOP-2647.patch
> We saw a case where dfs -put hung while copying a 2GB file for over 20 hours.
> When we took a look at the stack trace of the process, the main thread was waiting for
> confirmation from the namenode for complete status.
> Only 4 blocks were copied, and the 5th block was missing when we ran fsck on the
> partially transferred file.
> There are 2 problems we saw here.
> 1. The DFS client hung without a timeout when there was no response from the namenode.
> 2. In IOUtils::copyBytes(InputStream in, OutputStream out, int buffSize, boolean close),
> if there is an exception during the copy, out.close() is still called and the exception
> is not caught, which is why we see a close call in the stack trace (a sketch of this
> pattern follows the quoted stack trace below).
> When we checked the block IDs in the namenode log, the missing block had only one
> response to the namenode instead of three.
> This close state, coupled with the namenode not reporting the error back, might have
> caused the whole process to hang.
> Opening this JIRA to see if we could add checks to the two problems mentioned above.
> <stack trace of main thread>
> "main" prio=10 tid=0x0805a000 nid=0x5b53 waiting on condition [0xf7e64000..0xf7e65288]
>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1751)
>   - locked <0x77d593a0> (a org.apache.hadoop.dfs.DFSClient$DFSOutputStream)
>   at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
>   at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
>   at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:55)
>   at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:83)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:140)
>   at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:826)
>   at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:114)
>   at org.apache.hadoop.fs.FsShell.run(FsShell.java:1354)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>   at org.apache.hadoop.fs.FsShell.main(FsShell.java:1472)
> </stack trace of main thread>
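
For illustration, here is a minimal sketch of the pattern described in point 2 above.
This is not the actual Hadoop source or the committed patch; it only shows the shape of
the problem: the copy loop's exception is not caught, close() runs unconditionally in the
cleanup path, and if close() itself blocks (for example, DFSOutputStream.close() sleeping
while it waits for the namenode to confirm the file is complete), the caller hangs inside
close() and the original error is never seen, matching the stack trace above.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyBytesSketch {

  // Shape of the reported pattern: close() runs even after the copy
  // loop has already failed, so a blocking close() masks the error.
  public static void copyBytes(InputStream in, OutputStream out,
                               int buffSize, boolean close)
      throws IOException {
    byte[] buf = new byte[buffSize];
    try {
      int bytesRead = in.read(buf);
      while (bytesRead >= 0) {
        out.write(buf, 0, bytesRead);
        bytesRead = in.read(buf);
      }
    } finally {
      if (close) {
        out.close();  // may block indefinitely after a failed copy
        in.close();
      }
    }
  }

  // One possible fix (an assumption, not the committed patch): catch
  // the copy failure, do best-effort cleanup, and rethrow the
  // original exception so the caller sees the real cause.
  public static void copyBytesSafely(InputStream in, OutputStream out,
                                     int buffSize) throws IOException {
    byte[] buf = new byte[buffSize];
    try {
      int bytesRead = in.read(buf);
      while (bytesRead >= 0) {
        out.write(buf, 0, bytesRead);
        bytesRead = in.read(buf);
      }
      out.close();  // close normally only when the copy succeeded
      in.close();
    } catch (IOException e) {
      // Best-effort cleanup; suppress close errors to keep the
      // original exception as the one reported to the caller.
      try { out.close(); } catch (IOException ignored) { }
      try { in.close(); } catch (IOException ignored) { }
      throw e;
    }
  }
}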

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
