From "Bryan Pendleton (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-35) Files missing chunks can cause mapred runs to get stuck
Date Mon, 13 Feb 2006 23:26:43 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-35?page=comments#action_12366260 ] 

Bryan Pendleton commented on HADOOP-35:
---------------------------------------

Alright, well, for diagnostic purposes, I added a "health" stat to each file in -ls. The getFileHealth()
function I define could probably be moved into a utility class somewhere... especially if
it were to be called periodically during a MapReduce run. The patch also refactors how the
"Configuration" instance is used in the DFSShell.

Also, I had to add a try/catch around the call to getHints(), because for several of my existing
files I get the following stack trace when calling it:

Exception in thread "main" java.io.IOException: java.lang.NullPointerException
        at org.apache.hadoop.ipc.Client.call(Client.java:301)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
        at org.apache.hadoop.dfs.$Proxy0.getHints(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient.getHints(DFSClient.java:69)
        at org.apache.hadoop.dfs.DistributedFileSystem.getFileCacheHints(DistributedFileSystem.java:63)
        at org.apache.hadoop.dfs.DFSShell.getFileHealth(DFSShell.java:43)
        at org.apache.hadoop.dfs.DFSShell.ls(DFSShell.java:117)
        at org.apache.hadoop.dfs.DFSShell.main(DFSShell.java:283)
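
The guard itself is trivial; something along these lines (again a sketch, not the exact
patch - the IPC layer surfaces the namenode-side NullPointerException as an IOException,
so catching it and reporting worst-case health keeps -ls from dying):

    // Wraps the hypothetical getFileHealth() sketched above.
    public static double safeFileHealth(FileSystem fs, File f) {
        try {
            return getFileHealth(fs, f);
        } catch (IOException e) {
            return 0.0;   // hints unavailable; report the file as unhealthy
        }
    }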

> Files missing chunks can cause mapred runs to get stuck
> -------------------------------------------------------
>
>          Key: HADOOP-35
>          URL: http://issues.apache.org/jira/browse/HADOOP-35
>      Project: Hadoop
>         Type: Bug
>   Components: dfs
>  Environment: ~20 datanode DFS cluster
>     Reporter: Bryan Pendleton

>
> I've now several times run into a problem where a large run gets stalled as a result
> of a missing data block. The latest was a stall in the Summer - i.e., the data might've
> all been there, but it was impossible to proceed because the CRC file was missing a
> block. It would be nice to:
> 1) Have a "health check" running on a MapReduce job. If any data isn't available, emit
> periodic warnings, and maybe have a timeout in case the data never comes back. Such
> warnings *should* specify which file(s) are affected by the missing blocks.
> 2) Have a utility, possibly part of the existing dfs utility, which can check for dfs
> files with unlocatable blocks. Possibly, even show a 'health' of a file - i.e., what
> percentage of its blocks are currently at the desired replication level. Currently,
> there's no way that I know of to find out if a file in DFS is going to be unreadable.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

