hadoop-common-dev mailing list archives

From "Raghu Angadi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5286) DFS client blocked for a long time reading blocks of a file on the JobTracker
Date Thu, 26 Feb 2009 20:21:01 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677130#action_12677130 ]

Raghu Angadi commented on HADOOP-5286:
--------------------------------------

Thanks Hemanth. So it is the very slow reading that is causing the problem.

There are two issues here:

* First, the block has only one replica while it is being written.
   ** Hairong mentioned earlier in the jira that the root cause of this is fixed in HADOOP-5134.

* Unfortunately, the only replica is on a datanode that is extremely slow and hardly accessible.
I will address this.
   ** Some DataNodes are expected to be flaky; that is a normal condition.
   ** This is probably also the reason why replication of the block took a very long time.

As I understand it, the application (the JobTracker in this case) is very sensitive to this delay.
I think for a good design, this slow reading should be handled in the application. There is
no QoS for Hadoop file systems. Even if HDFS had some kind of option, the app could face the same
problem with LocalFS. If a datanode slowly trickles data, and that is the only replica
left, the options for a filesystem are limited. In that sense I am not sure this is a real
bug.
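
To make that concrete, below is a rough, hypothetical sketch (not JobTracker code; the class name,
deadline, and recovery policy are made up for illustration) of what handling it in the application
could look like: the caller bounds how long it is willing to wait for the split file instead of
relying on the filesystem to guarantee latency.

{code}
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class BoundedSplitRead {
  /** Read the whole file, but give up after deadlineSecs instead of blocking for hours. */
  public static byte[] readWithDeadline(final Path splitFile,
                                        final Configuration conf,
                                        long deadlineSecs) throws Exception {
    ExecutorService exec = Executors.newSingleThreadExecutor();
    Future<byte[]> result = exec.submit(new Callable<byte[]>() {
      public byte[] call() throws Exception {
        FileSystem fs = splitFile.getFileSystem(conf);
        FSDataInputStream in = fs.open(splitFile);
        try {
          byte[] buf = new byte[(int) fs.getFileStatus(splitFile).getLen()];
          in.readFully(0, buf);   // this is the read that can trickle for a long time
          return buf;
        } finally {
          IOUtils.closeStream(in);
        }
      }
    });
    try {
      return result.get(deadlineSecs, TimeUnit.SECONDS);
    } catch (TimeoutException te) {
      // Interrupting may not unblock a socket read immediately, but the caller is no
      // longer stuck and can decide how to recover (retry, fail the job, etc.).
      result.cancel(true);
      throw te;
    } finally {
      exec.shutdownNow();
    }
  }
}
{code}

What to do on timeout is the application's call; the point is just that the deadline lives in
the app, not in HDFS.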

Do you think a critical service like JobTracker should be sensitive to flaky datanode delays?
It probably gets less likely as bugs like HADOOP-5134 are fixed, but delays could occur for
various other reasons.

We could reduce the timeout, number of retries, etc. in DFSClient while reading, but I don't think
that addresses the basic issue.
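
For example, something along the following lines tightens the client-side knobs (the key names are
the ones I think DFSClient reads in this version; treat them as an assumption and check the
deployed release):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class TightReadTimeouts {
  // Illustrative only: shorter socket read timeout and fewer block-acquire retries,
  // so a read fails fast instead of hanging on a barely reachable datanode.
  public static FileSystem openWithTightTimeouts() throws java.io.IOException {
    Configuration conf = new Configuration();
    conf.setInt("dfs.socket.timeout", 20 * 1000);             // socket read timeout in ms
    conf.setInt("dfs.client.max.block.acquire.failures", 2);  // give up on a block sooner
    return FileSystem.get(conf);
  }
}
{code}

Even with these set aggressively, the only-slow-replica case still surfaces as an error or a stall
that the application has to deal with, which is why I don't think it addresses the basic issue.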




> DFS client blocked for a long time reading blocks of a file on the JobTracker
> -----------------------------------------------------------------------------
>
>                 Key: HADOOP-5286
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5286
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.20.0
>            Reporter: Hemanth Yamijala
>            Assignee: Raghu Angadi
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: jt-log-for-blocked-reads.txt
>
>
> On a large cluster, we've observed that the DFS client was blocked on reading a block of
> a file for almost one and a half hours. The file was being read by the JobTracker of the
> cluster, and was a split file of a job. In the NameNode logs, we observed that the block
> had a message as follows:
> Inconsistent size for block blk_2044238107768440002_840946 reported from <ip>:<port>
> current size is 195072 reported size is 1318567
> Details follow.
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

