hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (HADOOP-5903) DFSClient "Could not obtain block:..."
Date Sun, 24 May 2009 03:22:45 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-5903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas resolved HADOOP-5903.

    Resolution: Duplicate

Duplicate of HADOOP-3185, HADOOP-4681

> DFSClient "Could not obtain block:..."
> --------------------------------------
>                 Key: HADOOP-5903
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5903
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.3, 0.19.0, 0.19.1, 0.20.0
>            Reporter: stack
> We see this frequently in our application, hbase, where dfsclients are held open across
long periods of time. It seems that any hiccup fetching a block becomes a permanent black
mark: even though the serving datanode recovers from a temporary slowness or outage, the dfsclient
never seems to pick up on this fact.  Our perception is that it is too sensitive to the vagaries of cluster
comings and goings and gives up too easily, especially given that a fresh dfsclient has no
problem fetching the designated block.
> Chatting with Raghu and Hairong yesterday, Hairong pointed out that the dfsclient frequently
updates its list of block locations -- if a block has moved or if a datanode is dead, the
dfsclient should keep up with the changing state of the cluster (I see this happening in
DFSClient#chooseDatanode on failure), but it looks like Raghu put his finger on our problem
by noticing that the failures count is only incremented -- never decremented.  ANY three failures,
no matter how many blocks are in the file, and even if a block that failed once now works, are enough
for the DFSClient to start throwing "Could not obtain block:...".
> The failures counter needs to be a little smarter.  Would a patch that adds a map of
blocks to failure counts be the right way to go?  Failures should note the datanode that the
failure occurred against, so that if the datanode came back online (on retry) we could decrement
the mark that had been made against the block.
> What do folks think?
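The proposal above -- tracking failures per block and decrementing the mark when a fetch later succeeds -- could be sketched roughly as follows. This is a hypothetical illustration, not the actual patch; the class and method names (`BlockFailureTracker`, `recordFailure`, etc.) are invented for the example, and it keys on block id only, whereas the report suggests also noting the datanode involved.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: count fetch failures per block instead of keeping
// one global, increment-only counter for the whole client.
public class BlockFailureTracker {
    private final int maxFailures;
    private final Map<Long, Integer> failuresPerBlock = new HashMap<>();

    public BlockFailureTracker(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    // A failed fetch of blockId adds one mark against that block only.
    public void recordFailure(long blockId) {
        failuresPerBlock.merge(blockId, 1, Integer::sum);
    }

    // A successful fetch decrements the mark, so a transient datanode
    // hiccup is not a permanent black mark against the block.
    public void recordSuccess(long blockId) {
        failuresPerBlock.computeIfPresent(blockId,
                (id, n) -> n > 1 ? n - 1 : null); // drop entry at zero
    }

    // Give up only when THIS block has accumulated maxFailures marks.
    public boolean shouldAbort(long blockId) {
        return failuresPerBlock.getOrDefault(blockId, 0) >= maxFailures;
    }
}
```

With a threshold of three, a block that fails twice and then succeeds is back to one mark, so an unrelated later failure no longer pushes the client into "Could not obtain block:...".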

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
