hbase-dev mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-1084) Reinitializable DFS client
Date Wed, 24 Dec 2008 18:44:44 GMT

    [ https://issues.apache.org/jira/browse/HBASE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659139#action_12659139 ]

Andrew Purtell commented on HBASE-1084:
---------------------------------------

We had something exactly like this happen today:

https://issues.apache.org/jira/browse/HBASE-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659135#action_12659135

However, the affected region then appeared to have a missing block no matter where it was
reassigned (though I do not believe reassignment to the restarted regionserver was attempted).
A shutdown and restart of all regionservers was then necessary; the DFS daemons were left
alone. The newly started regionservers had no problems compacting and serving the formerly
affected region.


> Reinitializable DFS client
> --------------------------
>
>                 Key: HBASE-1084
>                 URL: https://issues.apache.org/jira/browse/HBASE-1084
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io, master, regionserver
>            Reporter: Andrew Purtell
>             Fix For: 0.20.0
>
>
> HBase is the only long-lived DFS client. Tasks handle DFS errors by dying. HBase daemons
> do not, and instead depend on the dfsclient's error recovery capability, but that is not
> sufficiently developed or tested. Several issues are a result:
> * HBASE-846: hbase looses its mind when hdfs fills
> * HBASE-879: When dfs restarts or moves blocks around, hbase regionservers don't notice
> * HBASE-932: Regionserver restart
> * HBASE-1078: "java.io.IOException: Could not obtain block": allthough file is there
>   and accessible through the dfs client
> * hlog indefinitely hung on getting new blocks from dfs on apurtell cluster
> * regions closed due to transient DFS problems during loaded cluster restart
> These issues might also be related:
> * HBASE-15: Could not complete hdfs write out to flush file forcing regionserver restart
> * HBASE-667: Hung regionserver; hung on hdfs: writeChunk, DFSClient.java:2126,
>   DataStreamer socketWrite
> HBase should reinitialize the fs a few times upon catching fs exceptions, with backoff,
> to compensate. This can be done by making a wrapper around all fs operations that releases
> references to the old fs instance and makes and initializes a new instance to retry. All fs
> users would need to be fixed up to handle loss of state around fs wrapper invocations: hlog,
> memcache flusher, hstore, etc.
> Cases of clear unrecoverable failure (are there any?) should be excepted.
> Once the fs wrapper is in place, error recovery scenarios can be tested by forcing
> reinitialization of the fs during PE or other test cases.
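
For illustration, here is a minimal sketch of the wrapper idea described in the quoted
proposal, written against the Hadoop FileSystem API. This is not HBase's actual
implementation: the names ReinitializingFs and FsOperation, the retry count, and the
backoff values are all hypothetical choices made for the sketch.

// A minimal sketch of the proposed fs wrapper, for illustration only.
// ReinitializingFs, FsOperation, MAX_RETRIES, and the backoff schedule
// are hypothetical, not anything in the HBase codebase.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

interface FsOperation<T> {
  T run(FileSystem fs) throws IOException;
}

class ReinitializingFs {
  private final Configuration conf;
  private FileSystem fs;
  private static final int MAX_RETRIES = 3;

  ReinitializingFs(Configuration conf) throws IOException {
    this.conf = conf;
    this.fs = FileSystem.get(conf);
  }

  // Run one fs operation; on IOException, release the old fs instance,
  // initialize a new one, and retry with exponential backoff.
  synchronized <T> T execute(FsOperation<T> op) throws IOException {
    IOException last = null;
    for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
      try {
        return op.run(fs);
      } catch (IOException e) {
        last = e;
        reinit();
        try {
          Thread.sleep(1000L << attempt); // backoff: 1s, 2s, 4s, ...
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw last;
        }
      }
    }
    throw last; // retries exhausted
  }

  // Drop the reference to the old instance and create a fresh one.
  private void reinit() throws IOException {
    try {
      fs.close();
    } catch (IOException ignored) {
      // the old instance may already be unusable
    }
    fs = FileSystem.get(conf); // may return a cached instance; see note below
  }
}

Note that FileSystem.get() returns cached instances by key, so a real implementation
would also need to evict or bypass Hadoop's FileSystem cache to actually obtain a fresh
DFS client. And as the description says, callers such as the hlog, memcache flusher, and
hstore that route operations through a wrapper like execute() would still need to handle
the loss of any state tied to the old instance (open streams, leases) across
reinitialization.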

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

