hbase-issues mailing list archives

From "Matteo Bertozzi (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low
Date Tue, 16 Jun 2015 20:30:02 GMT

     [ https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matteo Bertozzi updated HBASE-13832:
------------------------------------
    Attachment: HBASE-13832-v0.patch
                HDFSPipeline.java

Added a patch with more or less the same logic as FSHLog: it tries to roll the log on a sync
failure before aborting the master.
At some point I'll try to factor the FSHLog code out into a generic WAL, so that we have a single
implementation for rolling, encryption, and similar things.
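
A minimal sketch of the roll-before-abort idea described above, not the code in the attached
patch. The names `Log`, `FakeLog`, `Result`, and `syncOrRoll` are hypothetical and exist only
for illustration; the sketch assumes a single roll attempt, after which any further failure
leads to an abort.

```java
import java.io.IOException;

public class RollOnSyncFailure {

    /** Hypothetical stand-in for the procedure WAL; not an HBase interface. */
    interface Log {
        void sync() throws IOException;  // persist pending entries to the current log file
        void roll() throws IOException;  // close the current file and start a new one
    }

    enum Result { SYNCED, SYNCED_AFTER_ROLL, ABORT }

    /**
     * Try to sync. On failure, roll the log once and retry the sync.
     * ABORT is returned only if the roll or the retried sync also fails.
     */
    static Result syncOrRoll(Log log) {
        try {
            log.sync();
            return Result.SYNCED;
        } catch (IOException syncFailure) {
            try {
                log.roll();
                log.sync();
                return Result.SYNCED_AFTER_ROLL;
            } catch (IOException rollFailure) {
                return Result.ABORT;
            }
        }
    }

    /** In-memory fake: sync fails while "broken"; a successful roll clears that state. */
    static final class FakeLog implements Log {
        private boolean broken;
        private final boolean canRoll;
        int rolls = 0;

        FakeLog(boolean broken, boolean canRoll) {
            this.broken = broken;
            this.canRoll = canRoll;
        }

        @Override
        public void sync() throws IOException {
            if (broken) throw new IOException("simulated sync failure");
        }

        @Override
        public void roll() throws IOException {
            if (!canRoll) throw new IOException("simulated roll failure");
            broken = false;
            rolls++;
        }
    }

    public static void main(String[] args) {
        System.out.println(syncOrRoll(new FakeLog(false, true)));  // SYNCED
        System.out.println(syncOrRoll(new FakeLog(true, true)));   // SYNCED_AFTER_ROLL
        System.out.println(syncOrRoll(new FakeLog(true, false)));  // ABORT
    }
}
```

In this sketch the caller would abort the master only when `syncOrRoll` returns `ABORT`; a real
implementation would also need to decide how many rolls to attempt and what to do with entries
pending in the failed file.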

> Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-13832
>                 URL: https://issues.apache.org/jira/browse/HBASE-13832
>             Project: HBase
>          Issue Type: Sub-task
>          Components: master, proc-v2
>    Affects Versions: 2.0.0, 1.1.0, 1.2.0
>            Reporter: Stephen Yuan Jiang
>            Assignee: Stephen Yuan Jiang
>         Attachments: HBASE-13832-v0.patch, HDFSPipeline.java
>
>
> When there are fewer than 3 data nodes, WALProcedureStore#syncLoop() fails during master
> start. The failure prevents the master from starting.
> {noformat}
> 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] wal.WALProcedureStore: Sync slot failed, abort.
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3c7777ed-93f4-47b6-9c23-1426f7a6acdc,DISK], DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]], original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3c7777ed-93f4-47b6-9c23-1426f7a6acdc,DISK], DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
> {noformat}
> One proposal is to implement logic similar to FSHLog: if an IOException is thrown during
> syncLoop in WALProcedureStore#start(), then instead of aborting immediately we could try to
> roll the log and see whether this resolves the issue; if the new log cannot be created, or
> rolling the log throws further exceptions, we then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
