hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jonathan Hsieh (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-6626) Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.
Date Thu, 07 Aug 2014 20:35:14 GMT

    [ https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089791#comment-14089791

Jonathan Hsieh commented on HBASE-6626:

Lgtm misty. 

I'm going to commit this -- it is an improvement on the current vacuous state.  We can add
more improvements in follow up issues.

committed to trunk/branch-1.  Doesn't apply cleanly to 0.98 but i think we are ok just committing
to docs to branch-1 master since that is where they are pulled from these days.

> Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.
> ----------------------------------------------------------------------------------
>                 Key: HBASE-6626
>                 URL: https://issues.apache.org/jira/browse/HBASE-6626
>             Project: HBase
>          Issue Type: Improvement
>          Components: documentation
>    Affects Versions: 0.95.2
>            Reporter: Nicolas Liochon
>            Assignee: Misty Stanley-Jones
>            Priority: Blocker
>             Fix For: 0.99.0, 2.0.0
>         Attachments: HBASE-6626.patch, troubleshooting.txt
> I looked mainly at the major failure case, but here is what I have:
> New sub chapter in the existing chapter "Troubleshooting and Debugging HBase": "HDFS
> 1) HDFS & HBase
> 2) Connection related settings
> 2.1) Number of retries
> 2.2) Timeouts
> 3) Log samples
> 1) HDFS & HBase
> HBase uses HDFS to store its HFile, i.e. the core HBase files and the Write-Ahead-Logs,
i.e. the files that will be used to restore the data after a crash.
> In both cases, the reliability of HBase comes from the fact that HDFS writes the data
to multiple locations. To be efficient, HBase needs the data to be available locally, hence
it's highly recommended to have the HDFS datanode on the same machines as the HBase Region
> Detailed information on how HDFS works can be found at [1].
> Important features are:
>  - HBase is a client application of HDFS, i.e. uses the HDFS DFSClient class. This class
can appears in HBase logs with other HDFS client related logs.
>  - Some HDFS settings are HDFS-server-side, i.e. must be set on the HDFS side, while
some other are HDFS-client-side, i.e. must be set in HBase, while some other must be set in
both places.
>  - the HDFS writes are pipelined from one datanode to another. When writing, there are
communications between:
>     - HBase and HDFS namenode, through the HDFS client classes.
>     - HBase and HDFS datanodes, through the HDFS client classes.
>     - HDFS datanode between themselves: issues on these communications are in HDFS logs,
not HBase. HDFS writes are always local when possible. As a consequence, there should not
be much write error in HBase Region Servers: they write to the local datanode. If this datanode
can't replicate the blocks, it will appear in its logs, not in the region servers logs.
>  - datanodes can be contacted through the ipc.Client interface (once again this class
can shows up in HBase logs) and the data transfer interface (usually shows up as the DataNode
class in the HBase logs). There are on different ports (defaults being: 50010 and 50020).
>  - To understand exactly what's going on, you must look that the HDFS log files as well:
HBase logs represent the client side.
>  - With the default setting, HDFS needs 630s to mark a datanode as dead. For this reason,
this node will still be tried by HBase or by other datanodes when writing and reading until
HDFS definitively decides it's dead. This will add some extras lines in the logs. This monitoring
is performed by the NameNode.
>  - The HDFS clients (i.e. HBase using HDFS client code) don't fully rely on the NameNode,
but can mark temporally a node as dead if they had an error when they tried to use it.
> 2) Settings for retries and timeouts
> 2.1) Retries
> ipc.client.connect.max.retries
> Default 10
> Indicates the number of retries a client will make to establish a server connection.
Not taken into account if the error is a SocketTimeout. In this case the number of retries
is 45 (fixed on branch, HADOOP-7932 or in HADOOP-7397). For SASL, the number of retries is
hard-coded to 15. Can be increased, especially if the socket timeouts have been lowered.
> ipc.client.connect.max.retries.on.timeouts
> Default 45
> If you have HADOOP-7932, max number of retries on timeout. Counter is different than
ipc.client.connect.max.retries so if you mix the socket errors you will get 55 retries with
the default values. Could be lowered, once it is available. With HADOOP-7397 ipc.client.connect.max.retries
is reused so there would be 10 tries.
> dfs.client.block.write.retries
> Default 3
> Number of tries for the client when writing a block. After a failure, will connect to
the namenode a get a new location, sending the list of the datanodes already tried without
success. Could be increased, especially if the socket timeouts have been lowered. See HBASE-6490.
> dfs.client.block.write.locateFollowingBlock.retries
> Default 5
> Number of retries to the namenode when the client got NotReplicatedYetException, i.e.
the existing nodes of the files are not yet replicated to dfs.replication.min. This should
not impact HBase, as dfs.replication.min is defaulted to 1.
> dfs.client.max.block.acquire.failures
> Default 3
> Number of tries to read a block from the datanodes list. In other words, if 5 datanodes
are supposed to hold a block (so dfs.replication equals to 5), the client will try all these
datanodes, then check the value of dfs.client.max.block.acquire.failures to see if it should
retry or not. If so, it will get a new list (likely the same), and will try to reconnect again
to all these 5 datanodes. COuldbe be increased, especially if the socket timeouts have been
> 2.2) Timeouts
> 2.3.1) Heatbeats
> dfs.heartbeat.interval
>  Default is 3s
> heartbeat.recheck.interval = 300s
>  Defaults is 300S
> A datanode is considered as dead when there is no heartbeat for (2 * heartbeat.recheck.interval
+ 10 * dfs.heartbeat.interval) seconds. That's 630s.  So before the 10:30 minutes, the datanode
is considered as fully available by the namenode.  After this delay, HDFS is likely to start
replicating the blocks contained in the dead node to get back to the right number of replica.
As a consequence, if we're too aggressive we will have a side effect here, adding workload
to an already damaged cluster. For this reason it's not recommended to change these settings.
> As there are communications between the datanodes, and as they share these settings,
these settings are both HDFS-client-side and HDFS-server-side.
> 2.3.2) Socket timeouts
> 3 timeouts are considered in HDFS:
>  - connect timeout: the timeout when we tried to establish the connection
>  - read timeout: the timeout when we read something on an already established connection
>  - write timeout: the timeout when we try to write something on an already established
> They are managed by two settings:
> dfs.socket.timeout
> Default 60s
> dfs.datanode.socket.write.timeout
> Default is 480s.
> But these setting are used:
> - between the DFSClient and the datanode
> - between the ipc.Client and the datanodes
> - Between the datanodes
> - sometimes but not always with an extension (depending on the number of replica)
> - for dfs.socket.timeout as a socket connect timeout but as well as a socket read timeout.
> - for dfs.datanode.socket.write.timeout, when it's set to 0, a plain old java socket
is created in some cases instead of a NIO.
> final calculated connect timeout can be:
>  hard-coded to 20s for the the ipc.Client in Hadoop 1.0.3 (changed in HADOOP-7397)
>  dfs.socket.timeout  (ex: DataNode#DataTransfer, DataXceiver#replaceBlock)
>  dfs.socket.timeout + 3s*#replica  (ex: DataXceiver#write, DFSClient#getFileChecksum
called from FileCheckSumServlet)
> final read timeouts can be:
>  dfs.socket.timeout  (DataXceiver#replaceBlock, ipc.Client from DFSClient)
>  dfs.socket.timeout +  3s*#replica  (ex: DataNode#DataTransfer, DataXceiver#write)
>  dfs.socket.timeout * #replica (ex: DataNode#DataTransfer)
> final calculated write timeouts can be:
>  dfs.datanode.socket.write.timeout (ex DataXceiver#copyBlock/readBlock/...)
>  dfs.datanode.socket.write.timeout + 5s*#replica) (ex DFSClient#createBlockOutputStream,
>  dfs.datanode.socket.write.timeout + 5s*(#replica -1) (ex: DataNode#DataTransfer. See
> Hence we will often see a 69000 timeout in the logs before the datanode is marked dead/excluded.
Also, setting "dfs.socket.timeout" to 0 does not make it wait forever, but likely 9 seconds
instead of 69s for data transfer.
> 3) Typical error logs.
> 3.1) Typical logs when all datanode for a block are dead, making the HBase recovery impossible.
HBase master logs will contain, with a 0.90 HBase:
> INFO HDFS.DFSClient: Failed to connect to /xxx50010, add to deadNodes and continue java.net.SocketTimeoutException:
60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending
> => The client tries to connect to a dead datanode.
> => It failed, so the client will try the next datanode in the list. Usually the list
size is 3 (dfs.replication).
> => If the final list is empty, it means that all the datanodes proposed by the namenode
are in our datanodes list.
> => The HDFS client clears the dead nodes list and sleeps 3 seconds (hard-coded), shallowing
InterruptedException, and asks again to the namenode. This is the log line:
> INFO HDFS.DFSClient: Could not obtain block blk_xxx from any node: java.io.IOException:
No live nodes contain current block. Will get new block locations from namenode and retry...
> => All the locations initially given by the namenode to this client are actually dead.
The client asks for a new set of locations.
> => We're very likely to have exactly the same datanode list as 3 seconds ago, except
if a Datanode came back to life or if a replication has just finished.
> => After dfs.client.max.block.acquire.failures (default: 3), an exception is thrown,
then logged, and we have in the logs:
> WARN HDFS.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_xxx file=/hbase/.logs/boxxxx,60020,xxx/xxx%3A60020.yyy
> => There is another retry, hard-coded to 2, but this is logged only once, even if
the second try fails.
> => Moreover, for the second try the errors counters are not reinitialized, including
the dead nodes list, so this second attempt is unlikely to succeed. It should come again with
an empty node list, and throw a new java.io.IOException: Could not obtain block: blk_xxx file=/hbase/.logs/boxxxx,60020,xxx/xxx%3A60020.yyy
> => This exception will go to the final client (hbase). HBase will log it, and we will
> INFO wal.HLogSplitter: Got while parsing hlog HDFS://namodenode:8020//hbase/.logs/boxxxx,60020,xxx/xxx60020.yyy.
Marking as corrupted java.io.IOException: Could not obtain block: blk_xxx file=/hbase/.logs/boxxxx,60020,xxx/xxx60020.yyy
> 3.2) Typical log for write issues: the master reads the log, then wants to split it,
hence writing a block:
> INFO org.apache.hadoop.HDFS.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
69000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending
remote=/ xxx:50010]
> => We tried to connect to the dead datanode to write. Likely from the master (it does
not have a datanode, so it connects to a remote datanode).
> => A region server will not have this type of error, as it connects to a local datanode
to write.
> => It failed at the beginning, we cannot connect at all (i.e. not during the write
> INFO HDFS.DFSClient: Abandoning block blk_xxx
> => HBase (as a HDFS client) told to the namenode that the block is not written.
> INFO HDFS.DFSClient: Excluding datanode xxx:50010
> => Internally in HDFS client the stream puts it in the excludedNodes list (the "Excluding
datanode" log line ).
> => The HDFS client is going again to the namenode asking for another datanode set
proposal, sending the excluded datanode list to be sure it's not trying on the same nodes
> => There will be 3 retries by default. If you've lost 20% of your cluster 1% of the
time the 3 attempts will fail. Setting: "dfs.client.block.write.retries". If it's the case
(i.e. all attempts failed), next log line is:
> WARN HDFS.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new
block. And then, if it was for a split log:
> FATAL wal.HLogSplitter: WriterThread-xxx Got while writing log entry to log (various
possible stacks here)

This message was sent by Atlassian JIRA

View raw message