From "nkeywal (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-6626) Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.
Date Tue, 21 Aug 2012 13:33:38 GMT
nkeywal created HBASE-6626:
------------------------------

             Summary: Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.
                 Key: HBASE-6626
                 URL: https://issues.apache.org/jira/browse/HBASE-6626
             Project: HBase
          Issue Type: Improvement
          Components: documentation
    Affects Versions: 0.96.0
            Reporter: nkeywal
            Priority: Minor


I looked mainly at the major failure case, but here is what I have:

New sub chapter in the existing chapter "Troubleshooting and Debugging HBase": "HDFS & HBase"

1) HDFS & HBase
2) Settings for retries and timeouts
2.1) Retries
2.2) Timeouts
3) Typical error logs


1) HDFS & HBase
HBase uses HDFS to store both its HFiles, i.e. the core HBase data files, and its Write-Ahead-Logs, i.e.
the files that will be used to restore the data after a crash.
In both cases, the reliability of HBase comes from the fact that HDFS writes the data to multiple
locations. To be efficient, HBase needs the data to be available locally, hence it's highly
recommended to have an HDFS datanode on the same machines as the HBase Region Servers.

Detailed information on how HDFS works can be found at [1].

Important features are:
 - HBase is a client application of HDFS, i.e. it uses the HDFS DFSClient class. This class can
appear in HBase logs along with other HDFS-client-related log entries.
 - Some HDFS settings are HDFS-server-side, i.e. must be set on the HDFS side, some others
are HDFS-client-side, i.e. must be set in HBase, while some others must be set in both
places.
 - HDFS writes are pipelined from one datanode to another. When writing, there are communications
between:
    - HBase and the HDFS namenode, through the HDFS client classes.
    - HBase and the HDFS datanodes, through the HDFS client classes.
    - the HDFS datanodes between themselves: issues in these communications appear in the HDFS logs, not
in the HBase logs. HDFS writes are always local when possible. As a consequence, there should not be many
write errors in the HBase Region Servers: they write to the local datanode. If this datanode can't
replicate the blocks, it will appear in its logs, not in the region server logs.
 - Datanodes can be contacted through the ipc.Client interface (once again, this class can
show up in HBase logs) and through the data transfer interface (which usually shows up as the DataNode
class in the HBase logs). These are on different ports (defaults being: 50010 and 50020).
 - To understand exactly what's going on, you must look at the HDFS log files as well: the HBase
logs represent the client side only.
 - With the default settings, HDFS needs 630s to mark a datanode as dead. Until then, this
node will still be tried by HBase or by other datanodes when writing and reading, until
HDFS definitively decides it's dead. This will add some extra lines in the logs. This monitoring
is performed by the NameNode.
 - The HDFS clients (i.e. HBase using the HDFS client code) don't fully rely on the NameNode:
they can temporarily mark a node as dead if they got an error when they tried to use it.

2) Settings for retries and timeouts
2.1) Retries
ipc.client.connect.max.retries
Default 10
Indicates the number of retries a client will make to establish a server connection. Not taken
into account if the error is a SocketTimeout: in this case the number of retries is 45 (fixed
on branch by HADOOP-7932, or by HADOOP-7397). For SASL, the number of retries is hard-coded to
15. Can be increased, especially if the socket timeouts have been lowered.

ipc.client.connect.max.retries.on.timeouts
Default 45
If you have HADOOP-7932, this is the max number of retries on timeouts. The counter is distinct from
ipc.client.connect.max.retries, so if you mix socket errors and timeouts you can get 55 retries with
the default values. Could be lowered, once it is available. With HADOOP-7397, ipc.client.connect.max.retries
is reused, so there would be 10 tries.

dfs.client.block.write.retries
Default 3
Number of tries for the client when writing a block. After a failure, it will connect to the
namenode to get a new location, sending the list of the datanodes already tried without success.
Could be increased, especially if the socket timeouts have been lowered. See HBASE-6490.

dfs.client.block.write.locateFollowingBlock.retries
Default 5
Number of retries to the namenode when the client gets a NotReplicatedYetException, i.e. the
existing blocks of the file are not yet replicated to dfs.replication.min. This should not
impact HBase, as dfs.replication.min defaults to 1.

dfs.client.max.block.acquire.failures
Default 3
Number of tries to read a block from the datanode list. In other words, if 5 datanodes are
supposed to hold a block (so dfs.replication equals 5), the client will try all these datanodes,
then check the value of dfs.client.max.block.acquire.failures to see if it should retry or
not. If so, it will get a new list (likely the same), and will try to reconnect again to all
these 5 datanodes. Could be increased, especially if the socket timeouts have been lowered.
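
As an illustration, these HDFS-client-side retry settings can be set from the HBase side through the
standard Hadoop Configuration mechanism (in practice via hbase-site.xml). The sketch below simply sets
them to the defaults listed above; it is not a recommendation, the right values depend on the cluster:

// Sketch only: the HDFS-client-side retry settings discussed above, set
// explicitly through the standard Hadoop Configuration API. The values are
// the defaults from this description, not recommendations.
import org.apache.hadoop.conf.Configuration;

public class HdfsClientRetrySettings {
  public static Configuration example() {
    Configuration conf = new Configuration();
    // Retries to establish a server connection (not used for SocketTimeout).
    conf.setInt("ipc.client.connect.max.retries", 10);
    // Retries on connect timeouts (only if HADOOP-7932 is available).
    conf.setInt("ipc.client.connect.max.retries.on.timeouts", 45);
    // Tries when writing a block before giving up (see HBASE-6490).
    conf.setInt("dfs.client.block.write.retries", 3);
    // Retries to the namenode on NotReplicatedYetException.
    conf.setInt("dfs.client.block.write.locateFollowingBlock.retries", 5);
    // Rounds of "try every datanode in the list" before a read fails.
    conf.setInt("dfs.client.max.block.acquire.failures", 3);
    return conf;
  }
}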
 

2.2) Timeouts
2.2.1) Heartbeats
dfs.heartbeat.interval
Default is 3s

heartbeat.recheck.interval
Default is 300s

A datanode is considered dead when there is no heartbeat for (2 * heartbeat.recheck.interval
+ 10 * dfs.heartbeat.interval) seconds. That's 630s. So until these 10 minutes 30 seconds have
elapsed, the datanode is still considered fully available by the namenode. After this delay, HDFS
is likely to start replicating the blocks contained in the dead node to get back to the right
number of replicas. As a consequence, lowering these values too aggressively has a side effect:
it adds replication workload to an already damaged cluster. For this reason it's not recommended
to change these settings.

As there are communications between the datanodes, and as they share these settings, these
settings are both HDFS-client-side and HDFS-server-side.
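
For the record, a tiny sketch of where the 630s comes from (values in seconds, as described above;
nothing is read from a config file here):

public class DeadNodeDelay {
  public static void main(String[] args) {
    int dfsHeartbeatInterval = 3;        // dfs.heartbeat.interval
    int heartbeatRecheckInterval = 300;  // heartbeat.recheck.interval
    int delay = 2 * heartbeatRecheckInterval + 10 * dfsHeartbeatInterval;
    System.out.println(delay + "s");     // prints "630s", i.e. 10 minutes 30 seconds
  }
}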

2.2.2) Socket timeouts
Three timeouts are considered in HDFS:
 - connect timeout: the timeout when we try to establish the connection
 - read timeout: the timeout when we read something on an already established connection
 - write timeout: the timeout when we try to write something on an already established connection.


They are managed by two settings:

dfs.socket.timeout
Default 60s

dfs.datanode.socket.write.timeout
Default is 480s.

But these settings are used:
- between the DFSClient and the datanodes
- between the ipc.Client and the datanodes
- between the datanodes themselves
- sometimes, but not always, with an extension (depending on the number of replicas)
- for dfs.socket.timeout: as a socket connect timeout but also as a socket read timeout.
- for dfs.datanode.socket.write.timeout: when it's set to 0, a plain old java socket is created
in some cases instead of a NIO one.


The final calculated connect timeout can be:
 hard-coded to 20s for the ipc.Client in Hadoop 1.0.3 (changed in HADOOP-7397)
 dfs.socket.timeout  (ex: DataNode#DataTransfer, DataXceiver#replaceBlock)
 dfs.socket.timeout + 3s*#replica  (ex: DataXceiver#write, DFSClient#getFileChecksum called
from FileCheckSumServlet)

The final calculated read timeout can be:
 dfs.socket.timeout  (ex: DataXceiver#replaceBlock, ipc.Client from DFSClient)
 dfs.socket.timeout + 3s*#replica  (ex: DataNode#DataTransfer, DataXceiver#write)
 dfs.socket.timeout * #replica  (ex: DataNode#DataTransfer)

The final calculated write timeout can be:
 dfs.datanode.socket.write.timeout  (ex: DataXceiver#copyBlock/readBlock/...)
 dfs.datanode.socket.write.timeout + 5s*#replica  (ex: DFSClient#createBlockOutputStream, DataXceiver#writeBlock)
 dfs.datanode.socket.write.timeout + 5s*(#replica - 1)  (ex: DataNode#DataTransfer. See HADOOP-5464).

Hence we will often see a 69000 ms timeout in the logs before the datanode is marked dead/excluded.
Also, setting "dfs.socket.timeout" to 0 does not make it wait forever: it will likely wait 9 seconds
instead of 69 seconds for data transfer.
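
To make the numbers concrete, here is a small sketch of the timeout extensions listed above. This is
not HDFS code, just the arithmetic: with dfs.socket.timeout = 60s and 3 replicas we get the 69000 millis
seen in the logs, and with dfs.socket.timeout = 0 we get the 9 seconds mentioned above:

public class TimeoutExtensions {
  // dfs.socket.timeout + 3s per replica (the connect/read extension described above).
  static int connectOrReadTimeoutMillis(int dfsSocketTimeoutMillis, int replicas) {
    return dfsSocketTimeoutMillis + 3000 * replicas;
  }

  // dfs.datanode.socket.write.timeout + 5s per replica (the write extension described above).
  static int writeTimeoutMillis(int dfsDatanodeSocketWriteTimeoutMillis, int replicas) {
    return dfsDatanodeSocketWriteTimeoutMillis + 5000 * replicas;
  }

  public static void main(String[] args) {
    System.out.println(connectOrReadTimeoutMillis(60000, 3));  // 69000
    System.out.println(connectOrReadTimeoutMillis(0, 3));      // 9000, the "9 seconds" case
    System.out.println(writeTimeoutMillis(480000, 3));         // 495000
  }
}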


3) Typical error logs.
3.1) Typical logs when all the datanodes for a block are dead, making the HBase recovery impossible.
With a 0.90 HBase, the HBase master logs will contain:
INFO hdfs.DFSClient: Failed to connect to /xxx:50010, add to deadNodes and continue java.net.SocketTimeoutException:
60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending
remote=/region-server-1:50010]
=> The client tries to connect to a dead datanode.
=> It failed, so the client will try the next datanode in the list. Usually the list size
is 3 (dfs.replication).
=> If the final list is empty, it means that all the datanodes proposed by the namenode
are already in our dead nodes list.
=> The HDFS client clears the dead nodes list, sleeps 3 seconds (hard-coded), swallowing
InterruptedException, and asks the namenode again. This is the log line:
INFO hdfs.DFSClient: Could not obtain block blk_xxx from any node: java.io.IOException: No
live nodes contain current block. Will get new block locations from namenode and retry...
=> All the locations initially given by the namenode to this client are actually dead.
The client asks for a new set of locations.
=> We're very likely to get exactly the same datanode list as 3 seconds ago, unless a
datanode came back to life or a replication has just finished.
=> After dfs.client.max.block.acquire.failures rounds (default: 3), an exception is thrown, then
logged, and we have in the logs:
WARN hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_xxx file=/hbase/.logs/boxxxx,60020,xxx/xxx%3A60020.yyy
=> There is another retry, hard-coded to 2, but this is logged only once, even if the second
try fails.
=> Moreover, for the second try the error counters are not reinitialized, including the
dead nodes list, so this second attempt is unlikely to succeed. It should come back with
an empty node list, and throw a new java.io.IOException: Could not obtain block: blk_xxx file=/hbase/.logs/boxxxx,60020,xxx/xxx%3A60020.yyy
=> This exception will go to the final client (HBase). HBase will log it, and we will see:
INFO wal.HLogSplitter: Got while parsing hlog hdfs://namenode:8020//hbase/.logs/boxxxx,60020,xxx/xxx60020.yyy.
Marking as corrupted java.io.IOException: Could not obtain block: blk_xxx file=/hbase/.logs/boxxxx,60020,xxx/xxx60020.yyy
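
To summarize the sequence above, here is a schematic sketch of the read retry behaviour. This is NOT
the actual DFSClient code and the helper methods are hypothetical stubs; it only mirrors the control
flow described by the log lines above:

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReadRetrySketch {
  static final int MAX_ACQUIRE_FAILURES = 3; // dfs.client.max.block.acquire.failures

  void readBlock(String block) throws IOException {
    Set<String> deadNodes = new HashSet<String>();
    int failures = 0;
    while (true) {
      // Ask the namenode for the datanodes holding the block (hypothetical stub).
      for (String datanode : locateBlock(block, deadNodes)) {
        if (tryRead(datanode, block)) {
          return;                  // success
        }
        deadNodes.add(datanode);   // "add to deadNodes and continue"
      }
      // "No live nodes contain current block": after the configured number of
      // rounds we give up, otherwise we clear the dead nodes, sleep 3 seconds
      // (hard-coded, swallowing InterruptedException) and ask the namenode again.
      failures++;
      if (failures > MAX_ACQUIRE_FAILURES) {
        throw new IOException("Could not obtain block: " + block);
      }
      deadNodes.clear();
      try { Thread.sleep(3000); } catch (InterruptedException ignored) { }
    }
  }

  // Hypothetical stubs, standing in for the real namenode/datanode calls.
  List<String> locateBlock(String block, Set<String> deadNodes) { return new ArrayList<String>(); }
  boolean tryRead(String datanode, String block) { return false; }
}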


3.2) Typical log for write issues: the master reads the log, then wants to split it, hence
writing a block:
INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
69000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending
remote=/xxx:50010]
=> We tried to connect to the dead datanode to write, likely from the master (it does not
have a datanode, so it connects to a remote datanode).
=> A region server will not have this type of error, as it connects to a local datanode
to write.
=> It failed at the very beginning: we could not connect at all (i.e. the failure is not during
the write itself).
INFO hdfs.DFSClient: Abandoning block blk_xxx
=> HBase (as an HDFS client) told the namenode that the block was not written.
INFO hdfs.DFSClient: Excluding datanode xxx:50010
=> Internally, the HDFS client stream puts it in the excludedNodes list (hence the "Excluding
datanode" log line).
=> The HDFS client goes to the namenode again, asking for another datanode set proposal and
sending the excluded datanode list to make sure it does not try the same nodes again.
=> There will be 3 retries by default (setting: "dfs.client.block.write.retries"). If you've
lost 20% of your cluster, roughly 1% of the time all 3 attempts will fail. If that's the case
(i.e. all attempts failed), the next log line is:
WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
And then, if it was for a split log:
FATAL wal.HLogSplitter: WriterThread-xxx Got while writing log entry to log (various possible
stacks here)
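
Same kind of schematic sketch for the write side. Again, this is NOT the actual DFSClient code and
the helpers are hypothetical stubs; it only mirrors the sequence described above:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class WriteRetrySketch {
  static final int BLOCK_WRITE_RETRIES = 3; // dfs.client.block.write.retries

  void writeBlock(byte[] data) throws IOException {
    Set<String> excludedNodes = new HashSet<String>();
    for (int attempt = 0; attempt < BLOCK_WRITE_RETRIES; attempt++) {
      // Ask the namenode for a new block location, sending the excluded datanode list.
      String targetDatanode = allocateBlock(excludedNodes);
      if (createBlockOutputStream(targetDatanode, data)) {
        return;                           // pipeline established, write proceeds
      }
      abandonBlock();                     // "Abandoning block blk_xxx"
      excludedNodes.add(targetDatanode);  // "Excluding datanode xxx:50010"
    }
    throw new IOException("Unable to create new block.");
  }

  // Hypothetical stubs, standing in for the real namenode/datanode calls.
  String allocateBlock(Set<String> excludedNodes) { return "xxx:50010"; }
  boolean createBlockOutputStream(String datanode, byte[] data) { return false; }
  void abandonBlock() { }
}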

