Date: Wed, 22 Aug 2012 03:52:38 +1100 (NCT)
From: "stack (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-6626) Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.

    [ https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438849#comment-13438849 ]

stack commented on HBASE-6626:
------------------------------

Doug, you want to take on this one?

> Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-6626
>                 URL: https://issues.apache.org/jira/browse/HBASE-6626
>             Project: HBase
>          Issue Type: Improvement
>          Components: documentation
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Priority: Minor
>         Attachments: troubleshooting.txt
>
>
> I looked mainly at the major failure case, but here is what I have:
> New sub-chapter in the existing chapter "Troubleshooting and Debugging HBase": "HDFS & HBase"
>
> 1) HDFS & HBase
> 2) Connection related settings
> 2.1) Number of retries
> 2.2) Timeouts
> 3) Log samples
>
> 1) HDFS & HBase
>
> HBase uses HDFS to store its HFiles, i.e. the core HBase files, and its Write-Ahead Logs, i.e. the files that will be used to restore the data after a crash.
> In both cases, the reliability of HBase comes from the fact that HDFS writes the data to multiple locations. To be efficient, HBase needs the data to be available locally, hence it is highly recommended to run an HDFS datanode on the same machines as the HBase Region Servers.
> Detailed information on how HDFS works can be found at [1].
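To check the locality point above in practice, here is a minimal sketch, assuming only the standard Hadoop FileSystem API: it prints which datanodes hold each block of a given file under /hbase. The class name and the path argument are placeholders, not anything prescribed by the issue text.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path(args[0]);  // e.g. an HFile or HLog under /hbase
    FileStatus st = fs.getFileStatus(p);
    // One BlockLocation per block; each lists the datanodes holding a replica.
    for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
      System.out.println(loc.getOffset() + " -> "
          + java.util.Arrays.toString(loc.getHosts()));
    }
  }
}
{code}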
>
> Important features are:
> - HBase is a client application of HDFS, i.e. it uses the HDFS DFSClient class. This class can appear in HBase logs alongside other HDFS client related logs.
> - Some HDFS settings are HDFS-server-side, i.e. they must be set on the HDFS side; some others are HDFS-client-side, i.e. they must be set in HBase; and some must be set in both places.
> - HDFS writes are pipelined from one datanode to another. When writing, there are communications between:
>   - HBase and the HDFS namenode, through the HDFS client classes;
>   - HBase and the HDFS datanodes, through the HDFS client classes;
>   - the HDFS datanodes themselves: issues in these communications show up in the HDFS logs, not in the HBase logs. HDFS writes are always local when possible. As a consequence, there should not be many write errors in the HBase Region Servers: they write to the local datanode. If this datanode cannot replicate the blocks, that will appear in its logs, not in the region server logs.
> - Datanodes can be contacted through the ipc.Client interface (once again, this class can show up in HBase logs) and through the data transfer interface (which usually shows up as the DataNode class in the HBase logs). These are on different ports (defaults: 50020 for IPC and 50010 for data transfer).
> - To understand exactly what is going on, you must look at the HDFS log files as well: the HBase logs represent only the client side.
> - With the default settings, HDFS needs 630s to mark a datanode as dead. Until then, this node will still be tried by HBase or by other datanodes when writing and reading, until HDFS definitively decides it is dead. This adds extra lines to the logs. This monitoring is performed by the NameNode.
> - The HDFS clients (i.e. HBase using the HDFS client code) do not fully rely on the NameNode: they can temporarily mark a node as dead if they had an error when they tried to use it.
>
> 2) Settings for retries and timeouts
>
> 2.1) Retries
>
> ipc.client.connect.max.retries
> Default: 10
> Number of retries a client will make to establish a server connection. Not taken into account if the error is a SocketTimeout; in that case the number of retries is 45 (fixed on branch, see HADOOP-7932 or HADOOP-7397). For SASL, the number of retries is hard-coded to 15. Can be increased, especially if the socket timeouts have been lowered.
>
> ipc.client.connect.max.retries.on.timeouts
> Default: 45
> If you have HADOOP-7932, the maximum number of retries on timeout. The counter is separate from ipc.client.connect.max.retries, so if you mix socket errors you will get 55 retries with the default values. Could be lowered, once it is available. With HADOOP-7397, ipc.client.connect.max.retries is reused, so there would be 10 retries.
>
> dfs.client.block.write.retries
> Default: 3
> Number of retries for the client when writing a block. After a failure, the client will connect to the namenode and get a new location, sending the list of the datanodes already tried without success. Could be increased, especially if the socket timeouts have been lowered. See HBASE-6490.
>
> dfs.client.block.write.locateFollowingBlock.retries
> Default: 5
> Number of retries to the namenode when the client gets a NotReplicatedYetException, i.e. the existing blocks of the file are not yet replicated to dfs.replication.min. This should not impact HBase, as dfs.replication.min defaults to 1.
>
> dfs.client.max.block.acquire.failures
> Default: 3
> Number of attempts to read a block from the datanode list. In other words, if 5 datanodes are supposed to hold a block (so dfs.replication equals 5), the client will try all of these datanodes, then check the value of dfs.client.max.block.acquire.failures to see if it should retry or not. If so, it will get a new list (likely the same) and will try to reconnect again to all these 5 datanodes. Could be increased, especially if the socket timeouts have been lowered.
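Since the retry settings above are HDFS-client-side, they belong on the HBase side of the deployment. A minimal sketch, assuming the usual Hadoop/HBase Configuration API; the non-default values are arbitrary illustrations, not recommendations, and the same keys can equally go in hbase-site.xml:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ClientRetryTuning {
  public static Configuration tunedConf() {
    // HDFS *client* settings: they are read by the HBase processes
    // (region servers, master), not by the HDFS daemons.
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("ipc.client.connect.max.retries", 10);              // default 10
    conf.setInt("ipc.client.connect.max.retries.on.timeouts", 10);  // example; default 45, only with HADOOP-7932
    conf.setInt("dfs.client.block.write.retries", 10);              // example; default 3 ("could be increased")
    conf.setInt("dfs.client.block.write.locateFollowingBlock.retries", 5); // default 5
    conf.setInt("dfs.client.max.block.acquire.failures", 6);        // example; default 3 ("could be increased")
    return conf;
  }
}
{code}

The point is only to show where these knobs live; whether a given value makes sense depends on how far the socket timeouts below have been lowered.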
>
> 2.2) Timeouts
>
> 2.2.1) Heartbeats
>
> dfs.heartbeat.interval
> Default: 3s
>
> heartbeat.recheck.interval
> Default: 300s
>
> A datanode is considered dead when there has been no heartbeat for (2 * heartbeat.recheck.interval + 10 * dfs.heartbeat.interval) seconds, i.e. 630s. So for the first 10 minutes 30 seconds, the datanode is still considered fully available by the namenode. After this delay, HDFS is likely to start replicating the blocks contained on the dead node to get back to the right number of replicas. As a consequence, being too aggressive here has a side effect: it adds workload to an already damaged cluster. For this reason it is not recommended to change these settings.
> As there are communications between the datanodes, and as they share these settings, these settings are both HDFS-client-side and HDFS-server-side.
>
> 2.2.2) Socket timeouts
>
> Three timeouts are considered in HDFS:
> - connect timeout: the timeout when we try to establish a connection;
> - read timeout: the timeout when we read something on an already established connection;
> - write timeout: the timeout when we try to write something on an already established connection.
>
> They are managed by two settings:
>
> dfs.socket.timeout
> Default: 60s
>
> dfs.datanode.socket.write.timeout
> Default: 480s
>
> But these settings are used:
> - between the DFSClient and the datanodes;
> - between the ipc.Client and the datanodes;
> - between the datanodes themselves;
> - sometimes, but not always, with an extension (depending on the number of replicas);
> - for dfs.socket.timeout, as a socket connect timeout but also as a socket read timeout;
> - for dfs.datanode.socket.write.timeout, when it is set to 0, a plain old java socket is created in some cases instead of a NIO one.
>
> The final calculated connect timeout can be:
> - hard-coded to 20s for the ipc.Client in Hadoop 1.0.3 (changed in HADOOP-7397)
> - dfs.socket.timeout (ex: DataNode#DataTransfer, DataXceiver#replaceBlock)
> - dfs.socket.timeout + 3s * #replicas (ex: DataXceiver#writeBlock, DFSClient#getFileChecksum called from FileChecksumServlet)
>
> The final calculated read timeouts can be:
> - dfs.socket.timeout (ex: DataXceiver#replaceBlock, ipc.Client from DFSClient)
> - dfs.socket.timeout + 3s * #replicas (ex: DataNode#DataTransfer, DataXceiver#writeBlock)
> - dfs.socket.timeout * #replicas (ex: DataNode#DataTransfer)
>
> The final calculated write timeouts can be:
> - dfs.datanode.socket.write.timeout (ex: DataXceiver#copyBlock/readBlock/...)
> - dfs.datanode.socket.write.timeout + 5s * #replicas (ex: DFSClient#createBlockOutputStream, DataXceiver#writeBlock)
> - dfs.datanode.socket.write.timeout + 5s * (#replicas - 1) (ex: DataNode#DataTransfer, see HADOOP-5464)
>
> Hence we will often see a 69000 ms timeout in the logs before the datanode is marked dead/excluded. Also, setting dfs.socket.timeout to 0 does not make it wait forever, but gives roughly 9 seconds instead of 69s for data transfer.
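To make the arithmetic above concrete, a small self-contained sketch, assuming the default values and dfs.replication = 3, and the 3s/5s per-replica extensions described above; it reproduces the 69000 ms figure seen in the logs and the 630 s dead-node delay:

{code}
public class HdfsTimeoutMath {
  public static void main(String[] args) {
    int replicas = 3;                  // dfs.replication
    int socketTimeoutMs = 60000;       // dfs.socket.timeout
    int writeTimeoutMs = 480000;       // dfs.datanode.socket.write.timeout

    // Read/connect timeout with the per-replica extension:
    int readTimeoutMs = socketTimeoutMs + 3000 * replicas;            // 69000 ms
    // Write timeout with its per-replica extension:
    int writeTimeoutWithExtMs = writeTimeoutMs + 5000 * replicas;     // 495000 ms

    // Namenode-side dead-node detection:
    int heartbeatIntervalS = 3;        // dfs.heartbeat.interval
    int recheckIntervalS = 300;        // heartbeat.recheck.interval
    int deadAfterS = 2 * recheckIntervalS + 10 * heartbeatIntervalS;  // 630 s = 10 min 30 s

    System.out.println("effective read/connect timeout: " + readTimeoutMs + " ms");
    System.out.println("effective write timeout       : " + writeTimeoutWithExtMs + " ms");
    System.out.println("datanode declared dead after  : " + deadAfterS + " s");
  }
}
{code}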
>
> 3) Typical error logs
>
> 3.1) Typical logs when all the datanodes for a block are dead, making the HBase recovery impossible. With a 0.90 HBase, the master logs will contain:
>
> INFO hdfs.DFSClient: Failed to connect to /xxx:50010, add to deadNodes and continue java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/region-server-1:50010]
>
> => The client tries to connect to a dead datanode.
> => It failed, so the client will try the next datanode in the list. Usually the list size is 3 (dfs.replication).
> => If the final list is empty, it means that all the datanodes proposed by the namenode are already in our dead nodes list.
> => The HDFS client clears the dead nodes list, sleeps 3 seconds (hard-coded, swallowing InterruptedException), and asks the namenode again. This is the log line:
>
> INFO hdfs.DFSClient: Could not obtain block blk_xxx from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
>
> => All the locations initially given by the namenode to this client are actually dead. The client asks for a new set of locations.
> => We are very likely to get exactly the same datanode list as 3 seconds ago, unless a datanode came back to life or a replication has just finished.
> => After dfs.client.max.block.acquire.failures (default: 3), an exception is thrown, then logged, and we have in the logs:
>
> WARN hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_xxx file=/hbase/.logs/boxxxx,60020,xxx/xxx%3A60020.yyy
>
> => There is another retry, hard-coded to 2, but it is logged only once, even if the second try fails.
> => Moreover, for the second try the error counters are not reinitialized, including the dead nodes list, so this second attempt is unlikely to succeed. It will come back with an empty node list again and throw a new java.io.IOException: Could not obtain block: blk_xxx file=/hbase/.logs/boxxxx,60020,xxx/xxx60020.yyy
> => This exception goes up to the final client (HBase). HBase will log it, and we will see:
>
> INFO wal.HLogSplitter: Got while parsing hlog hdfs://namenode:8020//hbase/.logs/boxxxx,60020,xxx/xxx60020.yyy. Marking as corrupted java.io.IOException: Could not obtain block: blk_xxx file=/hbase/.logs/boxxxx,60020,xxx/xxx60020.yyy
>
> 3.2) Typical logs for write issues: the master reads the log, then wants to split it, hence writing a block:
>
> INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/xxx:50010]
>
> => We tried to connect to the dead datanode to write, likely from the master (it does not have a datanode, so it connects to a remote datanode).
> => A region server will not have this type of error, as it connects to a local datanode to write.
> => It failed at the very beginning: we could not connect at all (i.e. the failure is not during the write itself).
>
> INFO hdfs.DFSClient: Abandoning block blk_xxx
>
> => HBase (as an HDFS client) told the namenode that the block has not been written.
>
> INFO hdfs.DFSClient: Excluding datanode xxx:50010
>
> => Internally, the HDFS client stream puts the datanode in its excludedNodes list (the "Excluding datanode" log line).
> => The HDFS client then goes back to the namenode asking for another datanode set proposal, sending the excluded datanode list to be sure it does not try the same nodes again.
> => There are 3 retries by default (setting: dfs.client.block.write.retries). If you have lost 20% of your cluster, roughly 1% of the time (0.2^3) all three attempts will fail. If that happens (i.e. all attempts failed), the next log line is:
>
> WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
>
> And then, if it was for a split log:
>
> FATAL wal.HLogSplitter: WriterThread-xxx Got while writing log entry to log (various possible stacks here)
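The write-path behaviour described in 3.2 boils down to a retry loop around block allocation. Below is a deliberately simplified sketch of that loop; the method names (locateNewPipeline, connectPipeline, abandonBlock) are illustrative, not the real DFSClient internals, and only the shape of the loop and the dfs.client.block.write.retries bound come from the description above.

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the write retry loop described in 3.2; not the actual DFSClient code.
abstract class BlockWriteRetrySketch {
  // Ask the namenode for a pipeline, passing the datanodes to exclude.
  abstract String[] locateNewPipeline(List<String> excluded) throws IOException;
  // Open the stream to the first datanode; this is where the 69000 ms connect timeout can fire.
  abstract void connectPipeline(String[] pipeline) throws IOException;
  // Tell the namenode the block was never written ("Abandoning block blk_xxx").
  abstract void abandonBlock() throws IOException;

  void writeBlock(int maxRetries /* dfs.client.block.write.retries, default 3 */) throws IOException {
    List<String> excluded = new ArrayList<String>();
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      String[] pipeline = locateNewPipeline(excluded);  // namenode proposes a (possibly identical) set
      try {
        connectPipeline(pipeline);  // "Exception in createBlockOutputStream" on failure
        return;                     // success: the pipeline is up
      } catch (IOException e) {
        abandonBlock();
        excluded.add(pipeline[0]);  // "Excluding datanode xxx:50010"
      }
    }
    // All attempts failed: "DataStreamer Exception: java.io.IOException: Unable to create new block."
    throw new IOException("Unable to create new block.");
  }
}
{code}

The master is the usual victim of this loop because it writes to remote datanodes when splitting logs, whereas a region server normally writes to its local datanode.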