Date: Tue, 2 Oct 2012 08:01:08 +1100 (NCT)
From: "Colin Patrick McCabe (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-347) DFS read performance suboptimal when client co-located on nodes with data

     [ https://issues.apache.org/jira/browse/HDFS-347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Colin Patrick McCabe updated HDFS-347:
--------------------------------------

    Attachment: HDFS-347-016_cleaned.patch

This patch only includes HDFS-347.

* DataChecksum#newDataChecksum: correctly handle offset values other than 0.

* BlockReader / BlockReaderUtil: add skipFully and available methods, and add JavaDoc for the skip method. The available method returns a rough approximation of how much data might be available without doing any more network I/O. This helps us optimize the case where we are reading from a local file descriptor, since we never do network I/O in that case.

* BlockReaderLocal: a simpler implementation that uses raw FileChannel objects. We don't need to cache anything or make RPCs to the DataNode.

* DFSClient / DFSInputStream: update getLocalBlockReader to work with file descriptor passing. Rather than overloading AccessControlException to mean "local reads were not enabled," create a new exception called LocalReadsDisabledException and throw it when that is the case. This will prevent confusion in the future. Use skipFully instead of skip, since the latter may give us short skips (see the sketch after this list).

* DFSConfigKeys: dfs.block.local-path-access.user is no longer needed. Local reads are now enabled by default rather than disabled by default.

* RPC changes: add BlockLocalFdInfo and deprecate BlockLocalPathInfo. Implement the corresponding DataNode, FsDatasetImpl, etc. methods. Add GetBlockLocalFdInfoResponseProto. The old RPC is now deprecated and always throws an AccessControlException, so that older clients fall back to remote reads.

* MiniDFSCluster: add getBlockMetadataFile, which is like getBlockFile except that it returns .meta files.

* Tests: TestBlockReaderLocal now includes more tests of BlockReaderLocal in isolation. TestParallelRead now explicitly disables local reads (that case is covered by TestParallelLocalRead). TestShortCircuitLocalRead: add testDeprecatedGetBlockLocalPathInfoRpc to exercise the deprecated RPC.
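As a quick illustration of why skipFully matters (this is not code from the attached patch; the class and method names are made up for the example), InputStream#skip, like BlockReader#skip, is allowed to skip fewer bytes than requested, so an exact reposition has to loop until the requested count is consumed:

{code:java}
import java.io.IOException;
import java.io.InputStream;

public final class SkipFullyExample {
  /** Skip exactly n bytes, looping because skip() may return short counts. */
  public static void skipFully(InputStream in, long n) throws IOException {
    long remaining = n;
    while (remaining > 0) {
      long skipped = in.skip(remaining);
      if (skipped > 0) {
        remaining -= skipped;
        continue;
      }
      // skip() may return 0 without having hit EOF; read one byte to tell
      // "no progress yet" apart from "end of stream".
      if (in.read() < 0) {
        throw new IOException("Premature EOF with " + remaining
            + " byte(s) left to skip");
      }
      remaining--;
    }
  }
}
{code}

A bare skip() call can silently land short of the target offset, which is the failure mode the DFSInputStream change avoids.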
> DFS read performance suboptimal when client co-located on nodes with data
> --------------------------------------------------------------------------
>
>                 Key: HDFS-347
>                 URL: https://issues.apache.org/jira/browse/HDFS-347
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node, hdfs client, performance
>            Reporter: George Porter
>            Assignee: Todd Lipcon
>         Attachments: all.tsv, BlockReaderLocal1.txt, HADOOP-4801.1.patch, HADOOP-4801.2.patch, HADOOP-4801.3.patch, HDFS-347-016_cleaned.patch, HDFS-347.016.patch, HDFS-347-branch-20-append.txt, hdfs-347.png, hdfs-347.txt, local-reads-doc
>
>
> One of the major strategies Hadoop uses to achieve scalable data processing is to move the code to the data. However, putting the DFS client on the same physical node as the data blocks it acts on doesn't improve read performance as much as expected.
>
> After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem is that the HDFS streaming protocol causes many more read I/O operations (iops) than necessary. Consider the case of a DFSClient fetching a 64 MB disk block from the DataNode process (running in a separate JVM) on the same machine. The DataNode satisfies the single disk block request by sending data back to the HDFS client in 64 KB chunks. In BlockSender.java, this is done in the sendChunk() method, relying on Java's transferTo() method. Depending on the host O/S and JVM implementation, transferTo() is implemented as either a sendfilev() syscall or a pair of mmap() and write() calls. In either case, each chunk is read from disk with a separate I/O operation, so the single request for a 64 MB block ends up hitting the disk as over a thousand smaller requests of 64 KB each.
>
> Since the DFSClient runs in a different JVM and process from the DataNode, shuttling data from the disk to the DFSClient also incurs context switches each time network packets are sent (in this case, each 64 KB chunk turns into a large number of 1500-byte packet send operations). Thus we see a large number of context switches for each block send operation.
>
> I'd like to get some feedback on the best way to address this, but I think the right approach is to provide a mechanism for a DFSClient to directly open data blocks that happen to be on the same machine. It could do this by examining the set of LocatedBlocks returned by the NameNode and marking those that should be resident on the local host. Since the DataNode and DFSClient (probably) share the same Hadoop configuration, the DFSClient should be able to find the files holding the block data, open them directly, and read the data itself. This would avoid the context switches imposed by the network layer and would allow for much larger read buffers than 64 KB, which should reduce the number of iops imposed by each block read operation.
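For readers unfamiliar with the streaming path described in the quoted report, the sketch below is a rough, simplified illustration (not the actual BlockSender code; all names are made up) of sending a block to a client in 64 KB chunks via FileChannel#transferTo. Each iteration becomes a separate sendfilev() or mmap()+write() in the kernel, which is how a single 64 MB block request turns into over a thousand small disk reads:

{code:java}
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

public class ChunkedSendSketch {
  private static final int CHUNK_SIZE = 64 * 1024;   // 64 KB per chunk

  /** Stream an entire block to the client, one 64 KB transferTo() at a time. */
  public static void sendBlock(FileInputStream blockFile,
                               WritableByteChannel out,
                               long blockLength) throws IOException {
    FileChannel ch = blockFile.getChannel();
    long pos = 0;
    while (pos < blockLength) {
      long toSend = Math.min(CHUNK_SIZE, blockLength - pos);
      // Each call is a separate kernel operation (sendfilev() or
      // mmap()+write(), depending on the JVM/OS) and a separate disk I/O.
      long sent = ch.transferTo(pos, toSend, out);
      if (sent <= 0) {
        throw new IOException("transferTo made no progress at offset " + pos);
      }
      pos += sent;
    }
  }
}
{code}

Letting the DFSClient open the local block file directly, as proposed in the report and implemented in the attached patch via file descriptor passing, removes the per-chunk copies and context switches and allows much larger read buffers than 64 KB.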