hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-347) DFS read performance suboptimal when client co-located on nodes with data
Date Thu, 15 Nov 2012 23:01:19 GMT

    [ https://issues.apache.org/jira/browse/HDFS-347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498420#comment-13498420 ]

Colin Patrick McCabe commented on HDFS-347:
-------------------------------------------

Here are some benchmarks I did locally on a one-node cluster.  I did these to confirm that
there are no performance regressions with the new implementation.
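
For reference, the four configurations below just toggle the short-circuit keys on the client side.  A
minimal sketch of flipping the same keys programmatically (the benchmark runs themselves set them in the
cluster config files; this is only meant to make the settings concrete):

{code:java}
// Sketch only: the property names are the ones quoted in this comment; the
// read loop approximates what "hadoop fs -cat /1g >/dev/null" does.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShortCircuitReadCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("dfs.client.read.shortcircuit", true);              // read local replicas directly
    conf.setBoolean("dfs.client.read.shortcircuit.skip.checksum", false);

    FileSystem fs = FileSystem.get(conf);
    byte[] buf = new byte[4 * 1024 * 1024];
    try (FSDataInputStream in = fs.open(new Path("/1g"))) {
      while (in.read(buf) > 0) {
        // discard the data; we only care about read throughput
      }
    }
  }
}
{code}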

With HDFS-347 and {{dfs.client.read.shortcircuit}} = true and {{dfs.client.read.shortcircuit.skip.checksum}}
= false:

cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g
/1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
7.46user 3.38system 0:09.50elapsed 114%CPU (0avgtext+0avgdata 423200maxresident)k
0inputs+104outputs (0major+25697minor)pagefaults 0swaps
cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g
/1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
7.39user 3.37system 0:09.43elapsed 114%CPU (0avgtext+0avgdata 430352maxresident)k
0inputs+144outputs (0major+24399minor)pagefaults 0swaps
cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g
/1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
7.41user 3.39system 0:09.51elapsed 113%CPU (0avgtext+0avgdata 439536maxresident)k
0inputs+144outputs (0major+25609minor)pagefaults 0swaps
=========================================
With unmodified trunk and {{dfs.client.read.shortcircuit}} = true and {{dfs.client.read.shortcircuit.skip.checksum}}
= false, and {{dfs.block.local-path-access.user}} = cmccabe:

cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g
/1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
7.60user 3.58system 0:09.89elapsed 113%CPU (0avgtext+0avgdata 444848maxresident)k
0inputs+64outputs (0major+25903minor)pagefaults 0swaps
cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g
/1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
7.65user 3.44system 0:09.57elapsed 115%CPU (0avgtext+0avgdata 443824maxresident)k
0inputs+64outputs (0major+24054minor)pagefaults 0swaps
cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g
/1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
7.50user 3.43system 0:09.42elapsed 116%CPU (0avgtext+0avgdata 422624maxresident)k
0inputs+64outputs (0major+25918minor)pagefaults 0swaps
=========================================
With HDFS-347 and {{dfs.client.read.shortcircuit}} = false:

cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g
/1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
10.15user 8.83system 0:17.88elapsed 106%CPU (0avgtext+0avgdata 412512maxresident)k
0inputs+224outputs (0major+24449minor)pagefaults 0swaps
cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g
/1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
10.19user 8.55system 0:17.23elapsed 108%CPU (0avgtext+0avgdata 449248maxresident)k
0inputs+184outputs (0major+24109minor)pagefaults 0swaps
cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g
/1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
10.24user 8.38system 0:17.16elapsed 108%CPU (0avgtext+0avgdata 439568maxresident)k
0inputs+144outputs (0major+23957minor)pagefaults 0swaps
=========================================
With unmodified trunk and {{dfs.client.read.shortcircuit}} = false:

cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g
/1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
10.76user 8.64system 0:18.18elapsed 106%CPU (0avgtext+0avgdata 483872maxresident)k
0inputs+64outputs (0major+28735minor)pagefaults 0swaps
cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g
/1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
10.59user 8.54system 0:17.46elapsed 109%CPU (0avgtext+0avgdata 491216maxresident)k
0inputs+64outputs (0major+27868minor)pagefaults 0swaps
cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g
/1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
9.81user 8.95system 0:17.24elapsed 108%CPU (0avgtext+0avgdata 422144maxresident)k
0inputs+64outputs (0major+25726minor)pagefaults 0swaps
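
Averaging the three elapsed times in each configuration above: HDFS-347 with short-circuit comes to
about 9.48s, trunk with short-circuit about 9.63s, HDFS-347 without short-circuit about 17.42s, and
trunk without short-circuit about 17.63s.  So the new implementation is at parity with (or marginally
faster than) the existing one in both modes, and short-circuit reads remain roughly 1.8x faster than
going through the DataNode.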

                
> DFS read performance suboptimal when client co-located on nodes with data
> -------------------------------------------------------------------------
>
>                 Key: HDFS-347
>                 URL: https://issues.apache.org/jira/browse/HDFS-347
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node, hdfs client, performance
>            Reporter: George Porter
>            Assignee: Colin Patrick McCabe
>         Attachments: all.tsv, BlockReaderLocal1.txt, HADOOP-4801.1.patch, HADOOP-4801.2.patch,
HADOOP-4801.3.patch, HDFS-347-016_cleaned.patch, HDFS-347.016.patch, HDFS-347.017.clean.patch,
HDFS-347.017.patch, HDFS-347.018.clean.patch, HDFS-347.018.patch2, HDFS-347.019.patch, HDFS-347.020.patch,
HDFS-347.021.patch, HDFS-347.022.patch, HDFS-347.024.patch, HDFS-347.025.patch, HDFS-347-branch-20-append.txt,
hdfs-347.png, hdfs-347.txt, local-reads-doc
>
>
> One of the major strategies Hadoop uses to get scalable data processing is to move the
code to the data.  However, putting the DFS client on the same physical node as the data blocks
it acts on doesn't improve read performance as much as expected.
> After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem is due
to the HDFS streaming protocol causing many more read I/O operations (iops) than necessary.
 Consider the case of a DFSClient fetching a 64 MB disk block from the DataNode process (running
in a separate JVM) running on the same machine.  The DataNode will satisfy the single disk
block request by sending data back to the HDFS client in 64-KB chunks.  In BlockSender.java,
this is done in the sendChunk() method, relying on Java's transferTo() method.  Depending
on the host O/S and JVM implementation, transferTo() is implemented as either a sendfilev()
syscall or a pair of mmap() and write().  In either case, each chunk is read from the disk with
its own I/O operation, so the single request for a 64-MB block ends up hitting the disk as over
a thousand smaller requests of 64 KB each.
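
(As a rough illustration of the chunked send path described above, and not the actual BlockSender
code: one transferTo() call per 64-KB chunk looks roughly like this.)

{code:java}
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

class ChunkedSendSketch {
  // Illustrative only: the real path is BlockSender.sendChunk(), but the shape
  // is the same -- one transferTo() call per 64-KB chunk of the block file.
  static void sendBlock(FileChannel blockFile, WritableByteChannel peer) throws Exception {
    final long CHUNK = 64 * 1024;
    long pos = 0;
    long size = blockFile.size();
    while (pos < size) {
      // Each call can become its own sendfile() or mmap()+write(), so a 64-MB
      // block turns into on the order of a thousand small disk requests.
      long sent = blockFile.transferTo(pos, Math.min(CHUNK, size - pos), peer);
      if (sent <= 0) break;
      pos += sent;
    }
  }
}
{code}
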
> Since the DFSClient runs in a different JVM and process from the DataNode, shuttling
data from the disk to the DFSClient also incurs context switches each time network packets
get sent (in this case, each 64-KB chunk turns into a large number of 1500-byte packet send
operations).  Thus we see a large number of context switches for each block send operation.
> I'd like to get some feedback on the best way to address this, but I think the answer is to
provide a mechanism for a DFSClient to directly open data blocks that happen to be on the same
machine.  It could do this by examining the set of LocatedBlocks returned by the NameNode and
marking those that should be resident on the local host.  Since the DataNode and DFSClient
(probably) share the same Hadoop configuration, the DFSClient should be able to find the files
holding the block data and read them directly.  This would avoid the context switches imposed by
the network layer, and would allow for much larger read buffers than 64 KB, which should reduce
the number of iops imposed by each block read operation.
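
(As a rough sketch of the proposed local read path: how the DFSClient maps a local LocatedBlock to
a file path is exactly the open question here, so "blockFile" below is hypothetical; the point is
that the read itself becomes an ordinary file read with a large buffer and no DataNode socket in
between.)

{code:java}
import java.io.File;
import java.io.FileInputStream;

class LocalBlockReadSketch {
  // "blockFile" stands in for whatever path the DFSClient would derive from a
  // local LocatedBlock; only the read loop is sketched here.
  static long readLocalBlock(File blockFile) throws Exception {
    byte[] buf = new byte[4 * 1024 * 1024];    // much larger than the 64-KB chunks
    long total = 0;
    try (FileInputStream in = new FileInputStream(blockFile)) {
      int n;
      while ((n = in.read(buf)) > 0) {
        total += n;                            // no socket writes, no extra context switches
      }
    }
    return total;
  }
}
{code}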

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
