hadoop-hdfs-issues mailing list archives

From "Nathan Roberts (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-347) DFS read performance suboptimal when client co-located on nodes with data
Date Wed, 30 Mar 2011 22:57:06 GMT

    [ https://issues.apache.org/jira/browse/HDFS-347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013719#comment-13013719
] 

Nathan Roberts commented on HDFS-347:
-------------------------------------


We have been looking closely at the capability introduced in this Jira because the initial
results look very promising. However, after looking deeper, I’m not convinced this is an
approach that makes the most sense at this time. This Jira is all about getting the maximum
performance when the blocks of a file are on the local node. Obviously performance of this
use case is a critical piece of  “move computation to data”.  However, if going through
the datanode were to offer the same level of performance as going direct at the files, then
this Jira wouldn’t even exist. So, I think it’s really important for us to understand
the performance benefits of going direct and the real root causes of any performance differences
between going direct and having the data flow through the datanode. Once that is well understood,
then I think we can look at the value proposition of this change. We've tried to do some
of this analysis and the results follow; the key word here is "some". I feel we've gathered
enough data to draw some valuable conclusions, but not enough to say this type of approach
wouldn't be worth pursuing down the road.

For the impatient, the paragraphs below can be summarized with the following points:
+ Going through the datanode maintains architectural layering. All other things being equal,
it would be best to avoid exposing the internal details of how the datanode maintains its
data. Violations of this layering could paint us into a corner down the road and therefore
should be avoided.
+ Benchmarked localhost sockets at 650MB/sec (write->read) and 1.6GB/sec (sendfile->read).
nc uses 1K buffers, which probably explains the low bandwidth observed earlier in this Jira.
+ Measured maximum client ingest rate at 280MB/sec for sockets. Checksum calculation seems
to play a big part of this limit. 
+ Measured maximum datanode streaming output rate of 827MB/sec.  
+ Measured maximum datanode random read output rate of 221MB/sec (with hdfs-941). 
+ The maximum client ingest rate of 280MB/sec is significantly slower than the maximum datanode
streaming output rate of 827MB/sec and only marginally faster than the maximum datanode random
output rate of 221MB/sec. This seems to say that with the current bottlenecks, there isn’t
a ton of performance to be gained from going direct, at least not for the simple test cases
used here.

For the detail oriented, keep reading.

If everything in the system were optimized, then going direct would certainly have a
performance advantage (fewer layers means higher top-end performance). However, the questions
are:
+ How much of a performance gain? 
+ Can this gain be realized with existing use cases? 
+ Is the gain worth the layering violations? For example, what if we decided to automatically
merge small blocks into single files? To access that data directly, both the datanode
and the client-side code would have to be cognizant of the new format. Or what if we wanted to
support encrypted content? Or to handle I/O errors differently than they're
handled today? I'm sure there are other cases I'm not thinking of.

Ok, now for some data. 

One of the initial comments talked about the overhead of localhost network connections. The comment
used nc to measure bandwidth through a socket vs bandwidth through a pipe. We looked into
this a little because the result was a bit surprising. Sure enough, on my RHEL5 system I saw pretty
much the same numbers. Digging deeper, the nc on RHEL5 uses a 1K buffer, which can't be good
for throughput. So we ran lmbench on the same system to see what results we would get:
localhost sockets and pipes both came in right around 660MB/sec with 64K block sizes. Pipes
will probably scale up a bit better across more cores, but I would not expect to see the 5x difference
that the original nc experiment showed. We also modified lmbench to use sendfile() instead of
write() in the local socket test and measured that throughput to be 1.6GB/sec.
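
For anyone who wants to reproduce the buffer-size effect without nc or lmbench, here is a
minimal Java sketch (the class name and defaults are mine, purely illustrative) that pushes
bytes through a loopback socket with a configurable buffer size; running it with 1K vs 64K
buffers should show the same kind of gap:

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Illustrative loopback throughput probe (not part of the test harness).
    // A writer thread pushes totalBytes through 127.0.0.1 and the main thread
    // times how fast it can drain them with the given buffer size.
    public class LoopbackThroughput {
      public static void main(String[] args) throws Exception {
        final int bufSize = args.length > 0 ? Integer.parseInt(args[0]) : 64 * 1024;
        final long totalBytes = 2L * 1024 * 1024 * 1024; // 2 GB

        final ServerSocket server = new ServerSocket(0); // ephemeral port
        Thread writer = new Thread(new Runnable() {
          public void run() {
            try {
              Socket s = server.accept();
              OutputStream out = s.getOutputStream();
              byte[] buf = new byte[bufSize];
              long sent = 0;
              while (sent < totalBytes) {
                out.write(buf);
                sent += buf.length;
              }
              s.close();                      // EOF tells the reader we're done
            } catch (Exception e) {
              e.printStackTrace();
            }
          }
        });
        writer.start();

        Socket client = new Socket("127.0.0.1", server.getLocalPort());
        InputStream in = client.getInputStream();
        byte[] buf = new byte[bufSize];
        long received = 0;
        long start = System.nanoTime();
        int n;
        while ((n = in.read(buf)) != -1) {
          received += n;
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d bytes in %.2fs = %.1f MB/sec (buffer=%d)%n",
            received, secs, received / secs / (1024 * 1024), bufSize);
        writer.join();
        client.close();
        server.close();
      }
    }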

CONCLUSION: A localhost socket should be able to move around 650MB/sec for write->read,
and 1.6GB/sec for sendfile->read.

The remaining results involve HDFS. In these tests the blocks being read are all in the kernel
page cache. This was done to remove disk seek latencies from the equation entirely and to
highlight any datanode overheads.  io.file.buffer.size was 64K in all tests. (Todd
measured a 30% improvement using the direct method with checksums enabled. I can't completely
reconcile that improvement with the results below, but I'm wondering if it's due to that
test using the default 4K buffers? I think the results of that test would be consistent
with the results below if that were the case. In any event it would be good to reconcile the
differences at some point.)
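
For reference, the harness setup looked roughly like the following. This is a sketch rather
than the actual test code (the file path is a placeholder); the point is simply that the 64K
buffer is forced through the configuration and the file is read once, untimed, so the block
files end up in the page cache before any measurements:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative sketch: warm the kernel page cache for a test file so later
    // timed reads do not pay disk seek costs, using the 64K client buffer.
    public class WarmCache {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("io.file.buffer.size", 64 * 1024);   // matches the test setting

        FileSystem fs = FileSystem.get(conf);
        Path testFile = new Path(args[0]);               // placeholder path

        byte[] buf = new byte[64 * 1024];
        FSDataInputStream in = fs.open(testFile);
        while (in.read(buf) != -1) {
          // discard; we only want the block files pulled into the page cache
        }
        in.close();
      }
    }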

The next piece of data we wanted was the maximum rate at which the client can ingest data.
The first thing we did was run a simple streaming read. In this case we saw about 280MB/sec.
This is nowhere near 1.6GB/sec, so the bottleneck must be the client and/or the server
(i.e. it's not the pipe). The client process was at 100% CPU, so the bottleneck is probably there.
To verify, we disabled checksum verification on the client; this number went up to 776MB/sec
and client CPU utilization was still 100%, so the bottleneck appears to still be at the client.
This is most likely because the client has to actually copy the data out of
the kernel while the datanode uses sendfile.
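
The streaming measurement itself is nothing more than a timed sequential read through the
FileSystem API. A minimal sketch along those lines (the class name and argument handling are
mine, not the actual harness) looks like this; passing "nochecksum" toggles client-side
verification off for the comparison described above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative sketch: time a streaming read of one HDFS file in 64K chunks.
    // Pass "nochecksum" as the second argument to disable client-side CRC
    // verification for comparison.
    public class StreamingReadBench {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("io.file.buffer.size", 64 * 1024);
        FileSystem fs = FileSystem.get(conf);
        if (args.length > 1 && "nochecksum".equals(args[1])) {
          fs.setVerifyChecksum(false);       // skip checksum verification on the client
        }

        Path p = new Path(args[0]);          // placeholder path
        long fileLen = fs.getFileStatus(p).getLen();

        byte[] buf = new byte[64 * 1024];
        FSDataInputStream in = fs.open(p);
        long start = System.nanoTime();
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
          total += n;
        }
        double secs = (System.nanoTime() - start) / 1e9;
        in.close();
        System.out.printf("read %d of %d bytes: %.1f MB/sec%n",
            total, fileLen, total / secs / (1024 * 1024));
      }
    }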

CONCLUSION: Maximum client ingest rate for a stream is around 280MB/sec. The datanode is capable
of streaming out at least 776MB/sec. Given the current client code, there would not be a significant
advantage to going direct to the file, because checksum calculation and other client overheads
limit the client's ingest rate to 280MB/sec and the datanode is easily capable of sustaining this
rate for streaming reads.

The next thing we wanted to look at was random I/O. There is a lot more overhead on the datanode
for this particular use case, so this could be a place where direct access could really excel.
The first thing we did here was run a simple random read test to again measure the maximum
read throughput. In this case we measured 105MB/sec. Again we tried to eliminate the bottlenecks.
However, it's more complicated in the random read case because it is a request/response
type of protocol. So, first we focused on the datanode. HDFS-941 is a proposed change which
helps the pread use case significantly. The implementation in HDFS-941 seems very reasonable and
looks to be wrapping up very soon. So, we applied the HDFS-941 patch, and this improved the throughput
to 143MB/sec.

This isn't at the 280MB/sec client ceiling yet, so it's still conceivable that going direct could add
a nice boost.

Since this is a request/response protocol, the checksum processing on the client will impact
the overall throughput of random I/O use cases. With checksums disabled, the random I/O throughput
increased from 143MB/sec to 221MB/sec. 
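
The random read numbers above came from a pread-style loop along these lines (again a sketch
with a placeholder path and read count, not the actual harness); the positioned read() call is
what exercises the per-request datanode path that HDFS-941 improves, and checksum verification
can be toggled with setVerifyChecksum() exactly as in the streaming sketch:

    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative sketch: time positioned (pread) reads of 64K chunks at
    // uniformly random offsets within one HDFS file.
    public class RandomReadBench {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("io.file.buffer.size", 64 * 1024);
        FileSystem fs = FileSystem.get(conf);

        Path p = new Path(args[0]);                    // placeholder path
        long fileLen = fs.getFileStatus(p).getLen();   // assumed larger than one chunk
        int chunk = 64 * 1024;
        int reads = 10000;                             // arbitrary for illustration

        byte[] buf = new byte[chunk];
        FSDataInputStream in = fs.open(p);
        Random rand = new Random(0);
        long start = System.nanoTime();
        long total = 0;
        for (int i = 0; i < reads; i++) {
          long pos = (long) (rand.nextDouble() * (fileLen - chunk));
          total += in.read(pos, buf, 0, chunk);        // pread: does not move the stream position
        }
        double secs = (System.nanoTime() - start) / 1e9;
        in.close();
        System.out.printf("%d random reads of %dKB: %.1f MB/sec%n",
            reads, chunk / 1024, total / secs / (1024 * 1024));
      }
    }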

CONCLUSION: A localhost socket maxes out at around 1.6GB/sec; we measured 827MB/sec for no-checksum
streaming reads. The datanode is currently not capable of maxing out a localhost socket.

CONCLUSION: Clients can currently ingest about 280MB/sec. This rate is easily reached with
streaming reads. For random reads, even with HDFS-941, the client's ceiling is somewhat higher
than what the datanode can deliver (280MB/sec vs 221MB/sec), but not dramatically so. Therefore,
for today the right approach seems to be to enhance the datanode so that the bottleneck is
squarely at the client. Since the client bottleneck is mainly due to checksum calculation and
data copies out of the kernel, going direct to a block file shouldn't have a significant impact,
because both of those overheads need to be paid whether going direct or not.
 
The results above are all in terms of single-reader throughput against cached block files. More scalability
testing needs to be performed. We did verify that on a dual quad-core system the datanode
could scale its random read throughput from 137MB/sec to 480MB/sec with 4 readers. This was
enough load to saturate 5 of the 8 cores, with the clients consuming 3 and the datanode consuming
2. It's just one data point; there's lots more work to be done in the area of datanode
scalability.
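
Reproducing the multi-reader point just means running several independent readers at once. One
minimal way to do that from a single JVM (a sketch; the reader count, read count, and path are
placeholders, and the actual measurement was not necessarily run this way) is to drive the
random-read loop above from a small thread pool:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative sketch: run N concurrent random readers against the same
    // datanode and report aggregate throughput. Each reader opens its own stream.
    public class ConcurrentReadBench {
      public static void main(String[] args) throws Exception {
        final int readers = 4;                         // matches the 4-reader data point
        final int chunk = 64 * 1024;
        final int readsPerThread = 10000;              // arbitrary for illustration
        final Configuration conf = new Configuration();
        final Path p = new Path(args[0]);              // placeholder path

        ExecutorService pool = Executors.newFixedThreadPool(readers);
        List<Callable<Long>> tasks = new ArrayList<Callable<Long>>();
        for (int t = 0; t < readers; t++) {
          tasks.add(new Callable<Long>() {
            public Long call() throws Exception {
              FileSystem fs = FileSystem.get(conf);
              long fileLen = fs.getFileStatus(p).getLen();
              FSDataInputStream in = fs.open(p);
              byte[] buf = new byte[chunk];
              Random rand = new Random();
              long total = 0;
              for (int i = 0; i < readsPerThread; i++) {
                long pos = (long) (rand.nextDouble() * (fileLen - chunk));
                total += in.read(pos, buf, 0, chunk);  // positioned read, as above
              }
              in.close();
              return total;
            }
          });
        }

        long start = System.nanoTime();
        long total = 0;
        for (Future<Long> f : pool.invokeAll(tasks)) { // blocks until all readers finish
          total += f.get();
        }
        double secs = (System.nanoTime() - start) / 1e9;
        pool.shutdown();
        System.out.printf("%d concurrent readers: %.1f MB/sec aggregate%n",
            readers, total / secs / (1024 * 1024));
      }
    }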

Latency is also a critical attribute of the datanode, and some more data needs to be gathered
in this area. However, I propose we focus on fixing any contention/latency issues within the
datanode before diving into a direct I/O sort of approach (there are already a few Jiras
out there aimed at improving concurrency within the datanode). If we can't
get anywhere near the latency requirements, then at that point we should consider more efficient
ways of getting at the data.

Thanks to Kihwal Lee and Dave Thompson for doing a significant amount of data gathering! Gathering
this type of data always seems to take longer than one would think, so thank you for the efforts.

> DFS read performance suboptimal when client co-located on nodes with data
> -------------------------------------------------------------------------
>
>                 Key: HDFS-347
>                 URL: https://issues.apache.org/jira/browse/HDFS-347
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: George Porter
>            Assignee: Todd Lipcon
>         Attachments: BlockReaderLocal1.txt, HADOOP-4801.1.patch, HADOOP-4801.2.patch,
HADOOP-4801.3.patch, HDFS-347-branch-20-append.txt, all.tsv, hdfs-347.png, hdfs-347.txt, local-reads-doc
>
>
> One of the major strategies Hadoop uses to get scalable data processing is to move the
code to the data.  However, putting the DFS client on the same physical node as the data blocks
it acts on doesn't improve read performance as much as expected.
> After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem is due
to the HDFS streaming protocol causing many more read I/O operations (iops) than necessary.
 Consider the case of a DFSClient fetching a 64 MB disk block from the DataNode process (running
in a separate JVM) running on the same machine.  The DataNode will satisfy the single disk
block request by sending data back to the HDFS client in 64-KB chunks.  In BlockSender.java,
this is done in the sendChunk() method, relying on Java's transferTo() method.  Depending
on the host O/S and JVM implementation, transferTo() is implemented as either a sendfilev()
syscall or a pair of mmap() and write().  In either case, each chunk is read from the disk
with a separate I/O operation.  The result is that the single request
for a 64-MB block ends up hitting the disk as over a thousand smaller requests of 64 KB each.
> Since the DFSClient runs in a different JVM and process than the DataNode, shuttling
data from the disk to the DFSClient also results in context switches each time network packets
get sent (in this case, the 64-kb chunk turns into a large number of 1500 byte packet send
operations).  Thus we see a large number of context switches for each block send operation.
> I'd like to get some feedback on the best way to address this, but I think the right approach
is to provide a mechanism for a DFSClient to directly open data blocks that happen to be on the same machine.
 It could do this by examining the set of LocatedBlocks returned by the NameNode, marking
those that should be resident on the local host.  Since the DataNode and DFSClient (probably)
share the same hadoop configuration, the DFSClient should be able to find the files holding
the block data, and it could directly open them and send data back to the client.  This would
avoid the context switches imposed by the network layer, and would allow for much larger read
buffers than 64KB, which should reduce the number of iops imposed by each read block operation.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
