hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-347) DFS read performance suboptimal when client co-located on nodes with data
Date Tue, 13 Oct 2009 05:50:31 GMT

    [ https://issues.apache.org/jira/browse/HDFS-347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764957#action_12764957 ]

Todd Lipcon commented on HDFS-347:

bq. Filed HADOOP-3205 quite some time back. It would let BlockReader avoid a copy as well.

Cool, I'll take a look at that. I agree that it should help performance.

bq. Is this a temporary hack? Client sees non-loopback address even if the datanode is local.

Yep - see the "work remaining" above. This was good enough for testing on my pseudo-distributed
cluster, but we'll need something fancier for the real deal. I think your trick about connecting
and then looking at the addresses may be just right - very clever!

bq. we could use "dfsclient_some_rand_str_with_blockid".

Yea, I thought I had a TODO in there. If not, I meant to :) We should have some kind of random
string (or an autoincrementing static field in DFSClient). Making the datanode listen still
seems susceptible to some kind of impostor attack.
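A minimal sketch of that naming idea (the class and method names here are hypothetical, not
HDFS APIs): combining a random per-process string with an autoincrementing counter means two
DFSClients on the same machine can never produce the same name, even for the same block.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

class ClientName {
    // One random string per process, plus a counter shared by all threads.
    private static final String PROCESS_ID = UUID.randomUUID().toString();
    private static final AtomicLong COUNTER = new AtomicLong();

    /** Builds a name like "dfsclient_<uuid>_<n>_<blockId>". */
    static String forBlock(long blockId) {
        return "dfsclient_" + PROCESS_ID + "_"
                + COUNTER.getAndIncrement() + "_" + blockId;
    }
}
```

Even if an impostor guesses the prefix, it cannot predict both the per-process UUID and the
counter value of a request in flight.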

bq. since random read bottleneck is connection latency and disk seeks, why do you think this
improves random read performance?

It currently doesn't. I anticipate adding some kind of call like "boolean canSeekToBlockOffset(long
pos)" to the BlockReader interface. In the case of LocalBlockReader, it can seek "for free"
within the already-open file descriptor with zero latency beyond what the IO itself might
cost. I was planning on adding that in another JIRA, but could certainly add it here too.
For the existing BlockReader, we can return true for the case when the target position is
within the TCP window -- this is an optimization currently in DFSInputStream that should move
into BlockReader.
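As a rough sketch of what such an interface could look like (the names follow the comment
above; they are hypothetical, not shipped Hadoop code):

```java
interface BlockReader {
    /** Whether a seek to pos is cheap enough to do in place. */
    boolean canSeekToBlockOffset(long pos);
}

/** Local reader: seeking within an already-open file descriptor is essentially free. */
class LocalBlockReader implements BlockReader {
    private final long blockLength;

    LocalBlockReader(long blockLength) {
        this.blockLength = blockLength;
    }

    @Override
    public boolean canSeekToBlockOffset(long pos) {
        // Any in-range offset is reachable with a plain lseek() on the open fd.
        return pos >= 0 && pos < blockLength;
    }
}

/** Remote reader: seeking is cheap only if the bytes are already in flight. */
class RemoteBlockReader implements BlockReader {
    private final long currentPos;
    private final long tcpWindowSize;

    RemoteBlockReader(long currentPos, long tcpWindowSize) {
        this.currentPos = currentPos;
        this.tcpWindowSize = tcpWindowSize;
    }

    @Override
    public boolean canSeekToBlockOffset(long pos) {
        // Forward seeks within the TCP window can discard buffered bytes
        // instead of tearing down and re-establishing the connection.
        long skip = pos - currentPos;
        return skip >= 0 && skip <= tcpWindowSize;
    }
}
```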

bq. does a typical HBase installation do predominantly local reads?

I'm not sure about this - I think some of the HBase guys are watching this ticket. It seems
to me, though, that it shouldn't be hard to convince HBase region servers to match up with
local data.

bq. I suspect reporting bytes read by clients might be an important issue to fix properly.

+1. The improper-termination case is probably impossible to handle, but an honesty-based
policy would work, as long as we figure clients have nothing to gain by lying.

> DFS read performance suboptimal when client co-located on nodes with data
> -------------------------------------------------------------------------
>                 Key: HDFS-347
>                 URL: https://issues.apache.org/jira/browse/HDFS-347
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: George Porter
>            Assignee: Todd Lipcon
>         Attachments: HADOOP-4801.1.patch, HADOOP-4801.2.patch, HADOOP-4801.3.patch, hdfs-347.txt,
> One of the major strategies Hadoop uses to get scalable data processing is to move the
code to the data.  However, putting the DFS client on the same physical node as the data blocks
it acts on doesn't improve read performance as much as expected.
> After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem is due
to the HDFS streaming protocol causing many more read I/O operations (iops) than necessary.
 Consider the case of a DFSClient fetching a 64 MB disk block from the DataNode process (running
in a separate JVM) running on the same machine.  The DataNode will satisfy the single disk
block request by sending data back to the HDFS client in 64-KB chunks.  In BlockSender.java,
this is done in the sendChunk() method, relying on Java's transferTo() method.  Depending
on the host O/S and JVM implementation, transferTo() is implemented as either a sendfilev()
syscall or a pair of mmap() and write().  In either case, each chunk is read from the disk
with a separate I/O operation.  The result is that the single request
for a 64-MB block ends up hitting the disk as over a thousand smaller 64-KB requests.
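A simplified illustration of the chunking described above (not the actual BlockSender code):
streaming a block in fixed 64-KB pieces means one transferTo() call, and hence at least one
I/O operation, per chunk.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

class ChunkedSender {
    static final int CHUNK_SIZE = 64 * 1024;          // 64 KB per wire chunk

    /** Sends one block in CHUNK_SIZE pieces; returns the number of transferTo() calls. */
    static long sendBlock(FileChannel block, WritableByteChannel out) throws IOException {
        long sent = 0, calls = 0;
        long size = block.size();
        while (sent < size) {
            // One syscall (sendfilev, or mmap+write) per chunk, per the analysis above.
            long n = block.transferTo(sent, Math.min(CHUNK_SIZE, size - sent), out);
            if (n <= 0) break;   // avoid spinning if the channel stalls
            sent += n;
            calls++;
        }
        return calls;
    }

    /** How many chunk-sized requests a whole block turns into. */
    static long chunksPerBlock(long blockSize, int chunkSize) {
        return (blockSize + chunkSize - 1) / chunkSize;
    }
}
```

For a 64-MB block and 64-KB chunks, `chunksPerBlock` comes out to 1024 separate requests.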
> Since the DFSClient runs in a different JVM and process than the DataNode, shuttling
data from the disk to the DFSClient also results in context switches each time network packets
get sent (in this case, the 64-KB chunk turns into a large number of 1500-byte packet send
operations).  Thus we see a large number of context switches for each block send operation.
> I'd like to get some feedback on the best way to address this, but I think the answer is
providing a mechanism for the DFSClient to directly open data blocks that happen to be on the
same machine.
 It could do this by examining the set of LocatedBlocks returned by the NameNode, marking
those that should be resident on the local host.  Since the DataNode and DFSClient (probably)
share the same Hadoop configuration, the DFSClient should be able to find the files holding
the block data and read them directly.  This would avoid the context switches imposed by the
network layer, and would allow much larger read buffers than 64 KB, which should reduce the
number of iops imposed by each block read operation.
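A hedged sketch of the proposed short-circuit path (the buffer size and helper name are
illustrative, not anything HDFS ships): once the client has located a block file on the local
disk, it can read it with a buffer far larger than the 64-KB wire chunks.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class LocalBlockRead {
    // Much larger than the 64-KB wire chunks, so far fewer iops per block.
    static final int LOCAL_BUFFER = 4 * 1024 * 1024;

    /** Reads a block file straight from local disk, bypassing the DataNode entirely. */
    static long readLocalBlock(Path blockFile) throws IOException {
        long total = 0;
        byte[] buf = new byte[LOCAL_BUFFER];
        try (InputStream in = Files.newInputStream(blockFile)) {
            int n;
            while ((n = in.read(buf)) > 0) {
                total += n;   // a real client would hand buf to the caller here
            }
        }
        return total;
    }
}
```

Because the read happens in the client's own process, there are no per-packet context
switches and no copies through the DataNode's JVM.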

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
