hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HDFS-347) DFS read performance suboptimal when client co-located on nodes with data
Date Wed, 07 Oct 2009 06:58:32 GMT

     [ https://issues.apache.org/jira/browse/HDFS-347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Todd Lipcon updated HDFS-347:

    Attachment: local-reads-doc

Attaching v1 of a design document for this feature. This does not include a test plan - that
will follow once implementation has gone a bit further. Pasting the design doc below as well:


h1. Design Document: Local Read Optimization

h2. Problem Definition

Currently, when the DFS Client is located on the same physical node as the DataNode serving
the data, it does not use this knowledge to its advantage. All blocks are read through the
same protocol based on a TCP connection. Early experimentation has shown that this has a 20-30%
overhead when compared with reading the block files directly off the local disk.

This JIRA seeks to improve the performance of node-local reads by providing a fast path that
is enabled in this case. This case is very common, especially in the context of MapReduce
jobs where tasks are scheduled local to their data.

Although writes are likely to see an improvement here too, this JIRA will focus only on the
read path. The write path is significantly more complicated due to write pipeline recovery,
append support, etc. Additionally, the write path will still have to go over TCP to the non-local
replicas, so the throughput improvements will probably not be as marked.

h2. Use Cases

# As mentioned above, the majority of data read during a MapReduce job tends to be from local
datanodes. This optimization should improve MapReduce performance of read-constrained jobs
# Random reads should see a significant performance benefit with this patch as well. Applications
such as the HBase Region Server should see a very large improvement.

Users will not have to make any specific changes to use the performance improvement - the
optimization should be transparent and retain all existing semantics.

h2. Interaction with Current System

This behavior needs modifications in two areas:

h3. DataNode

The datanode needs to be extended to provide access to the local block storage to the reading

h3. DFSInputStream

DFSInputStream needs to be extended in order to enable the fast read path when reading from
local datanodes.

h2. Requirements

h3. Unix Domain Sockets via JNI

In order to maintain security, we cannot simply have the reader access blocks through the
local filesystem. The reader may be running as an arbitrary user ID, and we should not require
world-readable permissions on the block storage.

Unix domain sockets offer the ability to transport already-open file descriptors from one
peer to another using the "ancillary data" construct and the sendmsg(2) system call. This
ability is documented in unix(7) under the SCM_RIGHTS section.

Unix domain sockets are unfortunately not available in Java. We will need to employ JNI to
access the appropriate system calls.

h3. Modify DFSClient/DataNode interaction

The DFS Client will need to be able to initiate the fast path read when it detects it is connecting
to a local DataNode. The DataNode needs to respond to this request by providing the appropriate
file descriptors or by reverting to the normal slow path if the functionality has been administratively
disabled, etc.

h2. Design

h3. Unix Domain Sockets in Java

The Android open source project currently includes support for Unix Domain Sockets in the
android.net package. It also includes the native JNI code to implement these classes. Android
is Apache 2.0 licensed and thus we can freely use the code in Hadoop.

The Android project relies on a lot of custom build infrastructure and utility functions.
In order to reduce our dependencies, we will copy the appropriate classes into a new org.apache.hadoop.net.unix
package. We will include the appropriate JNI code in the existing libhadoop library. If HADOOP-4998
(native runtime library for Hadoop) progresses in the near term, we could include this functionality

The JNI code needs small modifications to work properly in the Hadoop build system without
pulling in a large number of Android dependencies.

h3. Fast path initiation

When DFSInputStream is connecting to a node, it can determine whether that node is local by
simply inspecting the IP address. In the event that it is a local host and the fast path has
not been prohibited by the Configuration, the fast path will be initiated. The fast path is
simply a different BlockReader implementation.

h3. Fast path interface

BlockReader will become an interface, with the current implementation being renamed to RemoteBlockReader.
The fast-path for local reads will be a LocalBlockReader, which is instantiated after it has
been determined that the target datanode is local.

h3. Fast path mechanism

Currently, when the DFSInputStream connects to the DataNode, it sends OP_READ_BLOCK, including
the access token, block id, etc. Instead, when the fast path is desired, the client will take
the following steps:

# Opens a unix socket listening in the in-memory socket namespace. The socket's name will
be identical to the clientName already available in the input stream, plus a unique ID for
this specific input stream (so that parallel local readers function without collision).
# Sends a new opcode OP_CONNECT_UNIX. This operation takes the same parameters as OP_READ_BLOCK,
but indicates to the datanode that the client is looking for a local connection.
# The datanode performs the same access token and block validity checks as it currently does
for the OP_READ_BLOCk case. Thus the security model of the current implementation is retained.
# If the datanode refuses for any reason, it responds over the block transceiver protocol
with the same error mechanism as the current approach. If the checks pass:
## DN connects to the client via the unix socket.
## DN opens the block data file and block metadata file
## DN extracts the FileDescriptor objects from these InputStreams, and sends them as ancillary
data on the unix domain socket. It then closes its side of the unix domain socket.
## DN sends an "OK" response via the TCP socket.
## If any error happens during this process, it sends back an error response.
# On the client side, if an error response is received from the OP_CONNECT_UNIX request, the
client will mark a flag indicating that it should no longer try the fast path, and then fall
back to the existing BlockReader.
# If an OK response is received, the client constructs a LocalBlockReader (LBR).
## The LBR reads from the unix domain socket to receive the block data and metadata file descriptors.
## At this point, both the TCP socket and the unix socket can be closed; the file descriptors
remain valid once they have been received despite any closed sockets.
## The LBR then provides the BlockReader interface by simply calling seek(), read(), etc,
on an input stream constructed from these file descriptors.
## Some refactoring may occur here to try to share checksum verification code between the
LocalBlockReader and RemoteBlockReader.

The reason for the connect-back protocol rather than having the datanode simply listen on
a unix socket is to simplify the integration path. In order to listen on a socket, the datanode
would need an additional thread to spawn off transceivers. Additionally, it allows for a way
to verify that the client is in fact reading from the datanode on the target host/port without
relying on some conventional socket path.

h3. DFS Read semantics clarification

Before embarking on the above, the DFS Read semantics should be clarified. The error handling
and retry semantics in the current code are quite unclear. For example, there is significant
discussion in HDFS-127 that indicates a lot of confusion about proper behavior.

Although the work is orthogonal to this patch, it will be quite beneficial to nail down the
semantics of the existing implementation before attempting to add onto it. I propose this
work be done in a separate JIRA concurrently with discussion on this one, with the two pieces
of work to be committed together if possible. This will keep the discussion here on-point
and avoid digression into discussion of existing problems like HDFS-127.

h2. Failure Analysis

As described above, if any failure or exception occurs during the establishment of the fast
path, the system will simply fall back to the existing slow path.

One issue that is currently unclear is how to handle IOExceptions on the underlying blocks
when the read is being performed by the client. See *Work Remaining* below.

h2. Security Analysis

Since the block open() call is still being performed by the datanode, there is no loss of
security with this patch. AccessToken checking is performed by the datanode in the same manner
as currently exists. Since the blocks can be opened read-only, the recipient of the file descriptors
cannot perform unwanted modification.

h2. Platform Support

Passing file descriptors over Unix Domain Sockets is supported on Linux, BSD, and Solaris.
There may be some differences in the different implementations. The first version of this
JIRA should target Linux only, and automatically disable itself on platforms where it will
not function correctly. Since this is an optimization and not a new feature (the slow path
will continue to be supported) I believe this is OK.

h2. Work already completed

h3. Initial experimentation

The early work in HDFS-347 indicated that the performance improvements of this patch will
be substantial. The experimentation modified the BlockReader to "cheat" and simply open the
stored blocks with standard file APIs, which had been chmodded world readable. This improved
read of a 1GB from 8.7 seconds to 5.3 seconds, and improved random IO performance by a factor
of more than 30.

h3. Local Sockets and JNI Library

I have already ported the local sockets JNI code from the Android project into a local branch
of the Hadoop code base, and written simple unit tests to verify its operation. The JNI code
compiles as part of libhadoop, and the Java side uses the existing NativeCodeLoader class.
These patches will become part of the Common project.

h3. DFSInputStream refactoring

To aid in testing and understanding of the code, I have refactored DFSInputStream to be a
standalone class instead of an inner class of DFSClient. Additionally, I have converted BlockReader
to an interface and renamed BlockReader to RemoteBlockReader. In the process I also refactored
the contents of DFSInputStream to clarify the failure and retry semantics. This work should
be migrated to another JIRA as mentioned above.

h3. Fast path initiation and basic operation

I have implemented the algorithm as described above and added new unit tests to verify operation.
Basic unit tests are currently passing using the fast path reads.

h2. Work Remaining / Open Questions

h3. Checksums

The current implementation of LocalBlockReader does not verify checksums. Thus, some unit
tests are not passing. Some refactoring will probably need to be done to share the checksum
verification code between LocalBlockReader and RemoteBlockReader.

h3. IOException handling

Given that the reads are now occuring directly from the client, we should investigate whether
we need to add any mechanism for the client to report errors back to the DFS. The client can
still report checksum errors in the existing mechanism, but we may need to add some method
by which it can report IO Errors (e.g. due to a failing volume). I do not know the current
state of volume error tracking in the datanode; some guidance here would be appreciated.

h3. Interaction with other features (e.g. Append)

We should investigate whether (and how) this feature will interact with other ongoing work,
in particular appends. If there is any complication, it should be straightforward to simply
disable the fast path for any blocks currently under construction. Given that the primary
benefit for the fast path is in mapreduce jobs, and mapreduce jobs rarely run on under-construction
blocks, this seems reasonable and avoids a lot of complexity.

h3. Timeouts

Currently, the JNI library has some TODO markings for implementation of timeouts on various
socket operations. These will need to be implemented for proper operation.

h3. Benchmarking

Given that this is a performance patch, benchmarks of the final implementation should be done,
covering both random and sequential IO.

h3. Statistics/metrics tracking

Currently, the datanode maintains metrics about the number of bytes read and written. We no
longer will have accurate information unless we make reports back from the client. Alternatively,
the datanode can use the "length" parameter of OP_READ_UNIX and assume that the client will
always read the entirety of data it has requested. This is not a fair assumption, but the
approximation may be fine.

h3. Audit Logs/ClientTrace

Once the DN has sent a file descriptor for a block to the client, it is impossible to audit
the byte offsets that are read. It is possible for a client to request read access to a small
byte range of a block, receive a socket, and then proceed to read the entire block. We should
investigate whether there is a requirement for byte-range granularity on audit logs and come
up with possible solutions (eg disabling fast path for non-whole-block reads).

h3. File Descriptors held by Zombie Processes

In practice on some clusters, DFSClient processes can stick around as zombie processes. In
the TCP-based DFSClient, these zombie connections are eventually timed out by the DN server.
In this proposed JIRA, the file descriptors would be already transferred, and thus would be
stuck open on the zombie. This will not block file deletion, but does block the reclaiming
of the blocks on the underlying file system. This may cause problems on HDFS instances with
a lot of block churn and a bad zombie problem. Dhruba can possibly elaborate here.

h3. Determining local IPs

In order to determine when to attempt the fast path, the DFSClient needs to know when it is
connecting to a local datanode. This will rarely be a loopback IP address, so we need some
way of determining which IPs are actually local. This will probably necessitate an additional
method or two in NetUtils in order to inspect the local interface list, with some caching

> DFS read performance suboptimal when client co-located on nodes with data
> -------------------------------------------------------------------------
>                 Key: HDFS-347
>                 URL: https://issues.apache.org/jira/browse/HDFS-347
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: George Porter
>         Attachments: HADOOP-4801.1.patch, HADOOP-4801.2.patch, HADOOP-4801.3.patch, local-reads-doc
> One of the major strategies Hadoop uses to get scalable data processing is to move the
code to the data.  However, putting the DFS client on the same physical node as the data blocks
it acts on doesn't improve read performance as much as expected.
> After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem is due
to the HDFS streaming protocol causing many more read I/O operations (iops) than necessary.
 Consider the case of a DFSClient fetching a 64 MB disk block from the DataNode process (running
in a separate JVM) running on the same machine.  The DataNode will satisfy the single disk
block request by sending data back to the HDFS client in 64-KB chunks.  In BlockSender.java,
this is done in the sendChunk() method, relying on Java's transferTo() method.  Depending
on the host O/S and JVM implementation, transferTo() is implemented as either a sendfilev()
syscall or a pair of mmap() and write().  In either case, each chunk is read from the disk
by issuing a separate I/O operation for each chunk.  The result is that the single request
for a 64-MB block ends up hitting the disk as over a thousand smaller requests for 64-KB each.
> Since the DFSClient runs in a different JVM and process than the DataNode, shuttling
data from the disk to the DFSClient also results in context switches each time network packets
get sent (in this case, the 64-kb chunk turns into a large number of 1500 byte packet send
operations).  Thus we see a large number of context switches for each block send operation.
> I'd like to get some feedback on the best way to address this, but I think providing
a mechanism for a DFSClient to directly open data blocks that happen to be on the same machine.
 It could do this by examining the set of LocatedBlocks returned by the NameNode, marking
those that should be resident on the local host.  Since the DataNode and DFSClient (probably)
share the same hadoop configuration, the DFSClient should be able to find the files holding
the block data, and it could directly open them and send data back to the client.  This would
avoid the context switches imposed by the network layer, and would allow for much larger read
buffers than 64KB, which should reduce the number of iops imposed by each read block operation.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message