hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jay Booth (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
Date Mon, 25 Jan 2010 15:05:35 GMT

    [ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804561#action_12804561

Jay Booth commented on HDFS-918:

Current patch is having some issues in terms of actual use -- seems to pass a lot of tests
but I'm having problems with ChecksumExceptions running actual MR jobs over it, so clearly
it's a work in progress still :)  I have a better patch that fixes some of the issues but
it's still not 100%, so I'll upload with some new tests once I resolve remaining issues. 
Currently passing TestPread, TestDistributedFileSystem, TestDataTransferProtocol and TestClientBlockVerification,
but still getting issues when actually running the thing -- if somebody has recommendations
for other tests to debug with, that'd be very welcome.


1)  On client connect to DataXCeiver server, dispatch a thread as per usual, except now the
thread is extremely short-lived.  It simply registers the connection with server.datanode.ReadServer
and dies.  (It would be more efficient to have some sort of accept loop that didn't spawn
a thread here, but I went with lowest impact integration)

2)  On register of a connection, ReadServer creates a Connection object and registers the
channel with a selector inside of ReadServer.ResponseWriter.  ResponseWriter maintains an
ArrayBlockingQueue<Connection> workQueue and polls the selector, cancelling keys and
adding connections which are ready for write to the work queue.  ReadServer also maintains
a BlockChannelPool, which is a pool of BlockChannel objects -- each BlockChannel represents
the file and meta-file for a given block.

3)  A small Handler pool takes items off of this work queue and calls connection.sendPacket(buffer,
channelPool).  Each handler maintains a DirectByteBuffer, instantiated at startup time, which
it uses for all requests.  

4)  Connection.sendPacket(buffer, channelPool) consults internal state about what needs to
be sent next (response headers, packet headers, checksums, bytes) and sends what it can, updating
internal state variables.  Uses the provided buffer and channelPool to do its work.  Uses
transferTo unless the config property for transferTo is disabled.  Right now it actually sends
2 packets per packet (header+sums and then bytes), once I resolve all correctness bugs it
may be worth combining the two into one packet for small reads.

5)  After work has been done and internal state updated (even if only 1 byte was sent), Handler
re-registers the Connection with ResponseWriter for further writes, or closes it if we're

Once I have this fully working, I'd expect CPU savings from fewer long-running threads and
less garbage collection of buffers, perhaps a small performance boost from the select-based
architecture and using DirectByteBuffers instead of HeapByteBuffers, and a slight reduction
in IOWAIT time under some circumstances because we're pooling file channels rather than re-opening
for every request.  It should also consume far fewer xceiver threads and open file handles
while running -- the pool is capped, so if we start getting crazy numbers of requests, we'll
close/re-open files as necessary to stay under the cap.

As I said, I made the DataXCeiver thread for opReadBlock register the channel and then die.
 This is probably the best way to go even though it's not optimal from a performance standpoint.
 Unfortunately, since DataXCeiver threads close their sockets when they die, I had to put
a special boolean case 'skipClose' to avoid that if op == Op.READ, which is kind of ugly --
recommendations are welcome for what to do here.

Also, as I noted earlier, the BlockChannelPool requires an instance of FSDataset to function,
rather than FSDatasetInterface, because Interface doesn't supply any getFile() methods, just
getInputStream() methods.  Probably the best way to handle this for tests would be to have
the SimulatedFSDataset write actual files to /tmp somewhere and provide handles to those files
when running tests.  Any thoughts from anyone?

Once I get this working, it might be worth exploring using this as a mechanism for repeated
reads over the same datanode connection, which could give some pretty big performance gains
to certain applications.  Upon completion of a request, the ReadServer could simply put the
connections in a pool that's polled every so often for isReadable() -- if it's readable, read
the request and re-register with the ResponseWriter.

Along those lines, once we get there it might wind up being simpler to do the initial read-request
through that mechanism as well, which would mean that we could get rid of some of the messy
integration with DataXceiverServer -- however, it would require opening up another port just
for reads.  What are people's thoughts on that?  I won't make that a goal of this patch but
I'd be curious as to people's thoughts regarding separate ports for read and write (and maybe
op_transfer_block could be handled via IPCServer) -- it could make things simpler in some
ways while making them more of a pain in others.

> Use single Selector and small thread pool to replace many instances of BlockSender for
> --------------------------------------------------------------------------------------------
>                 Key: HDFS-918
>                 URL: https://issues.apache.org/jira/browse/HDFS-918
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node
>            Reporter: Jay Booth
>             Fix For: 0.22.0
>         Attachments: hdfs-multiplex.patch
> Currently, on read requests, the DataXCeiver server allocates a new thread per request,
which must allocate its own buffers and leads to higher-than-optimal CPU and memory usage
by the sending threads.  If we had a single selector and a small threadpool to multiplex request
packets, we could theoretically achieve higher performance while taking up fewer resources
and leaving more CPU on datanodes available for mapred, hbase or whatever.  This can be done
without changing any wire protocols.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message