From: "Jay Booth (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
Date: Mon, 25 Jan 2010 15:05:35 +0000 (UTC)

    [ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804561#action_12804561 ]

Jay Booth commented on HDFS-918:
--------------------------------

The current patch is having some issues in actual use -- it passes a lot of tests, but I'm hitting ChecksumExceptions when running real MR jobs over it, so it's clearly still a work in progress :)  I have a better patch that fixes some of the issues, but it's still not 100%, so I'll upload it with some new tests once I resolve the remaining problems. It currently passes TestPread, TestDistributedFileSystem, TestDataTransferProtocol and TestClientBlockVerification, but I still get failures when actually running the thing -- if anybody has recommendations for other tests to debug with, that'd be very welcome.

ARCHITECTURE

1) On client connect to the DataXceiver server, dispatch a thread as usual, except now the thread is extremely short-lived: it simply registers the connection with server.datanode.ReadServer and dies. (It would be more efficient to have some sort of accept loop that didn't spawn a thread here, but I went with the lowest-impact integration.)

2) On registration of a connection, ReadServer creates a Connection object and registers the channel with a selector inside ReadServer.ResponseWriter. ResponseWriter maintains an ArrayBlockingQueue work queue and polls the selector, cancelling keys and adding connections that are ready for write to the work queue. ReadServer also maintains a BlockChannelPool, a pool of BlockChannel objects -- each BlockChannel represents the block file and meta file for a given block.

3) A small Handler pool takes items off this work queue and calls connection.sendPacket(buffer, channelPool). Each Handler maintains a DirectByteBuffer, instantiated at startup time, which it uses for all requests.

4) Connection.sendPacket(buffer, channelPool) consults internal state about what needs to be sent next (response headers, packet headers, checksums, bytes), sends what it can, and updates its internal state variables. It uses the provided buffer and channel pool to do its work, and uses transferTo unless the transferTo config property is disabled. Right now it actually sends 2 packets per packet (header+sums and then bytes); once I resolve all correctness bugs it may be worth combining the two into one packet for small reads.

5) After work has been done and internal state updated (even if only 1 byte was sent), the Handler re-registers the Connection with the ResponseWriter for further writes, or closes it if we're done. (A rough sketch of this flow is below.)

Once I have this fully working, I'd expect CPU savings from fewer long-running threads and less garbage collection of buffers, perhaps a small performance boost from the select-based architecture and using DirectByteBuffers instead of HeapByteBuffers, and a slight reduction in iowait time under some circumstances because we're pooling file channels rather than re-opening them for every request. It should also consume far fewer xceiver threads and open file handles while running -- the pool is capped, so if we start getting crazy numbers of requests, we'll close/re-open files as necessary to stay under the cap.
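Roughly, the flow looks like the sketch below. This is just an illustrative skeleton, not code from the attached patch: the names (ReadServer, ResponseWriter, Connection, Handler, sendPacket) follow the description above, but the queue capacity, buffer size, and how re-registration is synchronized are simplifications of mine, and the sketch clears interest ops rather than cancelling keys so that re-registering a connection stays simple.

{code:java}
// Illustrative sketch only -- names mirror the description above, details are assumed.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class ReadServerSketch {

  /** One in-flight read request; tracks which part of the response goes out next. */
  static class Connection {
    final SocketChannel channel;
    Connection(SocketChannel channel) { this.channel = channel; }

    /**
     * Send as much as currently possible (response header, packet header,
     * checksums, data) and return true once the whole response has been written.
     * The real sendPacket also takes the channel pool; omitted in this sketch.
     */
    boolean sendPacket(ByteBuffer buf) throws IOException {
      buf.clear();
      // ... fill buf from the block and meta file channels, channel.write(buf),
      // or use transferTo for the data bytes, then update internal state ...
      return true; // placeholder
    }
  }

  /** Single selector thread: hands writable connections to the handler pool. */
  static class ResponseWriter implements Runnable {
    private final Selector selector;
    private final Queue<Connection> pending = new ConcurrentLinkedQueue<Connection>();
    final BlockingQueue<Connection> workQueue = new ArrayBlockingQueue<Connection>(1024);

    ResponseWriter() throws IOException { this.selector = Selector.open(); }

    /** Called by the short-lived xceiver thread, and by handlers re-registering. */
    void register(Connection c) {
      pending.add(c);
      selector.wakeup(); // the selector thread performs the actual register()
    }

    @Override
    public void run() {
      try {
        while (!Thread.currentThread().isInterrupted()) {
          selector.select();
          // Registrations queued by other threads happen on this thread.
          for (Connection c; (c = pending.poll()) != null; ) {
            c.channel.configureBlocking(false);
            c.channel.register(selector, SelectionKey.OP_WRITE, c);
          }
          for (SelectionKey key : selector.selectedKeys()) {
            key.interestOps(0); // don't reselect while a handler owns the connection
            workQueue.put((Connection) key.attachment());
          }
          selector.selectedKeys().clear();
        }
      } catch (IOException e) {
        // sketch: a real server would log and keep going
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  /** Handler: owns one direct buffer allocated at startup and drains the work queue. */
  static class Handler implements Runnable {
    private final ResponseWriter writer;
    private final ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
    Handler(ResponseWriter writer) { this.writer = writer; }

    @Override
    public void run() {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          Connection c = writer.workQueue.take();
          if (c.sendPacket(buf)) {
            c.channel.close();   // fully sent; closing also cancels the selection key
          } else {
            writer.register(c);  // partial write: back onto the selector
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } catch (IOException e) {
          // sketch: drop this connection and keep serving the others
        }
      }
    }
  }
}
{code}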
INTEGRATION

As I said, I made the DataXceiver thread for opReadBlock register the channel and then die. This is probably the best way to go even though it's not optimal from a performance standpoint. Unfortunately, since DataXceiver threads close their sockets when they die, I had to add a special boolean flag 'skipClose' to avoid that when op == Op.READ, which is kind of ugly -- recommendations are welcome for what to do here.

Also, as I noted earlier, the BlockChannelPool requires an instance of FSDataset to function, rather than FSDatasetInterface, because the interface doesn't supply any getFile() methods, just getInputStream() methods. Probably the best way to handle this for tests would be to have SimulatedFSDataset write actual files to /tmp somewhere and provide handles to those files when running tests. Any thoughts from anyone?
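For reference, the channel pool I have in mind is roughly the following (again just a sketch, not the patch itself: the BlockChannel/BlockChannelPool names match the description above, but the long block-id key, the LRU eviction via LinkedHashMap, and passing the File objects in directly -- rather than resolving them through FSDataset's getFile()-style lookups -- are simplifications of mine).

{code:java}
// Illustrative sketch of a capped pool of per-block file channels; details assumed.
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.util.LinkedHashMap;
import java.util.Map;

public class BlockChannelPool {

  /** Open channels for one block: the data file and its checksum (meta) file. */
  public static class BlockChannel {
    final FileChannel data;
    final FileChannel meta;
    BlockChannel(File dataFile, File metaFile) throws IOException {
      this.data = new RandomAccessFile(dataFile, "r").getChannel();
      this.meta = new RandomAccessFile(metaFile, "r").getChannel();
    }
    void close() throws IOException {
      data.close();
      meta.close();
    }
  }

  private final int maxOpen;
  // Access-ordered map: least-recently-used entries are closed once over the cap.
  private final LinkedHashMap<Long, BlockChannel> channels;

  public BlockChannelPool(int maxOpen) {
    this.maxOpen = maxOpen;
    this.channels = new LinkedHashMap<Long, BlockChannel>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<Long, BlockChannel> eldest) {
        if (size() > BlockChannelPool.this.maxOpen) {
          try { eldest.getValue().close(); } catch (IOException ignored) { }
          return true; // stay under the cap by closing the oldest channels
        }
        return false;
      }
    };
  }

  /**
   * Look up (or open) the channels for a block. The dataFile/metaFile parameters
   * stand in for whatever FSDataset file lookup the real pool would use.
   */
  public synchronized BlockChannel get(long blockId, File dataFile, File metaFile)
      throws IOException {
    BlockChannel bc = channels.get(blockId);
    if (bc == null) {
      bc = new BlockChannel(dataFile, metaFile);
      channels.put(blockId, bc);
    }
    return bc;
  }
}
{code}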
FUTURE

Once I get this working, it might be worth exploring using this as a mechanism for repeated reads over the same datanode connection, which could give some pretty big performance gains to certain applications. Upon completion of a request, the ReadServer could simply put the connections in a pool that's polled every so often for isReadable() -- if a connection is readable, read the request and re-register it with the ResponseWriter.

Along those lines, once we get there it might wind up being simpler to do the initial read request through that mechanism as well, which would mean we could get rid of some of the messy integration with DataXceiverServer -- however, it would require opening up another port just for reads. What are people's thoughts on that? I won't make that a goal of this patch, but I'd be curious about people's thoughts regarding separate ports for read and write (and maybe op_transfer_block could be handled via IPCServer) -- it could make things simpler in some ways while making them more of a pain in others.

> Use single Selector and small thread pool to replace many instances of BlockSender for reads
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-918
>                 URL: https://issues.apache.org/jira/browse/HDFS-918
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node
>            Reporter: Jay Booth
>             Fix For: 0.22.0
>
>         Attachments: hdfs-multiplex.patch
>
>
> Currently, on read requests, the DataXCeiver server allocates a new thread per request, which must allocate its own buffers and leads to higher-than-optimal CPU and memory usage by the sending threads. If we had a single selector and a small threadpool to multiplex request packets, we could theoretically achieve higher performance while taking up fewer resources and leaving more CPU on datanodes available for mapred, hbase or whatever. This can be done without changing any wire protocols.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.