hadoop-hdfs-issues mailing list archives

From "Jay Booth (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
Date Thu, 11 Mar 2010 19:14:27 GMT

    [ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844174#action_12844174 ]

Jay Booth commented on HDFS-918:

Yeah, it only uses nonblocking pread ops on the block and block.meta files.  It sends the
packet header and checksums in one packet (maybe just part of the checksums if the TCP buffer
was full), then repeatedly makes requests to send PACKET_LENGTH (default 512kb) bytes until
they're sent.  When I had some trace logging enabled, I could see the TCP window scale up:
the first request sent 96k in a packet, then it scaled up to 512k per packet after a few.
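To make the pread part concrete: java.nio's positional FileChannel.read(ByteBuffer, long) doesn't move the channel's file position, which is what lets one shared channel serve many connections at once. A minimal standalone sketch (my own example, not code from the patch):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PreadDemo {
    // Positional read: never touches the channel's file position, so a
    // single shared FileChannel can serve many readers concurrently.
    static ByteBuffer pread(FileChannel ch, long pos, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        while (buf.hasRemaining()) {
            int n = ch.read(buf, pos + buf.position()); // pread-style call
            if (n < 0) break; // EOF
        }
        buf.flip();
        return buf;
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("block", ".dat");
        Files.write(p, "0123456789abcdef".getBytes(StandardCharsets.US_ASCII));
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            ByteBuffer b = pread(ch, 10, 6);
            System.out.println(StandardCharsets.US_ASCII.decode(b)); // abcdef
        }
        Files.delete(p);
    }
}
```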

Here's a (simplified) breakdown of the control structures and main loop:

DataXceiverServer -- accepts conns, creates thread per conn
  From thread:
    read OP, blocking
    if we're a read request and multiplex is enabled, delegate to MultiplexedBlockSender;
    otherwise instantiate DataXceiver (which now takes op as an arg) and call xceiver.run()
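Roughly, in code, the dispatch looks like the following. This is a hypothetical sketch -- the interface names and op code value are mine for illustration, not the patch's:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Sketch of the accept/dispatch flow described above.
public class DispatchSketch {
    public static final int OP_READ_BLOCK = 81; // placeholder op code

    public interface Multiplexer { void register(Socket conn, int op); }
    public interface XceiverFactory { Runnable create(Socket conn, int op); }

    public static void serve(ServerSocket server, boolean multiplexEnabled,
                             Multiplexer mux, XceiverFactory factory) throws IOException {
        while (!server.isClosed()) {
            final Socket conn = server.accept(); // one thread per accepted conn
            new Thread(() -> {
                try {
                    int op = conn.getInputStream().read(); // read OP, blocking
                    if (op == OP_READ_BLOCK && multiplexEnabled) {
                        mux.register(conn, op);          // selector takes the socket over
                    } else {
                        factory.create(conn, op).run();  // classic thread-per-request path
                    }
                } catch (IOException e) {
                    try { conn.close(); } catch (IOException ignored) {}
                }
            }).start();
        }
    }
}
```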

MultiplexedBlockSender  // maintains ExecutorService, SelectorThread and exposes public register()
    register(Socket conn);  // configures nonblocking, sets up Connection object, dispatches
the first packet-send as an optimization, then puts the Future<Connection> in an inbox
for the selector thread
  SelectorThread  // maintains Selector, BlockingQueue<Future<Connection>> inbox,
LinkedList<Future<Connection>> processingQueue
    main loop:
      1)  pull all futures from inbox, add to processingQueue
      2)  iterate/poll/remove Futures from processingQueue, then close/re-register those that
finished sending a packet as appropriate (linear time, but pretty fast)
      3)  select
      4)  dispatch selected connections, add their Futures to processingQueue
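The four steps above could look roughly like this (a simplified, hypothetical sketch -- class and method names are mine, not the patch's). The point of the inbox plus processing queue is that every Selector mutation happens on the selector thread itself:

```java
import java.io.IOException;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.*;

public class SelectorLoopSketch implements Runnable {
    public interface PacketSender {
        boolean sendPacket(SelectableChannel ch) throws IOException; // true = last packet sent
    }
    private static final class Done {
        final SelectableChannel ch; final boolean finished;
        Done(SelectableChannel ch, boolean finished) { this.ch = ch; this.finished = finished; }
    }

    private final Selector selector;
    private final ExecutorService pool = Executors.newFixedThreadPool(2); // small thread pool
    private final BlockingQueue<SelectableChannel> inbox = new LinkedBlockingQueue<>();
    private final List<Future<Done>> processing = new LinkedList<>();
    private final PacketSender sender;
    private volatile boolean running = true;

    public SelectorLoopSketch(PacketSender sender) throws IOException {
        this.selector = Selector.open();
        this.sender = sender;
    }

    public void register(SelectableChannel ch) throws IOException {
        ch.configureBlocking(false);
        inbox.add(ch);            // selector thread does the actual Selector registration
        selector.wakeup();
    }

    public void stop() { running = false; selector.wakeup(); pool.shutdown(); }

    @Override public void run() {
        while (running) {
            try {
                // 1) pull newly registered connections from the inbox
                SelectableChannel ch;
                while ((ch = inbox.poll()) != null) ch.register(selector, SelectionKey.OP_WRITE);
                // 2) poll in-flight sends; close finished ones, re-arm the rest
                Iterator<Future<Done>> it = processing.iterator();
                while (it.hasNext()) {
                    Future<Done> f = it.next();
                    if (!f.isDone()) continue;
                    Done d = f.get(); it.remove();
                    if (d.finished) d.ch.close();
                    else d.ch.keyFor(selector).interestOps(SelectionKey.OP_WRITE);
                }
                // 3) select, with a timeout so step 2 runs even without new events
                selector.select(100);
                // 4) dispatch writable connections to the pool
                Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
                while (keys.hasNext()) {
                    SelectionKey key = keys.next(); keys.remove();
                    key.interestOps(0); // suppress re-selection while the send is in flight
                    SelectableChannel c = key.channel();
                    processing.add(pool.submit(() -> new Done(c, sender.sendPacket(c))));
                }
            } catch (Exception e) {
                running = false; // a real impl would close remaining conns here
            }
        }
    }
}
```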

  Connection.sendPacket(BlockChannelPool, ByteBuffer)  // workhorse method, invoked via Callable
     maintains a bunch of internal state variables per connection
     fetches BlockChannel object from BlockChannelPool -- BlockChannel only exposes p-read
methods for underlying channels
     buffers packet header and sums, sends, records how much successfully sent -- if less
than 100%, return and wait for writable
     tries to send PACKET_LENGTH bytes from main file via transferTo, if less than 100%, return
and wait for writable
     marks self as either FINISHED or READY, depending on whether that was the last packet
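The partial-write handling in that workhorse method could be sketched like so (again hypothetical and simplified -- field and enum names are mine):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

// Sketch of one sendPacket invocation against a nonblocking socket.
public class SendPacketSketch {
    public enum State { WAIT_WRITABLE, READY, FINISHED }

    public static class Conn {
        public WritableByteChannel out;
        public ByteBuffer headerAndSums; // packet header + checksums, pre-buffered
        public long offset;              // next byte of block data to send
        public long length;              // total block bytes to send
        public long packetLen = 512 * 1024;
        public long packetSent;          // data bytes already sent for this packet
    }

    public static State sendPacket(Conn c, FileChannel block) throws IOException {
        // 1) header + checksums go out as one buffered write
        if (c.headerAndSums.hasRemaining()) {
            c.out.write(c.headerAndSums);
            if (c.headerAndSums.hasRemaining()) return State.WAIT_WRITABLE; // TCP buffer full
        }
        // 2) up to PACKET_LENGTH bytes of block data via zero-copy transferTo
        long want = Math.min(c.packetLen - c.packetSent, c.length - c.offset);
        while (want > 0) {
            long n = block.transferTo(c.offset, want, c.out);
            if (n == 0) return State.WAIT_WRITABLE; // not writable; selector re-arms us
            c.offset += n; c.packetSent += n; want -= n;
        }
        c.packetSent = 0;
        return (c.offset >= c.length) ? State.FINISHED : State.READY;
    }
}
```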

Regarding file IO, I don't know if it's faster to send the packet header as its own 13-byte
packet and use transferTo for the meta file, or to do what I'm doing now and buffer them into
one packet.  I feel like it'll be a wash, or at any rate a minor difference, because the
checksums are so much smaller than the main data.
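For scale: assuming the HDFS defaults of 512 data bytes per checksum chunk and 4-byte CRC32 checksums (my assumption of the config, not stated above), the checksum overhead works out to under 1% of the payload:

```java
public class ChecksumOverhead {
    // Bytes of checksum data needed to cover dataBytes of payload,
    // assuming 4-byte CRC32 checksums, one per bytesPerChecksum chunk.
    public static long checksumBytes(long dataBytes, int bytesPerChecksum) {
        long chunks = (dataBytes + bytesPerChecksum - 1) / bytesPerChecksum; // round up
        return chunks * 4;
    }

    public static void main(String[] args) {
        // A full 512 KB packet carries 4 KB of checksums: ~0.8% of the payload.
        System.out.println(checksumBytes(512 * 1024, 512)); // 4096
    }
}
```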

What do people think about a test regime for this?  It's a really big set of changes, but it
opens up a lot of doors (particularly connection re-use with that register() paradigm), seems
to perform equal or better depending on the case, gets a big win on open file descriptors, and
factors all of the server-side protocol logic into one method instead of spreading it across
several classes.

I certainly understand being hesitant to commit such a big change without some pretty extensive
testing, but if anyone had any direction as to what they'd like to see tested, that'd be awesome.
 I'm already planning on setting up some disk-bound benchmarks now that I've tested network-bound
ones.  Anything else that people want to see?  It seems to pass all unit tests; my last run
had a couple of seemingly pre-existing failures, but 99% of them passed.  I guess I should do
another full run and account for any that don't pass while I'm at it.

> Use single Selector and small thread pool to replace many instances of BlockSender for reads
> --------------------------------------------------------------------------------------------
>                 Key: HDFS-918
>                 URL: https://issues.apache.org/jira/browse/HDFS-918
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node
>            Reporter: Jay Booth
>             Fix For: 0.22.0
>         Attachments: hdfs-918-20100201.patch, hdfs-918-20100203.patch, hdfs-918-20100211.patch,
hdfs-918-20100228.patch, hdfs-918-20100309.patch, hdfs-multiplex.patch
> Currently, on read requests, the DataXceiverServer allocates a new thread per request;
each thread must allocate its own buffers, which leads to higher-than-optimal CPU and memory
usage by the sending threads.  If we had a single selector and a small thread pool to multiplex
request packets, we could theoretically achieve higher performance while taking up fewer
resources and leaving more CPU on datanodes available for mapred, hbase or whatever.  This
can be done without changing any wire protocols.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
