hadoop-hdfs-issues mailing list archives

From "Jay Booth (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
Date Fri, 12 Mar 2010 20:17:27 GMT

    [ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844658#action_12844658 ]

Jay Booth commented on HDFS-918:
--------------------------------

>>> I think it is very important to have separate pools for each partition.
>> This would be the case if I were using a fixed-size thread pool and a LinkedBlockingQueue - but I'm not, see Executors.newCachedThreadPool().

> hmm.. does it mean that if you have a thousand clients and the load is disk bound, we end up with 1000 threads?

Yeah, although it'll likely turn out to be less than 1000 in practice.  If the requests are all short-lived, it could be significantly fewer than 1000 threads once you account for re-use; if it's 1000 long reads, it'll probably wind up being only a little less, if at all.  The threads themselves are really lightweight: the only resource attached to each one is a ThreadLocal<ByteBuffer(8096)>.  (8k seemed ok for the ByteBuffer because the header+checksums portion is always significantly less than that, and the main block file transfers are done using transferTo.)
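
For illustration, here's a minimal sketch of that per-thread buffer plus transferTo arrangement -- not the patch itself, and the class/method names (ReadTaskSketch, sendBlock, HEADER_BUF) are made up; only the 8k ThreadLocal buffer and FileChannel.transferTo usage correspond to what's described above:

{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ReadTaskSketch {

    // Per-thread scratch buffer for the header + checksum bytes only;
    // 8k is plenty because the block data itself never passes through it.
    private static final ThreadLocal<ByteBuffer> HEADER_BUF =
        new ThreadLocal<ByteBuffer>() {
            @Override protected ByteBuffer initialValue() {
                return ByteBuffer.allocate(8 * 1024);
            }
        };

    // Copy header/checksums through the small per-thread buffer, then hand
    // the block file to the socket with transferTo so the kernel moves the
    // bulk of the data without an extra user-space copy.
    static void sendBlock(FileChannel blockFile, SocketChannel sock,
                          byte[] headerAndChecksums,
                          long offset, long length) throws IOException {
        ByteBuffer buf = HEADER_BUF.get();
        buf.clear();
        buf.put(headerAndChecksums);
        buf.flip();
        while (buf.hasRemaining()) {
            sock.write(buf);
        }
        long pos = offset, remaining = length;
        while (remaining > 0) {
            long sent = blockFile.transferTo(pos, remaining, sock);
            pos += sent;
            remaining -= sent;
        }
    }
}
{code}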

I chose this approach after initially experimenting with a fixed-size threadpool and LinkedBlockingQueue because the handoff is faster and every pending IO request is guaranteed to become an actual disk-read syscall waiting on the operating system as fast as possible.  That way, the operating system decides which disk request to fulfill first, taking advantage of the lower-level optimizations around disk IO.  Since the threads are pretty lightweight and the lower-level calls do a better job of optimal fulfillment, I think this will work better than a fixed-size threadpool, where, for example, 2 adjacent reads from separate threads could be separated from each other in time even though the disk controller might have fulfilled both simultaneously and faster.  This becomes even more important, I think, with the higher 512kb packet size -- those are big chunks of work per syscall that can be optimized by the underlying OS.  Regarding the extra resource allocation for the threads: if we're disk-bound, then generally speaking a few extra memory resources shouldn't be a huge deal -- the gains from dispatching more disk requests in parallel should outweigh the memory allocation and context switch costs.
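
As a rough sketch of the dispatch side (the names here are invented, not from the patch): each accepted read request goes straight to a cached pool, so it reaches a thread -- and therefore a syscall -- without sitting in a JVM-side work queue first.

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CachedDispatchSketch {

    // Idle threads are reused and new ones are created on demand, so no
    // read request ever waits behind others in a work queue inside the JVM.
    private final ExecutorService readPool = Executors.newCachedThreadPool();

    // Hypothetical entry point: hand the read to the pool immediately and
    // let the OS/disk scheduler, not a queue, decide completion order.
    public void dispatch(Runnable readRequest) {
        readPool.submit(readRequest);
    }
}
{code}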

The above is all in theory -- I haven't benchmarked parallel implementations head-to-head.  But certainly for random reads, and likely for longer reads, this approach should get the syscall invoked as fast as possible.  Switching between the two models would be pretty simple: just change the parameters we pass to the ThreadPoolExecutor constructor.
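
To make that constructor-parameter point concrete, the two models differ only in the queue and thread bounds handed to ThreadPoolExecutor -- this is just the standard java.util.concurrent equivalence, not code from the patch:

{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolConfigSketch {

    // What Executors.newCachedThreadPool() builds: no queueing, an idle
    // thread is reused or a new one created for every submitted read.
    static ThreadPoolExecutor cachedStyle() {
        return new ThreadPoolExecutor(0, Integer.MAX_VALUE,
                60L, TimeUnit.SECONDS, new SynchronousQueue<Runnable>());
    }

    // What Executors.newFixedThreadPool(n) builds: at most n reads in
    // flight, everything else waits in the LinkedBlockingQueue.
    static ThreadPoolExecutor fixedStyle(int n) {
        return new ThreadPoolExecutor(n, n,
                0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());
    }
}
{code}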

> Use single Selector and small thread pool to replace many instances of BlockSender for reads
> --------------------------------------------------------------------------------------------
>
>                 Key: HDFS-918
>                 URL: https://issues.apache.org/jira/browse/HDFS-918
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node
>            Reporter: Jay Booth
>             Fix For: 0.22.0
>
>         Attachments: hdfs-918-20100201.patch, hdfs-918-20100203.patch, hdfs-918-20100211.patch, hdfs-918-20100228.patch, hdfs-918-20100309.patch, hdfs-multiplex.patch
>
>
> Currently, on read requests, the DataXCeiver server allocates a new thread per request,
> which must allocate its own buffers and leads to higher-than-optimal CPU and memory usage
> by the sending threads.  If we had a single selector and a small threadpool to multiplex request
> packets, we could theoretically achieve higher performance while taking up fewer resources
> and leaving more CPU on datanodes available for mapred, hbase or whatever.  This can be done
> without changing any wire protocols.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

