hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
Date Fri, 27 Feb 2015 04:10:05 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339693#comment-14339693

Colin Patrick McCabe commented on HDFS-7836:

bq. These two sound contradictory. I assume the former is correct and we won't really take
the FSN lock. Also I did not get how you will process one stripe at a time without repeatedly
locking and unlocking, since DataNodes wouldn't know about the block to stripe mapping to
order the reports. I guess I will wait to see the code. 

Let me clarify a bit.  It may be necessary to take the FSN lock very briefly once at the start
or end of the block report, but in general, we do not want to do the block report under the
FSN lock as it is done currently.  The main processing of blocks should not happen under the
FSN lock.

The idea is to separate the BlocksMap into multiple stripes.  Which blocks go into which stripes
is determined by blockID.  Something like "blockID mod 5" would work for this.  Clearly, the
incoming block report will contain work for several stripes.  As you mentioned, that necessitates
locking and unlocking.  But we can do multiple blocks each time we grab a stripe lock.  We
simply accumulate a bunch of blocks that we know are in each stripe in a per-stripe buffer
(maybe we do 1000 blocks at a time in between lock release... maybe a little more).  This
is something that we can tune to make a tradeoff between latency and throughput.  This also
means that we will have to be able to handle operations on blocks, new blocks being created,
etc. during a block report.

bq. IIRC the default 64MB protobuf message limit is hit at 9M blocks. Even with a hypothetical
10TB disk and a low average block size of 10MB, we get 1M blocks/disk in the foreseeable future.
With splitting that gets you to a reasonable ~7MB block report per disk. I am not saying no
to chunking/compression but it would be useful to see some perf comparison before we add that

1 M blocks per disk, on 10 disks, and 24 bytes per block, is a 240 MB block report (did I
do that math right?)  That's definitely bigger than we'd like the full BR RPC to be, and compression
can help here.  Or possibly separating the block report into multiple RPCs.  Perhaps one RPC
per storage?

Incidentally, there's nothing magic about the 64 MB RPC size.  I originally added that number
as the max RPC size when I did HADOOP-9676, just by choosing a large power of 2 that seemed
way too big for a real, non-corrupt RPC message. :)  But big RPCs are bad in general because
of the way Hadoop IPC works.  To improve our latency, we probably want a block report RPC
size much smaller 64 MB.

> BlockManager Scalability Improvements
> -------------------------------------
>                 Key: HDFS-7836
>                 URL: https://issues.apache.org/jira/browse/HDFS-7836
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Charles Lamb
>            Assignee: Charles Lamb
>         Attachments: BlockManagerScalabilityImprovementsDesign.pdf
> Improvements to BlockManager scalability.

This message was sent by Atlassian JIRA

View raw message