hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling
Date Mon, 23 Jul 2012 17:06:34 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13420788#comment-13420788 ]

Todd Lipcon commented on HDFS-3672:
-----------------------------------

Hey Suresh. I agree with all your points above.

One thing that's been talked about in the past is to consider using a local-only block pool
for MR temp storage. That would at least get one of the other major disk users going through
the same code paths.

The other idea we're thinking about is to expose disk statistics, such as current queue length
and utilization for each local disk, as reported by the OS. We're still running some experiments
locally, but our assumption is that, over short time scales, the trailing 0.5 seconds of usage
is a reasonably good predictor of the next 0.5 seconds, given that most Hadoop-style access
is in chunks of 100MB or more.
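
To make the "last window predicts the next window" assumption concrete, here is a minimal
sketch in Java. It is not taken from any patch on this issue: the class name, method names,
and the use of a utilization fraction in [0, 1] are all hypothetical, and it says nothing
about how the samples would actually be collected from the OS.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: use the most recent utilization window as the forecast for the next one. */
class LaggingDiskPredictor {
    // Utilization measured over the window that just ended, per disk, as a fraction in [0, 1].
    private final Map<String, Double> lastWindow = new LinkedHashMap<>();

    /** Record the utilization observed for a disk over the last window (e.g. 0.5 seconds). */
    void recordWindow(String diskId, double utilization) {
        if (utilization < 0.0 || utilization > 1.0) {
            throw new IllegalArgumentException("utilization must be in [0, 1]: " + utilization);
        }
        lastWindow.put(diskId, utilization);
    }

    /** The forecast for the next window is simply the last observed value. */
    double predictNext(String diskId) {
        Double u = lastWindow.get(diskId);
        if (u == null) {
            throw new IllegalArgumentException("no sample recorded for " + diskId);
        }
        return u;
    }

    /** Disk with the lowest predicted utilization; ties go to the disk recorded first. */
    String leastBusyDisk() {
        String best = null;
        double bestUtilization = Double.MAX_VALUE;
        for (Map.Entry<String, Double> e : lastWindow.entrySet()) {
            if (e.getValue() < bestUtilization) {
                bestUtilization = e.getValue();
                best = e.getKey();
            }
        }
        if (best == null) {
            throw new IllegalStateException("no disks recorded");
        }
        return best;
    }
}
```

A caller would refresh each disk's sample once per window and ask for leastBusyDisk() before
issuing a large read. Whether a plain last-value forecast is good enough is exactly what the
experiments mentioned above would have to show.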

So, are you OK with introducing these as Unstable-annotated APIs, perhaps with an extra JavaDoc
warning that they are explicitly experimental and may cease to exist in the future?
                
> Expose disk-location information for blocks to enable better scheduling
> -----------------------------------------------------------------------
>
>                 Key: HDFS-3672
>                 URL: https://issues.apache.org/jira/browse/HDFS-3672
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.0.0-alpha
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-3672-1.patch
>
>
> Currently, HDFS exposes on which datanodes a block resides, which allows clients to make
> scheduling decisions for locality and load balancing. Extending this to also expose on which
> disk on a datanode a block resides would enable even better scheduling, on a per-disk rather
> than coarse per-datanode basis.
> This API would likely look similar to FileSystem#getFileBlockLocations, but also involve
> a series of RPCs to the responsible datanodes to determine disk ids.
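
As an illustration of what per-disk information buys a scheduler, the sketch below assigns
each block read to the replica whose disk has the fewest reads assigned so far. It is only a
sketch under assumptions: the Replica type, the integer disk id, and the assign method are
hypothetical and are not part of the HDFS API or of the attached patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of per-disk (rather than per-datanode) read scheduling. */
class DiskAwareScheduler {
    /** One replica of a block: the datanode holding it and the disk it lives on. */
    static final class Replica {
        final String datanode;
        final int diskId;

        Replica(String datanode, int diskId) {
            this.datanode = datanode;
            this.diskId = diskId;
        }

        /** Identifies a single physical disk across the cluster. */
        String key() {
            return datanode + ":" + diskId;
        }
    }

    /**
     * For each block, in order, choose the replica whose disk currently has the
     * fewest assigned reads. Ties go to the replica listed first.
     */
    static List<Replica> assign(List<List<Replica>> replicasPerBlock) {
        Map<String, Integer> readsPerDisk = new HashMap<>();
        List<Replica> chosen = new ArrayList<>();
        for (List<Replica> replicas : replicasPerBlock) {
            if (replicas.isEmpty()) {
                throw new IllegalArgumentException("block has no replicas");
            }
            Replica best = replicas.get(0);
            for (Replica r : replicas) {
                int candidate = readsPerDisk.getOrDefault(r.key(), 0);
                int current = readsPerDisk.getOrDefault(best.key(), 0);
                if (candidate < current) {
                    best = r;
                }
            }
            readsPerDisk.merge(best.key(), 1, Integer::sum);
            chosen.add(best);
        }
        return chosen;
    }
}
```

With only datanode-level locations, two replicas on the same datanode are indistinguishable;
with disk ids, the second of two reads against that datanode can be steered to the idle disk.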

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
