hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling
Date Mon, 30 Jul 2012 22:58:35 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425344#comment-13425344 ]

Todd Lipcon commented on HDFS-3672:
-----------------------------------

bq. I'll ask again since I didn't get a response - wouldn't it make sense to commit this patch
to a dev-branch. Use that to prototype changes to either MapReduce or HBase and then merge
it in?

There are projects outside of just HBase and MapReduce that would like to run against this,
some of which are not Apache projects. As I mentioned above, we have at least one customer
who would like to use this feature in their code to get better disk efficiency. They need
to run against an actual release, not a dev branch build. This is the primary use case we're
targeting right now. I want to be perfectly honest: the HBase/MR examples I gave above are
not on our immediate roadmap; they just serve as proof that this isn't a one-off/niche improvement.

The other downside of a dev branch is that it's difficult for downstream OSS projects to
integrate against something that's not in a release. HBase already has to build against several
different Maven profiles to support 1.0, 0.23, and 2.0. Adding another profile against a dev
branch that isn't published to Maven is not feasible.

This isn't the first time an API has been added to the trunk code before downstream users
exist. For example, FileContext was in Hadoop for somewhere around a year before MR2 started
to migrate to it. The "New MR API" is still barely used based on my discussions with users.
If there is sufficient motivation (plus customer demand) for an API, and the API is explicitly
marked Unstable, what's the problem with including it? It's entirely new code and has no risk
of destabilizing the existing feature set.

I fear that blocking APIs like this from Apache will only serve to fracture the Hadoop user
base, pushing us back towards the 0.20-era nightmare of distinct distros with distinct non-overlapping
capabilities.

Do you have a technical objection to the new code: for example, a reason why it will destabilize
the existing feature set?
                
> Expose disk-location information for blocks to enable better scheduling
> -----------------------------------------------------------------------
>
>                 Key: HDFS-3672
>                 URL: https://issues.apache.org/jira/browse/HDFS-3672
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.0.0-alpha
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-3672-1.patch, hdfs-3672-2.patch, hdfs-3672-3.patch
>
>
> Currently, HDFS exposes which datanodes a block resides on, which allows clients to make
> scheduling decisions for locality and load balancing. Extending this to also expose which
> disk on a datanode a block resides on would enable even better scheduling, on a per-disk rather
> than coarse per-datanode basis.
> This API would likely look similar to FileSystem#getFileBlockLocations, but also involve
> a series of RPCs to the responsible datanodes to determine disk ids.
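
To make the per-disk versus per-datanode distinction concrete, here is a minimal sketch of a client-side scheduler, assuming the proposed API ends up reporting a (datanode, disk id) pair for each replica. DiskAwareScheduler, Replica, and the integer disk id are hypothetical names used only for illustration; they are not taken from HDFS or from the attached patches.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: none of these types exist in HDFS.
public class DiskAwareScheduler {

    /** One replica of a block: the datanode holding it and an assumed per-node disk id. */
    public static final class Replica {
        public final String host;
        public final int diskId;

        public Replica(String host, int diskId) {
            this.host = host;
            this.diskId = diskId;
        }

        /** Identifies a single disk across the cluster, e.g. "dn1/disk0". */
        public String key() {
            return host + "/disk" + diskId;
        }
    }

    // Number of tasks already assigned to each (datanode, disk) pair.
    private final Map<String, Integer> load = new HashMap<>();

    /**
     * Assigns a task to the replica whose disk currently has the fewest tasks.
     * Ties go to the replica listed first.
     */
    public Replica assign(List<Replica> replicas) {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("block has no replicas");
        }
        Replica best = null;
        int bestLoad = Integer.MAX_VALUE;
        for (Replica r : replicas) {
            int current = load.getOrDefault(r.key(), 0);
            if (current < bestLoad) {
                best = r;
                bestLoad = current;
            }
        }
        load.merge(best.key(), 1, Integer::sum);
        return best;
    }

    public static void main(String[] args) {
        DiskAwareScheduler scheduler = new DiskAwareScheduler();
        // Block A has replicas on two different datanodes.
        List<Replica> blockA = Arrays.asList(new Replica("dn1", 0), new Replica("dn2", 0));
        // Block B has both replicas on dn1, but on different disks.
        List<Replica> blockB = Arrays.asList(new Replica("dn1", 0), new Replica("dn1", 1));

        System.out.println(scheduler.assign(blockA).key());
        System.out.println(scheduler.assign(blockB).key());
        System.out.println(scheduler.assign(blockA).key());
    }
}
```

With only per-datanode information, both replicas of block B look identical (both on dn1), so a scheduler cannot avoid the disk that is already busy; with disk ids it can steer the second task to the idle disk.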


        
