hadoop-hdfs-issues mailing list archives

From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling
Date Wed, 01 Aug 2012 13:19:05 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426598#comment-13426598 ]

Arun C Murthy commented on HDFS-3672:

Todd - first of all, no one is *blocking* anything. 

bq. Hey Suresh. I'll try to answer a few of your questions above from the perspective of HBase
and MR.

This jira was started with the premise that this new *feature* was useful to MapReduce and
HBase (http://s.apache.org/NJY). So, I assumed there would be some work in that direction.

If that were the case, I don't see how the suggestion to do the work in a dev-branch before
merging to mainline is *blocking* anything. It is something we have done many times over for
YARN, HDFS HA, etc.

Personally, if anyone was doing this work on MR, I'd be very interested in collaborating,
heck - *learning*. 

However, given my experience on MR, I'd classify it as high-risk, but very, very interesting,
research, since on mid-sized clusters (a few hundred nodes) and beyond, the scheduling overhead
might more than negate the I/O gains. Hence, again, doing that in a dev-branch is absolutely
the right thing to do from a project and risk management perspective.

bq. This isn't the first time an API has been added to the trunk code before downstream users

Yes, this wouldn't be the first time we made *that* mistake. 

Clearly, we have been dealing with the consequences of our previous mistakes for a while now.
Arguing that *that* is a good reason to do the same, again, is not cogent.

bq.  As I mentioned above, we have at least one customer who would like to use this feature
in their code to get better disk efficiency. They need to run against an actual release, not
a dev branch build. This is the primary use case we're targeting right now. I want to be perfectly
honest: the HBase/MR examples I gave above are not on our immediate roadmap; they just serve
as proof that this isn't a one-off/niche improvement.

Now, clearly, you don't plan to do any work on either HBase or MR anytime soon and you have
a different roadmap for a client.

If you had made that clear sooner, the conversation would be different.

Essentially, for the foreseeable future this will be *dead* code which is not going to be
beneficial to anyone in the community... yet, the burden of maintenance etc. will remain.

No, that is not a big deal since this particular change has a fairly small cross-section -
it might be harder to make the argument for a future, more extensive change of this *kind*.
Clearly, if it's a plugin etc., it's easier to digest.

IAC, I don't wish to debate this further. 


Importantly, we should switch this *feature* off by default so that people who use this understand
that this isn't necessarily supported - at least until we have a real use case for this in
the community.

> Expose disk-location information for blocks to enable better scheduling
> -----------------------------------------------------------------------
>                 Key: HDFS-3672
>                 URL: https://issues.apache.org/jira/browse/HDFS-3672
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.0.0-alpha
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-3672-1.patch, hdfs-3672-2.patch, hdfs-3672-3.patch, hdfs-3672-4.patch
> Currently, HDFS exposes on which datanodes a block resides, which allows clients to make
> scheduling decisions for locality and load balancing. Extending this to also expose on which
> disk on a datanode a block resides would enable even better scheduling, on a per-disk rather
> than coarse per-datanode basis.
> This API would likely look similar to FileSystem#getFileBlockLocations, but also involve
> a series of RPCs to the responsible datanodes to determine disk ids.
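
To make the per-disk idea in the description above concrete, here is a minimal sketch of how a client might use disk ids once it has them. It is not the HDFS API and not the patch attached to this issue: the `Replica` record, its fields, and `assign_reads` are hypothetical names, and the disk id is assumed to be an opaque per-datanode integer. The sketch greedily picks, for each block, the replica whose (host, disk) pair has the fewest reads assigned so far.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """Hypothetical record: one replica of a block on one disk of a datanode."""
    block_id: int
    host: str
    disk_id: int  # assumption: opaque per-datanode disk identifier


def assign_reads(replicas):
    """Pick one replica per block, preferring the least-loaded (host, disk) pair."""
    by_block = {}
    for replica in replicas:
        by_block.setdefault(replica.block_id, []).append(replica)

    load = Counter()  # reads already assigned to each (host, disk_id)
    choice = {}
    for block_id in sorted(by_block):
        # Ties are broken by host name, then disk id, to keep the result deterministic.
        best = min(
            by_block[block_id],
            key=lambda r: (load[(r.host, r.disk_id)], r.host, r.disk_id),
        )
        load[(best.host, best.disk_id)] += 1
        choice[block_id] = best
    return choice


replicas = [
    Replica(1, "dn1", 0), Replica(1, "dn2", 0),
    Replica(2, "dn1", 0), Replica(2, "dn2", 1),
    Replica(3, "dn1", 0), Replica(3, "dn1", 1),
]
plan = assign_reads(replicas)
```

Block 3 has both replicas on the same host, so a scheduler that only sees datanodes cannot tell them apart; with disk ids it can steer the read away from the disk that already has one assigned. Whether this bookkeeping pays for itself at cluster scale is exactly the overhead question raised in the comment above.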
