hadoop-hdfs-issues mailing list archives

From "RJ Nowling (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-6116) RFC: Make getFileBlockLocations part of the public WebHDFS API
Date Tue, 18 Mar 2014 18:13:46 GMT
RJ Nowling created HDFS-6116:
--------------------------------

             Summary: RFC: Make getFileBlockLocations part of the public WebHDFS API
                 Key: HDFS-6116
                 URL: https://issues.apache.org/jira/browse/HDFS-6116
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: webhdfs
            Reporter: RJ Nowling


Other projects such as Disco, a MapReduce framework written in Erlang / Python, want to support
the HDFS file system.  WebHDFS provides a great means of doing so, but it does not provide
information about data locality as part of the public API.  Information about data locality
is important for scheduling I/O operations and tasks efficiently.

HDFS-2340 added support for getFileBlockLocations, but there is no mention of this support
in the API documentation.  Comments in the source indicate that this is a private API.

The WebHDFS API redirects each I/O request to the datanode containing the first block of the
requested byte range.  Given the block size and file size, a client can abuse this behavior to
query data locality information by issuing one request per block and inspecting the redirect
target, but every such request goes through the namenode, which adds unnecessary overhead.

Thoughts:
1) Why is getFileBlockLocations private?  
2) If there is no good reason, can we make it public?
3) If there are problems that keep it private, can we design an API that could be used by
external users to more efficiently handle data locality issues?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
