hadoop-hdfs-issues mailing list archives

From "Tsz Wo (Nicholas), SZE (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2316) webhdfs: a complete FileSystem implementation for accessing HDFS over HTTP
Date Tue, 01 Nov 2011 23:05:32 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141727#comment-13141727 ]

Tsz Wo (Nicholas), SZE commented on HDFS-2316:


> "<namenode>:<port>" and "http://<host>:<port>" seem to be used
interchangeably. We should be consistent where possible.

You are right.  I should use <host>:<port> only.

> Why doesn't "curl -i -L "http://<host>:<port>/webhdfs/<path>" just
work? Do we really need to specify op=OPEN for this very simple, common case?

The op parameter does not have a default value.  I think a default could be confusing: if
we forgot to add the op parameter, the request would become a totally different operation.
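
As a rough illustration of the no-default rule, here is a minimal Python sketch of a client-side URL builder that refuses to guess the operation. The helper name webhdfs_url, the example host and port, and the /webhdfs/<path> layout (taken from the URL quoted in the question) are assumptions for illustration, not the final API.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op=None, **params):
    """Build a webhdfs-style URL; hypothetical helper, not part of Hadoop."""
    if not op:
        # No default: a missing op is an error rather than a silent OPEN.
        raise ValueError("op parameter is required")
    query = urlencode({"op": op, **params})
    # The /webhdfs/<path> layout follows the URL quoted in the discussion.
    return f"http://{host}:{port}/webhdfs/{quote(path.lstrip('/'))}?{query}"
```

With this shape, a caller who forgets op gets an immediate error instead of a request that silently performs some other operation.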

> I believe "http://<datanode>:<path>" should be "http://<datanode>:<port>"
in append.

Good catch!

> Need format of responses spelled out.
> It would be nice if we could document the possible error responses as well.

I will post an updated doc with JSON responses and error responses soon.

> Since a single datanode will be performing the write of a potentially large file, does
that mean that file will have an entire copy on that node (due to block placement strategies)?
That doesn't seem desirable..

It is probably the case.  We may change the block placement strategies as an improvement later.

> Is a SHORT sufficient for buffersize?

It should be INT.
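
To see why, assume SHORT means a 16-bit signed integer and INT a 32-bit signed integer, as in Java. The sketch below uses Python's struct module to check which buffer sizes fit in each width; the helper name fits and the sample sizes are illustrative only.

```python
import struct


def fits(fmt, value):
    """Return True if value can be packed with the given struct format."""
    try:
        struct.pack(fmt, value)
        return True
    except struct.error:
        return False


# ">h" is a 16-bit signed integer, ">i" a 32-bit signed integer.
small_fits_short = fits(">h", 4096)    # a 4 KB buffer fits in 16 bits
large_fits_short = fits(">h", 65536)   # a 64 KB buffer does not
large_fits_int = fits(">i", 65536)     # but it fits in 32 bits
```

Any buffer size above 32767 bytes overflows the 16-bit type, which is the motivation for INT.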

> Do we need a renewlease? How will very slow writers be handled?

A slow writer sends data to one of the datanodes using HTTP.  That datanode uses a DFSClient
to write the data, and the DFSClient renews the lease on behalf of the writer.

> Once I have file block locations, can I go directly to those datanodes to retrieve rather
than using content_range and always following a redirect?

Yes.  Clients could get block locations, construct the URLs themselves and then talk to the
datanodes directly.  We should have an API to support this.  E.g. it would be better for
GETFILEBLOCKLOCATIONS to return a list of URLs directly.
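
A minimal sketch of what such client-side URL construction might look like follows. The input shape (a list of dicts with offset, length and hosts), the offset and length query parameters, the example host names, and the helper name block_urls are all hypothetical; the actual response format had not been specified at this point in the discussion.

```python
from urllib.parse import quote, urlencode


def block_urls(path, blocks):
    """Return one datanode URL per block; hypothetical input and URL layout."""
    urls = []
    for block in blocks:
        # Assumed shape: {"offset": int, "length": int, "hosts": ["host:port", ...]}
        datanode = block["hosts"][0]  # naive choice: first listed replica
        query = urlencode(
            {"op": "OPEN", "offset": block["offset"], "length": block["length"]}
        )
        urls.append(f"http://{datanode}/webhdfs/{quote(path.lstrip('/'))}?{query}")
    return urls
```

Returning ready-made URLs from the server would spare every client from repeating this logic.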

GETFILEBLOCKLOCATIONS returns a LocatedBlocks structure which is not easy to use.  I am changing

> Do we need flush/sync?

Since clients use HTTP, they have no way to call hflush.  Let's leave this
as a future improvement.

> webhdfs: a complete FileSystem implementation for accessing HDFS over HTTP
> --------------------------------------------------------------------------
>                 Key: HDFS-2316
>                 URL: https://issues.apache.org/jira/browse/HDFS-2316
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>         Attachments: WebHdfsAPI20111020.pdf
> We currently have hftp for accessing HDFS over HTTP.  However, hftp is a read-only FileSystem
and does not provide "write" access.
> In HDFS-2284, we propose to have webhdfs for providing a complete FileSystem implementation
for accessing HDFS over HTTP.  This is the umbrella JIRA for the tasks.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira