hadoop-common-issues mailing list archives

From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-12666) Support Microsoft Azure Data Lake - as a file system in Hadoop
Date Thu, 10 Mar 2016 17:24:41 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189587#comment-15189587 ]

Chris Nauroth commented on HADOOP-12666:
----------------------------------------

The create/append/flush sequence is hugely different behavior.  At the protocol layer, there
is the addition of the flush parameter, which is a deviation from stock WebHDFS.  Basically,
any of the custom *Param classes represents a deviation from the WebHDFS protocol: leaseId,
ADLFeatureSet, etc.
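To make the deviation concrete, here is a minimal sketch of the two request shapes. The parameter names (flush, leaseId) come from this discussion; the exact query-string layout is an illustrative assumption, not the real ADL wire format.

```python
from urllib.parse import urlencode

def webhdfs_append_url(host: str, path: str) -> str:
    # Stock WebHDFS APPEND: op is the only required query parameter.
    return f"http://{host}/webhdfs/v1{path}?" + urlencode({"op": "APPEND"})

def adl_append_url(host: str, path: str, lease_id: str, flush: bool) -> str:
    # Hypothetical ADL-style APPEND carrying extra parameters (leaseid, flush)
    # that a stock WebHDFS server would not recognize.
    params = {"op": "APPEND", "leaseid": lease_id, "flush": str(flush).lower()}
    return f"http://{host}/webhdfs/v1{path}?" + urlencode(params)
```

A stock WebHDFS server receiving the second URL would have no defined semantics for the extra parameters, which is the sense in which the two protocols are not a compatible match.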

At the client layer, the aggressive client-side caching and buffering in the name of performance
creates behavior different from stock WebHDFS.  Others and I have called out that, while you
may not observe anything broken right now, that's no guarantee that cache consistency won't
become a problem for certain applications.  This is not a wire protocol difference, but it
is a significant deviation in behavior from stock WebHDFS.
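The consistency concern can be illustrated with a minimal sketch (an assumption for illustration, not the actual ADL client code): small writes are coalesced in client memory and only reach the server on an explicit flush, so data a writer believes it has written is not yet visible to other readers.

```python
class BufferedWriter:
    """Hypothetical write-buffering client: nothing hits the wire until flush()."""

    def __init__(self, send):
        self._send = send        # callable performing the remote append
        self._buf = bytearray()  # client-side cache of pending bytes

    def write(self, data: bytes) -> None:
        self._buf += data        # buffered locally; no remote call yet

    def flush(self) -> None:
        if self._buf:
            self._send(bytes(self._buf))
            self._buf.clear()

# Usage: until flush() is called, the "server" (the sent list) sees nothing.
sent = []
w = BufferedWriter(sent.append)
w.write(b"hello")
w.flush()
```

A stock WebHDFS client, by contrast, has no such window between write and remote visibility, which is why applications depending on read-after-write semantics could behave differently under the ADL client.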

At this point, it appears that the ADL protocol, while heavily inspired by the WebHDFS protocol,
is not really a compatible match.  It is its own protocol with its own unique requirements
for clients to use it correctly and use it well.  Accidentally connecting the ADL client to
an HDFS cluster would be disastrous.  The create/append/flush sequence would put massive,
unsustainable load on the NameNode in terms of RPC calls and edit logging.  Client write latency
would be unacceptable.  Likewise, accidentally connecting the stock WebHDFS client to ADL
seems to yield unacceptable performance for ADL.

It is these large deviations that lead me to conclude the best choice is a dedicated client
distinct from the WebHDFS client code.  Having full control of that client gives us the opportunity
to provide the best possible user experience with ADL.  As I've stated before though, I can
accept a short-term plan of some code reuse with the WebHDFS client.

> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
>                 Key: HADOOP-12666
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12666
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
>         Attachments: HADOOP-12666-002.patch, HADOOP-12666-003.patch, HADOOP-12666-004.patch,
HADOOP-12666-005.patch, HADOOP-12666-006.patch, HADOOP-12666-007.patch, HADOOP-12666-008.patch,
HADOOP-12666-1.patch
>
>   Original Estimate: 336h
>          Time Spent: 336h
>  Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft Azure Data
Lake Store (ADL) from within Hadoop. This would enable existing Hadoop applications such as
MR, Hive, HBase, etc., to use the ADL store as input or output.
>  
> ADL is an ultra-high-capacity store, optimized for massive throughput, with rich management
and security features. More details are available at https://azure.microsoft.com/en-us/services/data-lake-store/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
