hadoop-common-issues mailing list archives

From "Vishwajeet Dusane (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-12666) Support Microsoft Azure Data Lake - as a file system in Hadoop
Date Tue, 22 Mar 2016 13:52:25 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206390#comment-15206390 ]

Vishwajeet Dusane commented on HADOOP-12666:
--------------------------------------------


Participants: [~cnauroth], [~fabbri], [~hanm], [~twu], [~chris.douglas], [~vishwajeet.dusane],
[~shrikant].
 
*Meeting agenda* 
*   Discuss current comments and outstanding issues on HADOOP-12666
*   Approach of sub-classing WebHDFS (HDFS-9938)
 
*Packaging:* 
* Given the three options that were discussed in HDFS-9938, Chris N mentioned he was okay
with the comments posted on the Jira. The option to maintain the current approach seems to
be the best way forward in the short term. This is based on the understanding that the current
approach is temporary and the client will evolve to be independent of WebHDFS.  The Jira will
be updated with this information.
* The initial approach of using WebHDFS was a good starting point, but the reviewers feel that
the ADL client should evolve to be independent of WebHDFS.
* With the current approach, changes in WebHDFS will impact the ADL client.
* The recommendation was to publish a long-term plan for a solution independent of WebHDFS,
targeting 2.9 for the separate ADL client.
* Having such a plan would make the community more comfortable accepting the current solution.
 
*Create-Append Semantics:*
* Discussed the overall create/append semantics. A concern was raised that they do not
ensure single-writer semantics.
* Chris N mentioned that this deviates from HDFS, which enforces single-writer
semantics, and also stated that other cloud storage systems take the same approach,
e.g. WASB and S3.
* Some applications do require this capability. Typically these applications
start writing to the same path on recovery (e.g. HBase).
* File systems like WASB have made specific updates to address the needs of certain applications
where handling multiple writers was an issue, e.g. HBase. WASB has implemented a specific
lease implementation for HBase.
* The ADL client implementation also implements a similar lease semantic for HBase, and this
is done specifically for createNonRecursive.
 ** 	It was clarified that the lease ID is generated using a GUID, and there was agreement
on this approach.
 **	This information will be included in a separate document to be uploaded to the Jira (HADOOP-12666)
* Chris N mentioned that the general guideline for applications is to have each instance
write data to its own separate file and then commit by renaming it.
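
The write-then-rename guideline above can be sketched as follows. This is a minimal illustration on a local file system using only the Python standard library, not ADL or Hadoop client code. The function name `commit_by_rename` and the temporary-file naming scheme are hypothetical, and whether a rename is atomic on a particular cloud store is an assumption that would need to be checked for that store.

```python
import os
import tempfile
import uuid


def commit_by_rename(directory: str, final_name: str, data: bytes) -> str:
    """Write data to a writer-private temporary file, then rename it into place."""
    # Each writer gets its own temporary name, so two writers never share a file.
    tmp_path = os.path.join(directory, f".{final_name}.{uuid.uuid4().hex}.tmp")
    final_path = os.path.join(directory, final_name)
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # The rename is the commit point. os.replace overwrites an existing
    # destination; a stricter variant could refuse to commit if it exists.
    os.replace(tmp_path, final_path)
    return final_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = commit_by_rename(d, "part-00000", b"hello")
        print(path, os.listdir(d))
```

Readers of the final path see either the complete file or no file, which is why this pattern does not depend on the store enforcing a single writer.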
 
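
The lease discussion above can be illustrated with a toy in-memory model. The notes only state that the lease ID is generated using a GUID and that the lease applies to createNonRecursive; everything else here (the `ToyStore` class, rejecting a second create while a lease is held, releasing the lease on close) is an assumption made for illustration and is not the ADL client's implementation.

```python
import uuid


class LeaseConflict(Exception):
    """Raised when a caller does not hold the lease for a path."""


class ToyStore:
    """In-memory stand-in for a file store (hypothetical, not the ADL client)."""

    def __init__(self):
        self._files = {}   # path -> bytearray with the file contents
        self._leases = {}  # path -> lease id held by the current writer

    def create_non_recursive(self, path: str) -> str:
        # Assumption: a second create is rejected while a lease is outstanding.
        if path in self._leases:
            raise LeaseConflict(f"{path} is already leased")
        lease_id = str(uuid.uuid4())  # lease id generated from a GUID
        self._leases[path] = lease_id
        self._files[path] = bytearray()
        return lease_id

    def _check(self, path: str, lease_id: str) -> None:
        if path not in self._leases or self._leases[path] != lease_id:
            raise LeaseConflict(f"lease for {path} is not held by this writer")

    def append(self, path: str, lease_id: str, data: bytes) -> None:
        self._check(path, lease_id)
        self._files[path] += data

    def close(self, path: str, lease_id: str) -> None:
        self._check(path, lease_id)
        del self._leases[path]  # release the lease, keep the data

    def read(self, path: str) -> bytes:
        return bytes(self._files[path])
```

In this model a recovering writer that tries to create the same path while the old lease is still held gets a `LeaseConflict` instead of silently interleaving writes with the first writer.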
* All accepted comments have been included in the latest patch 
* Buffer Manager Race condition - has been fixed in the latest patch.
 
*Contract test cases for HDFS do not implement ACL-related test cases, since none of the file
system extensions support them*
** Would need to create new contract tests for ACLs.
 
* Overall, the reviewers on the call had no further objections to the core patches.
Reviewers plan to complete one more review of the updated patches.
* HADOOP-12875 has been updated with a patch that adds the ability to run live tests
using the contract tests; new test cases have been added.
 
*Followup Action items*
* Share/upload document that covers 
** information on read semantics, read ahead buffering, Create/Append semantics 
** Lease id generation to be included in the document 
* Share an overall plan on the roadmap for the ADL client, essentially the plan for
removing the dependency on WebHDFS (a "+1" on the Jira will be contingent on publishing this
plan).
* Next step is for reviewers to complete the review of the new patch (Aaron to help with
Cloudera reviewers).
* Produce a page for alternative file systems
** Document the differences from HDFS; example: HADOOP-12635
* Attach detailed documentation on the file status cache (HADOOP-12876)

> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
>                 Key: HADOOP-12666
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12666
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
>         Attachments: Create_Read_Hadoop_Adl_Store_Semantics.pdf, HADOOP-12666-002.patch,
HADOOP-12666-003.patch, HADOOP-12666-004.patch, HADOOP-12666-005.patch, HADOOP-12666-006.patch,
HADOOP-12666-007.patch, HADOOP-12666-008.patch, HADOOP-12666-009.patch, HADOOP-12666-1.patch
>
>   Original Estimate: 336h
>          Time Spent: 336h
>  Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft Azure Data
Lake Store (ADL) from within Hadoop. This would enable existing Hadoop applications such as
MR, Hive, HBase, etc., to use ADL store as input or output.
>  
> ADL is ultra-high capacity, optimized for massive throughput, with rich management and
security features. More details are available at https://azure.microsoft.com/en-us/services/data-lake-store/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
