hadoop-common-issues mailing list archives

From "Lei (Eddy) Xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-12666) Support Microsoft Azure Data Lake - as a file system in Hadoop
Date Mon, 08 Feb 2016 22:59:40 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137935#comment-15137935 ]

Lei (Eddy) Xu commented on HADOOP-12666:
----------------------------------------

Hey, [~vishwajeet.dusane] 

Thanks for working on this nice patch.

I have a few questions:

* {code:title=FileStatusCacheManager.java}
 * ACID properties are maintained in overloaded api in @see
 * PrivateAzureDataLakeFileSystem class.
{code}

* The comment above states this, but {{PrivateAzureDataLakeFileSystem}} does not call the
cache from within synchronized methods (e.g., {{PrivateAzureDataLakeFileSystem#create}}).
Although {{syncMap}} is a {{synchronizedMap}}, {{putFileStatus}} performs multiple operations
on {{syncMap}}, so it cannot guarantee atomicity.

* It might be a better idea to provide atomicity in {{PrivateAzureDataLakeFileSystem}}. A
couple of places have multiple cache calls within the same function (e.g., {{rename()}}).

* It might be a good idea to rename {{FileStatusCacheManager#getFileStatus, putFileStatus,
removeFileStatus}} to {{get/put/remove}}, because the class name already clearly indicates
the context.

* {{FileStatusCacheObject}} could store only an absolute expiration time, and its methods can
be package-private.

* I saw a few places, e.g., {{PrivateAzureDataLakeFileSystem#rename/delete}}, that clear the
whole cache if the parameter is a directory. Could you explain the reasoning behind this? Would
it cause noticeable performance degradation? As an alternative, how about using a LinkedList +
TreeMap for {{FileStatusCacheManager}}?

* One general question: does this {{FileStatusCacheManager}} live in {{HdfsClient}}? If so,
how do you keep the caches consistent across clients on multiple nodes?

* Related to the question above, could you provide a reference architecture for running a
cluster on Azure Data Lake?

* {code}
        if (b == null) {
          throw new NullPointerException();
        } else if (off < 0 || len < 0 || len > b.length - off) {
          throw new IndexOutOfBoundsException();
        } else if (len == 0) {
          return 0;
        }
{code}

Can we use {{Preconditions}} here? It would be more descriptive.
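
To illustrate the kind of argument checking meant here, a minimal sketch follows. It uses the JDK's {{java.util.Objects}} helpers (Java 9+) so that it compiles on its own; with Guava, the analogous calls would be {{Preconditions.checkNotNull}} and {{Preconditions.checkPositionIndexes}}. The class and method names ({{ReadArgs}}, {{isEmptyRead}}) are hypothetical and do not come from the patch.

```java
import java.util.Objects;

// Hypothetical helper; the names here are illustrative and not from the patch.
final class ReadArgs {
  private ReadArgs() {}

  /**
   * Validates read(byte[], int, int)-style arguments.
   * Returns true when there is nothing to read (len == 0).
   */
  static boolean isEmptyRead(byte[] b, int off, int len) {
    // Throws NullPointerException carrying the given message.
    Objects.requireNonNull(b, "destination buffer must not be null");
    // Throws IndexOutOfBoundsException if off < 0, len < 0,
    // or off + len > b.length.
    Objects.checkFromIndexSize(off, len, b.length);
    return len == 0;
  }
}
```

A caller would then write {{if (ReadArgs.isEmptyRead(b, off, len)) return 0;}} in place of the if/else chain.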

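On the atomicity point about {{putFileStatus}} above, here is a minimal sketch of the concern, not the patch's actual code. It assumes a cache that maps a path to an absolute expiration time and a put that first reads the current entry; {{StatusCacheSketch}} and {{putIfFresher}} are hypothetical names. Each call on a {{Collections.synchronizedMap}} is thread-safe by itself, but a get followed by a put is a compound action, so the sketch holds the map's own monitor around the pair.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the cache in the patch; names and fields are
// illustrative only.
final class StatusCacheSketch {
  private final Map<String, Long> syncMap =
      Collections.synchronizedMap(new HashMap<String, Long>());

  /** Stores the entry only if it expires later than the one already cached. */
  boolean putIfFresher(String path, long expiresAtMillis) {
    // The wrapper returned by synchronizedMap locks on itself, so holding
    // that same monitor makes the get-then-put sequence atomic with respect
    // to every other call on syncMap.
    synchronized (syncMap) {
      Long current = syncMap.get(path);
      if (current != null && current >= expiresAtMillis) {
        return false;
      }
      syncMap.put(path, expiresAtMillis);
      return true;
    }
  }

  /** Single map call; already thread-safe without extra locking. */
  Long get(String path) {
    return syncMap.get(path);
  }
}
```

Without the {{synchronized}} block, two threads could both read the old entry and then overwrite each other's update. The same reasoning applies to methods such as {{rename()}} that make several cache calls in sequence.
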
> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
>                 Key: HADOOP-12666
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12666
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
>         Attachments: HADOOP-12666-002.patch, HADOOP-12666-003.patch, HADOOP-12666-004.patch,
HADOOP-12666-005.patch, HADOOP-12666-1.patch
>
>   Original Estimate: 336h
>          Time Spent: 336h
>  Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft Azure Data
Lake Store (ADL) from within Hadoop. This would enable existing Hadoop applications such as
MR, Hive, HBase, etc., to use the ADL store as input or output.
>  
> ADL is an ultra-high-capacity store, optimized for massive throughput, with rich management
and security features. More details are available at https://azure.microsoft.com/en-us/services/data-lake-store/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
