hadoop-common-issues mailing list archives

From "Vishwajeet Dusane (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-12666) Support Microsoft Azure Data Lake - as a file system in Hadoop
Date Tue, 09 Feb 2016 11:24:18 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138804#comment-15138804 ]

Vishwajeet Dusane commented on HADOOP-12666:
--------------------------------------------

Thanks [~eddyxu] for the comments.

{quote}
* You mentioned in the above comments. But PrivateAzureDataLakeFileSystem does not call it
within synchronized calls (e.g., PrivateAzureDataLakeFileSystem#create). Although syncMap is
a synchronizedMap, putFileStatus has multiple operations on syncMap, which can not guarantee
atomicity.

* It might be a better idea to provide atomicity in PrivateAzureDataLakeFileSystem. A couple
of places have multiple cache calls within the same function (e.g., rename()).
{quote}

putFileStatus performs only one operation on syncMap. Could you please elaborate on the scenario
that could be affected? To be certain, are you reviewing HADOOP-12666-005.patch?
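
For reference, the general concern about compound operations on a synchronizedMap can be sketched as follows. This is a minimal illustration, not the code from the patch: the class StatusCacheSketch, its String-valued map, and the rename methods are hypothetical stand-ins. A single put or get is atomic by itself, while a remove followed by a put is atomic only if both calls run under the same lock.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a FileStatus cache; not the class from the patch.
final class StatusCacheSketch {
    private final Map<String, String> syncMap =
        Collections.synchronizedMap(new HashMap<>());

    // Single map operation: atomic on its own, because the wrapper
    // holds its lock for the duration of put().
    void put(String path, String status) {
        syncMap.put(path, status);
    }

    String get(String path) {
        return syncMap.get(path);
    }

    // Compound operation: each call is atomic, but another thread can
    // observe the map between remove() and put().
    void renameUnsafe(String src, String dst) {
        String status = syncMap.remove(src);
        if (status != null) {
            syncMap.put(dst, status);
        }
    }

    // Same logic under the wrapper's own monitor, which is the lock that
    // Collections.synchronizedMap uses internally, so no other operation
    // on syncMap can interleave.
    void renameAtomic(String src, String dst) {
        synchronized (syncMap) {
            String status = syncMap.remove(src);
            if (status != null) {
                syncMap.put(dst, status);
            }
        }
    }
}
```

If every cache method performs exactly one map call, the synchronizedMap wrapper is sufficient; the explicit lock matters only for callers that chain several cache calls, such as a rename.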

{quote}
* It might be a good idea to rename FileStatusCacheManager#getFileStatus, putFileStatus, removeFileStatus
to get/put/remove, because the class name already clearly indicates the context.
{quote}

Agreed. Renamed to get/put/remove.

{quote}
* FileStatusCacheObject can only store an absolute expiration time. And its methods can be
package-level methods.
{quote}

You are right, this is an alternate approach to handling cache expiration time. I think we can
live with the current implementation, which uses a time-to-live check. Please let me know if you
find any issue with that approach.
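
The two expiration styles under discussion can be compared in a short sketch. This is only an illustration under assumed names: CacheEntrySketch and its methods are hypothetical and do not reflect FileStatusCacheObject as written in the patch. A time-to-live check compares the entry's age with a TTL at lookup time, while an absolute expiration stores a fixed deadline when the entry is created.

```java
// Hypothetical cache entry; names are illustrative, not from the patch.
final class CacheEntrySketch<V> {
    private final V value;
    private final long createdAtMillis;
    private final long expiresAtMillis;

    CacheEntrySketch(V value, long nowMillis, long ttlMillis) {
        this.value = value;
        this.createdAtMillis = nowMillis;
        // Absolute deadline, fixed once at insertion time.
        this.expiresAtMillis = nowMillis + ttlMillis;
    }

    V value() {
        return value;
    }

    // Time-to-live check: the caller supplies the TTL at lookup time,
    // so the TTL can change without rewriting stored entries.
    boolean isExpiredByTtl(long nowMillis, long ttlMillis) {
        return nowMillis - createdAtMillis >= ttlMillis;
    }

    // Absolute-expiration check: only the stored deadline is consulted.
    boolean isExpired(long nowMillis) {
        return nowMillis >= expiresAtMillis;
    }
}
```

Both checks agree when the TTL passed at lookup equals the TTL used at insertion; they differ only in where the TTL is stored.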

{quote}
* I saw a few places, e.g., PrivateAzureDataLakeFileSystem#rename/delete, that clear the cache
if the param is a directory. Could you justify the reason behind this? Would it cause noticeable
performance degradation? Or as an alternative, using LinkedList + TreeMap for FileStatusCacheManager?
{quote}

Yes, this is to avoid performance and correctness issues when a directory is renamed or deleted.
In such cases the cache holds stale entries, and they need to be removed so that a getFileStatus
call following the delete/rename (for a file or folder inside the directory) does not return
stale data. At the point of folder deletion, the cache might be holding multiple FileStatus
instances from within the directory. It is more efficient to clear the cache and rebuild it than
to iterate over the entries.
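
The trade-off between clearing the whole cache and removing only a directory's entries can be sketched with a sorted map, in the spirit of the TreeMap alternative raised in the review. This is a hypothetical illustration: PathCacheSketch, its String values, and the assumption that keys are normalized absolute paths without a trailing slash are not taken from the patch.

```java
import java.util.TreeMap;

// Hypothetical path-keyed cache; not FileStatusCacheManager from the patch.
// Assumes keys are normalized absolute paths without a trailing slash.
final class PathCacheSketch {
    private final TreeMap<String, String> entries = new TreeMap<>();

    synchronized void put(String path, String status) {
        entries.put(path, status);
    }

    synchronized String get(String path) {
        return entries.get(path);
    }

    synchronized int size() {
        return entries.size();
    }

    // Approach described in the reply: drop everything and rebuild lazily.
    synchronized void clear() {
        entries.clear();
    }

    // Alternative: remove only the directory and its descendants.
    synchronized void removeSubtree(String dir) {
        entries.remove(dir);
        // Keys under dir all start with dir + "/". In sorted order they form
        // the contiguous range [dir + "/", dir + "0"), because '0' is the
        // character immediately after '/'.
        entries.subMap(dir + "/", dir + "0").clear();
    }
}
```

With a sorted map the subtree removal touches only the affected range, so a sibling such as /ab survives the removal of /a. Whether this is worth the extra structure depends on how large the cache grows and how often directories are renamed or deleted.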

The current cache is a basic implementation for holding FileStatus instances to start with, and
we will continue to enhance it in upcoming changes.

{quote}
* One general question, is this FileStatusCacheManager in HdfsClient? If it is the case, how
do you make them consistent across clients on multiple nodes?
{quote}

FileStatusCacheManager need not be consistent across clients. Each FileStatusCacheManager is built
from the ListStatus and GetFileStatus calls made by its respective client.

{quote}
* Can we use Preconditions here? It will be more descriptive.
{quote}

Are you referring to com.google.common.base.Preconditions? 


> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
>                 Key: HADOOP-12666
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12666
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
>         Attachments: HADOOP-12666-002.patch, HADOOP-12666-003.patch, HADOOP-12666-004.patch,
HADOOP-12666-005.patch, HADOOP-12666-1.patch
>
>   Original Estimate: 336h
>          Time Spent: 336h
>  Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft Azure Data
Lake Store (ADL) from within Hadoop. This would enable existing Hadoop applications such as
MR, Hive, HBase, etc. to use ADL store as input or output.
>  
> ADL is ultra-high capacity, optimized for massive throughput, with rich management and
security features. More details are available at https://azure.microsoft.com/en-us/services/data-lake-store/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
