hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Vishwajeet Dusane (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-12666) Support Microsoft Azure Data Lake - as a file system in Hadoop
Date Wed, 24 Feb 2016 13:17:19 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15162960#comment-15162960

Vishwajeet Dusane commented on HADOOP-12666:

Extracted comments out 

* AF>This class seems to have serious issues that need addressing:
	1. Local race conditions in caller PrivateAzureDataLakeFileSystem
	[Vishwajeet] - Already explained in the previous replies to comments from [~cnauroth] and
[~eddyxu]. In the worst case, there would be a cache miss and this would not break any functionality.
	2. No mechanism for cache invalidation across nodes in the cluster.
	[Vishwajeet] - Scope of cache is per process. Invalidation across nodes would be taken care
as part of the distributed cache mechanism in the future.
 * AF>Update comment? This uses "adl" scheme, right?
 [Vishwajeet] - Yes
 * AF>What is this class used for? I didn't see any uses.
 [Vishwajeet] - No longer required. This will is removed. 
 * AF>Care to comment why this is in the ..hdfs.web package instead of fs.adl?
 [Vishwajeet] Already explained in the previous comments. In short, requirement to use protected
functionality within `WebHdfsFileSystem` implementation to disable redirect operation, Access
Runner class for configuration.. etc. 
 * AF>?
 [Vishwajeet] Comment was trimmed, updated again. 
 * AF>Due to the bug or due to the fix? The fix was merged in 2.8.0, right?
 * AF>I'm not understanding this last sentence, can you explain?
 [Vishwajeet] Not needed anymore since homedirectory would constructed locally and no back-end
call would be necessary. Updated comment accordingly.
 * AF>Is this a race condition?
	thread 1> getFileStatus(), cache miss
	super.getStatus -> s1
	cache.get() -> null
	thread 2> delete()
	thread 1> cache.put(s1)
	Maybe provide an atomic putIfAbsent() for FileStatusCacheManager. You can
	synchronize on the underlying map object I believe (see
[Vishwajeet] - Already using `syncMap = Collections.synchronizedMap(map);` We are aware of
the limitation of the current implementation. Cache is short lived and would not be persistent
effect. Is there a real user scenario where such issue can surface frequently ? We have recommended
to turn off the `FileStatus` cache feature in case of misbehavior.

 * AF> Seems like there is a less-likely race condition here. (f is replaced by a directory
after checking fs.isFile())
[Vishwajeet] Why there would be race condition?

 * AF>Similar pattern of get/mutate non-atomically repeats here and below.
 [Vishwajeet] Could you please elaborate the issue ? 
 * AF>typo
  [Vishwajeet] Thanks, corrected.
 * AF>Did you guys consider handling this as transparently reconnecting, instead
of doing separate connections for each op? Seems like performance would be a lot
  [Vishwajeet] Exactly the same. Since transparent re-connection would require to pass new
offset and length value again which is separate op. HTTP persistent connection ensures the
same socket is reused so no impact on the performance as well. We have done enough perf test
to ensure.
 * AF>I'd expect you to want a connection per-thread, instead of per-op.
 [Vishwajeet] Connection per-op
 * AF> This case could use some perf optimization. e.g. Three calls to get system time.
+ return fin; + }
 [Vishwajeet] Agree. Corrected.

 * AF> How about adding ADLLogger.logWithTimestamp(). That way, if the logger is disabled,
you don't keep getting system time.
 [Vishwajeet] Removed system.currentTimeinMs in case debug or perf flags are OFF.
 * AF> Redundant check of isLogEnabled() + ADLLogger.log("getFileBlockLocations }
[Vishwajeet] Corrected

 * AF>Just use "name" twice instead of defining host?
 [Vishwajeet] Mainly for readability on block location computation so would not change that.
 * AF> Why the runtime check of a compile-time constant? How about just add a comment near
the definition "must be non-zero" + throw new IllegalArgumentException( + "The block size
for the given file is not a positive number}
[Vishwajeet] Removed.

 * AF> Redundant check of isLogEnabled() + ADLLogger.log("getFileBlockLocations }
[Vishwajeet] Removed 

 * AF>Formatting. Missing newline above.
 [Vishwajeet] Using Apache.xml in intellij for formatting per apache guideline. Will look
at it later in rulset defined in Apache.xml and raise JIRA accordingly. 
 * AF>Why volatile here? Needs comments.
	I have a feeling this is wrong and you need some synchronized blocks below
 [Vishwajeet] - We removed synchronization block for performance reason and used Volatile
for specific variables which require synchronization. Problem could be when same FsInputSteam
instance being used across threads to read from the same file. Couple of variable does not
require to be volatile though, updated code.
 * AF>Probably unique. What is impact if this collides? Why not use a random java UUID?
 [Vishwajeet] - For performance reason we have removed UUID usage. Usage is for telemetry
purpose. It is used with other parameters to identity the request. For the overlapped value,
impact is none.
 * AF>this is not thread safe, but looking at FSInputStream, it appears read() is
 supposed to be.
Seems this class is racy in general?

 [Vishwajeet] - Assumption is FsInputSteam instance API would not be invoked from multiple
threads. Are you aware of such condition which we might encounter? 
 * AF>I didn't have time to review this part thoroughly.. Can take
better look next round.
[Vishwajeet] Looking forward for your comments.

 * AF>Is this shutdown() needed?
 [Vishwajeet] - Do you see issue any issue here ? 
* AF> Curious, why don't you care about timeout expiring? Can't you just remove this whole
try-catch block? The futures are what are ensuring tasks finish below, no? + }
[Vishwajeet] - Decision is to have time bound operation and if not complete in the duration
then fail fast so that application can retry. 

 * AF>Add function comment w/ param and @return explanation?
 [Vishwajeet] - Done
 * AF>Can you explain what this openid does?
 - [Vishwajeet] To be used by ADL back-end for telemetry.
 * AF> humm.. we expect in.read() to return 0, and then return -1? Also, you can remove
the string concatenation in the exception message. + }
 [Vishwajeet] We encountered this condition when requested n number of bytes match the available
n number of bytes however available bytes are (n+1), where +1 is end of network steam marker.
Mainly for reusing socket connection. if end of network steam marker is not read then next
urlConnection instance re initiate new socket connection to back-end server. 
 * AF> As noted before, using filesystem types to achieve logging tags is awkward.
 [Vishwajeet] Additional instrumentation is done in this class for capture perf data hence
the separate filesystem exposed. We need to rework on that, i will file separate JIRA for
the same.
 * AF>Again, need to put these in core-default.xml with documentation and
 [Vishwajeet] Yes, Will add it as part of the patch. All the configuration defined part of
this patch are already lowered case in Path 06, do let me know if you have found any discrepancy?

> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>                 Key: HADOOP-12666
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12666
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
>         Attachments: HADOOP-12666-002.patch, HADOOP-12666-003.patch, HADOOP-12666-004.patch,
HADOOP-12666-005.patch, HADOOP-12666-006.patch, HADOOP-12666-1.patch
>   Original Estimate: 336h
>          Time Spent: 336h
>  Remaining Estimate: 0h
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft Azure Data
Lake Store (ADL) from within Hadoop. This would enable existing Hadoop applications such has
MR, HIVE, Hbase etc..,  to use ADL store as input or output.
> ADL is ultra-high capacity, Optimized for massive throughput with rich management and
security features. More details available at https://azure.microsoft.com/en-us/services/data-lake-store/

This message was sent by Atlassian JIRA

View raw message