hadoop-common-issues mailing list archives

From "Vishwajeet Dusane (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-12666) Support Microsoft Azure Data Lake - as a file system in Hadoop
Date Thu, 25 Feb 2016 15:07:18 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167293#comment-15167293
] 

Vishwajeet Dusane commented on HADOOP-12666:
--------------------------------------------

 [~fabbri] Thanks a lot for your comments.

h6. For FileStatus Cache - I agree that the race conditions you describe can occur. My question
about your concern is whether they could break any functionality. I think they would not
break any common functionality, based on the variety of Hadoop applications we have executed
with this code.

Let me first describe the cache and then break the discussion down by scenario.
** *What is FileStatus Cache?* 
	*** FileStatus cache is a simple process-level cache that mirrors backend storage FileStatus
objects.
	*** The time to live of a cached FileStatus object is limited: 5 seconds by default, configurable
through core-site.xml.
	*** FileStatus objects are stored in a synchronized LinkedHashMap, where the key is the fully
qualified file path and the value is the FileStatus Java object along with its time-to-live information.
	*** The cache is populated from successful responses to GetFileStatus and ListStatus
calls for existing files/folders. Nonexistent files/folders are not kept in the cache.
	*** The motivation for the cache is to avoid repeated GetFileStatus calls to the ADL backend
and thereby gain better performance at job startup and during execution.
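To make the description above concrete, here is a minimal sketch of such a cache. It is not the code from the patch: the class name TtlStatusCache, its method signatures, and the explicit nowMillis parameter (used so that expiry can be shown without sleeping) are all assumptions made for illustration, and the value type is generic so that a plain String can stand in for a FileStatus object.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the cache described above; not the patch's actual class. */
class TtlStatusCache<V> {
    /** A cached value together with its absolute expiry time. */
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    // Each individual call on the wrapper is synchronized on the wrapper itself.
    private final Map<String, Entry<V>> entries =
        Collections.synchronizedMap(new LinkedHashMap<String, Entry<V>>());

    /** Stores a value under a fully qualified path for ttlSeconds seconds. */
    void put(String qualifiedPath, V value, int ttlSeconds, long nowMillis) {
        entries.put(qualifiedPath, new Entry<>(value, nowMillis + ttlSeconds * 1000L));
    }

    /** Returns the cached value, or null if it is absent or has expired. */
    V get(String qualifiedPath, long nowMillis) {
        Entry<V> entry = entries.get(qualifiedPath);
        if (entry == null) {
            return null;
        }
        if (nowMillis >= entry.expiresAtMillis) {
            // Drop only the entry we inspected, not one another thread just put.
            entries.remove(qualifiedPath, entry);
            return null;
        }
        return entry.value;
    }

    /** Invalidates a path, for example after a delete, rename, or create. */
    void remove(String qualifiedPath) {
        entries.remove(qualifiedPath);
    }
}
```

With this sketch, an entry stored at time 0 with a 5-second time to live is returned by get up to 4999 ms and reported as missing from 5000 ms onwards, at which point the caller would go to the backend.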
The scenarios that may occur are as follows.
** *Scenario 1 : Concurrent get request for the same FileStatus object*
	*** Multiple threads try to access the same FileStatus object.
	Example: GetFileStatus calls for path /a.txt from multiple threads within a process when the
FileStatus instance is present in the cache.
	*** This should not be a problem; a valid FileStatus object is returned to the caller in every thread.
** *Scenario 2 : Concurrent put request for the same FileStatus object*
	*** Multiple threads update the same FileStatus object.
	{code:java}
	public String thread1()
	{
	    // FileStatus fileStatus - For storage filepath /a.txt 
		...
		fileStatusCacheManager.put(fileStatus,5); // Race condition
		...
	}
	...
	public String thread2()
	{
	    // FileStatus fileStatus - For storage file /a.txt 
		...
		fileStatusCacheManager.put(fileStatus,5); // Race condition
		...
	}
	{code}
	*** Whichever thread wins the race, the metadata in the FileStatus instance is the same for the
same file /a.txt.
	*** Hence the most recently written value for /a.txt is a valid value either way.
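As a rough illustration of Scenario 2, the sketch below runs two threads that put the same metadata for /a.txt into a synchronized LinkedHashMap. The class ConcurrentPutDemo and the String value standing in for a FileStatus object are assumptions for illustration only; the patch's fileStatusCacheManager is not used here.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical demo; the String value stands in for a FileStatus object. */
class ConcurrentPutDemo {
    /** Runs two concurrent puts for /a.txt and returns the value left in the cache. */
    static String run() {
        Map<String, String> cache =
            Collections.synchronizedMap(new LinkedHashMap<String, String>());
        // Both threads cache the same metadata for the same path.
        Runnable putter = () -> cache.put("/a.txt", "status-of-a.txt");
        Thread thread1 = new Thread(putter);
        Thread thread2 = new Thread(putter);
        thread1.start();
        thread2.start();
        try {
            thread1.join();
            thread2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for puts", e);
        }
        // Whichever put ran last, the map holds exactly one entry for the path.
        if (cache.size() != 1) {
            throw new IllegalStateException("expected a single cache entry");
        }
        return cache.get("/a.txt");
    }
}
```

Because both writers supply the same metadata, the order in which the two puts are applied does not change what a later reader sees.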
** *Scenario 3 : Concurrent get/put request for the same FileStatus object*
	{code:java}
	public String thread1()
	{
	    // FileStatus fileStatus - For storage filepath /a.txt 
		...
		fileStatusCacheManager.put(fileStatus,5); // Race condition
		...
	}
	...
	public String thread2()
	{
	    Path f = new Path("/a.txt");
		...
		FileStatus fileStatus = fileStatusCacheManager.get(makeQualified(f)); // Race condition
		...
	}
	{code}
	*** Depending on the order of execution, thread2 may or may not see the latest value written by
thread1. Even synchronized blocks cannot guarantee that.
	*** In the worst case thread2 gets null, i.e. the FileStatus object for /a.txt does not exist in the
cache, so thread2 falls back to calling GetFileStatus on the ADL backend.
	*** This does not break any functionality either.
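The fallback in Scenario 3 can be sketched as a read-through lookup. This is only an assumed shape, not the patch's code: ReadThroughLookup, its backend function (a stand-in for the ADL GetFileStatus call), and the backendCalls counter are hypothetical names, and String values again stand in for FileStatus objects.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical read-through lookup; a cache miss simply costs one backend call. */
class ReadThroughLookup {
    private final Map<String, String> cache =
        Collections.synchronizedMap(new LinkedHashMap<String, String>());
    // Stand-in for the ADL GetFileStatus call; returns null for a nonexistent path.
    private final Function<String, String> backend;
    // Counts backend round trips so the effect of the cache is observable.
    final AtomicInteger backendCalls = new AtomicInteger();

    ReadThroughLookup(Function<String, String> backend) {
        this.backend = backend;
    }

    String getFileStatus(String qualifiedPath) {
        String cached = cache.get(qualifiedPath);
        if (cached != null) {
            return cached;
        }
        // Cache miss (possibly because a concurrent put has not landed yet):
        // fall back to the backend instead of failing.
        backendCalls.incrementAndGet();
        String fresh = backend.apply(qualifiedPath);
        if (fresh != null) {
            // Only existing files are cached; misses on the backend are not.
            cache.put(qualifiedPath, fresh);
        }
        return fresh;
    }
}
```

In this sketch a reader that loses the race with a concurrent put still gets a correct answer; it only pays for one extra backend call.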
** *Scenario 4: Concurrent get/remove request for the same FileStatus object*
	{code:java}
	public String thread1()
	{
	    Path f = new Path("/a.txt");
		...
		// Cache cleanup caused by a delete/rename/create operation on /a.txt. Race condition
		fileStatusCacheManager.remove(makeQualified(f));
		...
	}
	...
	public String thread2()
	{
	    Path f = new Path("/a.txt");
		...
		FileStatus fileStatus = fileStatusCacheManager.get(makeQualified(f)); // Race condition
		...
	}
	{code}
	*** Depending on the order of execution, thread2 may get stale information from the cache. As in
the scenario above, synchronized blocks do not solve this either.
	*** This situation is unavoidable with or without the FileStatus cache, and with or without the
ADL storage backend.
** *Scenario 5: Concurrent put/remove request for the same FileStatus object*
	{code:java}
	public String thread1()
	{
	    Path f = new Path("/a.txt");
		...
		// Cache cleanup caused by a delete/rename/create operation on /a.txt. Race condition
		fileStatusCacheManager.remove(makeQualified(f));
		...
	}
	...
	public String thread2()
	{
	    // FileStatus fileStatus - For storage filepath /a.txt 
		...
		fileStatusCacheManager.put(fileStatus,5); // Race condition
		...
	}
	{code}
	*** Depending on the order of execution, the FileStatus cache may hold a stale instance for 5 seconds.
As above, synchronized blocks do not solve this either.
	*** This is a corner case and may cause the application to misbehave, depending on its
use case. In such situations the FileStatus cache should be turned off.
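One ordering that produces the stale entry in Scenario 5 can be replayed step by step. The sketch below runs on a single thread so the interleaving is deterministic; StalePutReplay, the in-memory backend map, and the String values are assumptions for illustration and do not come from the patch.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical single-threaded replay of one put/remove interleaving. */
class StalePutReplay {
    /** Replays the interleaving and returns what the cache holds afterwards. */
    static String replay() {
        // Stand-in for ADL storage; the String value stands in for a FileStatus.
        Map<String, String> backend = new LinkedHashMap<>();
        Map<String, String> cache =
            Collections.synchronizedMap(new LinkedHashMap<String, String>());
        backend.put("/a.txt", "status-v1");

        // thread2: fetches the status from the backend, then is descheduled
        // before it can store the result in the cache.
        String fetchedByThread2 = backend.get("/a.txt");

        // thread1: deletes /a.txt on the backend and invalidates the cache.
        backend.remove("/a.txt");
        cache.remove("/a.txt");

        // thread2: resumes and caches the status it fetched earlier.
        cache.put("/a.txt", fetchedByThread2);

        // The file is gone from the backend, but the cache still answers for it
        // until the entry's time to live runs out.
        return cache.get("/a.txt");
    }
}
```

Every map operation in this replay is individually thread-safe, which is why synchronizing the cache calls alone does not remove the window: the stale value was read before the remove and written after it.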

h6. For volatile usage - Totally agree with you. As I mentioned in the earlier comment,
I will remove the volatile usage for those variables.

> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
>                 Key: HADOOP-12666
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12666
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
>         Attachments: HADOOP-12666-002.patch, HADOOP-12666-003.patch, HADOOP-12666-004.patch,
HADOOP-12666-005.patch, HADOOP-12666-006.patch, HADOOP-12666-1.patch
>
>   Original Estimate: 336h
>          Time Spent: 336h
>  Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft Azure Data
Lake Store (ADL) from within Hadoop. This would enable existing Hadoop applications such as
MR, Hive, HBase, etc. to use ADL store as input or output.
>  
> ADL is ultra-high capacity, optimized for massive throughput, with rich management and
security features. More details are available at https://azure.microsoft.com/en-us/services/data-lake-store/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
