reef-dev mailing list archives

From "Julia (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1143) Adding API to allow deserialize data from remote files directly
Date Fri, 29 Jan 2016 01:42:40 GMT

    [ https://issues.apache.org/jira/browse/REEF-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122708#comment-15122708 ]

Julia commented on REEF-1143:
-----------------------------

In the current implementation, T Deserialize(string fileFolder) is used for local files and
T Deserialize(ISet<string> filePaths) is used for remote files. Shall we reflect this
in the method names? I am a little hesitant, as both are up to the client's implementation.

Another option is to keep only the second API and deprecate the first one, leaving copyToLocal
to the deserializer. But that would make the deserializer implementation complex, because the
copy from remote to local is done through FileSystem. We can easily wrap that copy inside
FileSystemInputPartition instead, as FileSystem is already injected into that class. That way
the deserializer can focus only on the deserialization logic for the given files.
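
A minimal sketch of keeping the copy inside the partition, so that the deserializer only
sees a local folder. Every type name here (ISketchFileSystem, ISketchFolderDeserializer,
SketchInputPartition, LocalDiskFileSystem, TextFolderDeserializer) is hypothetical and made
up for illustration; none of them is the actual REEF interface, and the local-disk file
system only stands in for a remote one.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for the injected file system (assumption, not REEF's interface).
public interface ISketchFileSystem
{
    void CopyToLocal(string remotePath, string localPath);
}

// Hypothetical deserializer contract that only knows about a local folder.
public interface ISketchFolderDeserializer<T>
{
    T Deserialize(string fileFolder);
}

// Hypothetical partition: it owns the remote-to-local copy, not the deserializer.
public sealed class SketchInputPartition<T>
{
    private readonly ISketchFileSystem _fileSystem;
    private readonly ISketchFolderDeserializer<T> _deserializer;
    private readonly ISet<string> _remoteFilePaths;

    public SketchInputPartition(
        ISketchFileSystem fileSystem,
        ISketchFolderDeserializer<T> deserializer,
        ISet<string> remoteFilePaths)
    {
        _fileSystem = fileSystem;
        _deserializer = deserializer;
        _remoteFilePaths = remoteFilePaths;
    }

    public T GetPartitionHandle()
    {
        // Copy every remote file into a fresh local folder first.
        // Files that share a file name would overwrite each other in this sketch.
        string localFolder = Path.Combine(
            Path.GetTempPath(), "sketch-" + Guid.NewGuid().ToString("N"));
        Directory.CreateDirectory(localFolder);
        foreach (string remotePath in _remoteFilePaths)
        {
            string localPath = Path.Combine(localFolder, Path.GetFileName(remotePath));
            _fileSystem.CopyToLocal(remotePath, localPath);
        }

        // The deserializer never touches the file system abstraction.
        return _deserializer.Deserialize(localFolder);
    }
}

// Stand-in "remote" file system backed by the local disk, for trying the sketch out.
public sealed class LocalDiskFileSystem : ISketchFileSystem
{
    public void CopyToLocal(string remotePath, string localPath)
    {
        File.Copy(remotePath, localPath, true);
    }
}

// Example deserializer: returns all lines of all files in the folder, in file-name order.
public sealed class TextFolderDeserializer : ISketchFolderDeserializer<List<string>>
{
    public List<string> Deserialize(string fileFolder)
    {
        string[] files = Directory.GetFiles(fileFolder);
        Array.Sort(files, StringComparer.Ordinal);
        var lines = new List<string>();
        foreach (string file in files)
        {
            lines.AddRange(File.ReadAllLines(file));
        }
        return lines;
    }
}
```

With this split, the deserializer needs no knowledge of where the files came from; the cost
is that every file is fully downloaded before any data is consumed.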

> Adding API to allow deserialize data from remote files directly
> ---------------------------------------------------------------
>
>                 Key: REEF-1143
>                 URL: https://issues.apache.org/jira/browse/REEF-1143
>             Project: REEF
>          Issue Type: New Feature
>          Components: REEF-IO
>            Reporter: Julia
>            Assignee: Julia
>
> Currently, Deserialize(string fileFolder) in IFileDeSerializer is used to deserialize
local files in a given local file folder. For a set of remote files, FileSystemInputPartition
first downloads the remote files to a local folder, then passes the folder to the
Deserialize(string fileFolder) method.
> For remote files, especially when the file size is huge, we would need to read the file data
chunk by chunk and consume the data instead of downloading the entire file at once. Because the
remote file paths are provided as a set, and the folder of the remote files is controlled
by the caller and may contain other files, we cannot simply use the folder name; we need to
use the individual remote file names instead. Therefore the new API for remote file
deserialization would be
>  T Deserialize(ISet<string> filePaths);
>  
> This would result in two methods in IFileDeSerializer<T>:
>  T Deserialize(string fileFolder);  -- for local files
>  T Deserialize(ISet<string> filePaths);
> In fact, the implementation of the interface is up to whoever implements it. For the second
API, we can use the FileSystem injected into the deserializer to determine whether it accesses
local files or remote files.
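
As a rough illustration of the two-method interface described above, the sketch below reads
a set of files lazily, one line at a time, through an injected file system instead of
downloading them first. The names ISketchFileSystem, ISketchFileDeSerializer,
InMemoryFileSystem and LineDeserializer are hypothetical, and the Open(string) method is an
assumption made for this sketch; the real REEF file system interface may differ.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical stand-in for the injected file system (assumption, not REEF's interface).
public interface ISketchFileSystem
{
    Stream Open(string filePath);
}

// The two methods from the issue description, under a hypothetical interface name.
public interface ISketchFileDeSerializer<T>
{
    T Deserialize(string fileFolder);        // local folder
    T Deserialize(ISet<string> filePaths);   // individual file paths
}

// In-memory file system used only to try the sketch out without a remote store.
public sealed class InMemoryFileSystem : ISketchFileSystem
{
    private readonly IDictionary<string, string> _contents;

    public InMemoryFileSystem(IDictionary<string, string> contents)
    {
        _contents = contents;
    }

    public Stream Open(string filePath)
    {
        return new MemoryStream(Encoding.UTF8.GetBytes(_contents[filePath]));
    }
}

// Example deserializer that yields lines lazily instead of loading whole files.
public sealed class LineDeserializer : ISketchFileDeSerializer<IEnumerable<string>>
{
    private readonly ISketchFileSystem _fileSystem;

    public LineDeserializer(ISketchFileSystem fileSystem)
    {
        _fileSystem = fileSystem;
    }

    public IEnumerable<string> Deserialize(string fileFolder)
    {
        // Local folder: open files directly from disk.
        return ReadLines(Directory.GetFiles(fileFolder), File.OpenRead);
    }

    public IEnumerable<string> Deserialize(ISet<string> filePaths)
    {
        // File set: go through the injected file system, which may be local or remote.
        return ReadLines(filePaths, _fileSystem.Open);
    }

    private static IEnumerable<string> ReadLines(
        IEnumerable<string> paths, Func<string, Stream> open)
    {
        foreach (string path in paths)
        {
            // Each stream is opened only when enumeration reaches it.
            using (var reader = new StreamReader(open(path)))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    yield return line;
                }
            }
        }
    }
}
```

Because the result is a lazy sequence, the caller can consume data while it is still being
read, which is the behavior the issue asks for with very large remote files.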



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
