reef-dev mailing list archives

From "Julia (JIRA)" <>
Subject [jira] [Commented] (REEF-1143) Adding API to allow deserialize data from remote files directly
Date Thu, 11 Feb 2016 02:09:18 GMT


Julia commented on REEF-1143:

The proposed wrapper seems neat; however, there are a few issues:
1. Open() in the REEF IFileSystem is not implemented for all file systems, e.g. HadoopFileSystem.
2. This approach decouples open and close. When we pass all the streams to the wrapper Stream,
all of them have to be open already, and they are then closed one by one during the iteration.
Usually we open one stream, read its data, then close it, which is cleaner and safer.
3. In our application, we need more information from the original file, such as its extents
and the length of each extent; this information determines how to read data from the file. If
all we get is a Stream, the Stream interface provides no API for getting this information.
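
As a rough illustration of point 2, the sketch below opens each file only when the iteration reaches it and disposes the stream before the next one is opened, so at most one stream is open at a time. The PerFileReader class and the open delegate are hypothetical names used only for this example; the delegate stands in for whatever call the file system offers and is not REEF's actual API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class PerFileReader
{
    // Hypothetical helper: 'open' stands in for the file system's open call.
    // Each path is opened only when the iteration reaches it, read fully, and
    // its stream is disposed before the next path is opened.
    public static IEnumerable<string> ReadEach(
        IEnumerable<string> paths, Func<string, Stream> open)
    {
        foreach (var path in paths)
        {
            using (var stream = open(path))
            using (var reader = new StreamReader(stream, Encoding.UTF8))
            {
                yield return reader.ReadToEnd();
            }
        }
    }
}
```

Because open and close sit in the same using block, a failure while reading one file cannot leave the streams of the remaining files open.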

> Adding API to allow deserialize data from remote files directly
> ---------------------------------------------------------------
>                 Key: REEF-1143
>                 URL:
>             Project: REEF
>          Issue Type: New Feature
>          Components: REEF-IO
>            Reporter: Julia
>            Assignee: Julia
> Currently, Deserialize(string fileFolder) in IFileDeSerializer is used to deserialize local
> files in a given local folder. For a set of remote files, FileSystemInputPartition first
> downloads the remote files to a local folder, then passes that folder to the
> Deserialize(string fileFolder) method.
> For remote files, especially when the files are huge, we need to read the file data chunk by
> chunk and consume it as we go instead of downloading the entire file at once. The remote file
> paths are provided as a set, and the folder holding them is controlled by the caller and may
> contain other files, so we cannot simply use the folder name; we need the individual remote
> file names instead. Therefore the new API for remote file deserialization would be
>  T Deserialize(ISet<string> filePaths);
> This would result in two methods in IFileDeSerializer<T>:
>  T Deserialize(string fileFolder);  -- for local file
>  T Deserialize(ISet<string> filePaths); 
> In fact, how the interface is implemented is up to the implementer. For the second API, we
> can use the FileSystem injected into the Deserializer to determine whether it accesses local
> files or remote files.
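
A minimal sketch of how the two overloads in the description might sit side by side, assuming a line-oriented deserializer. The interface is redeclared locally with the two signatures quoted above; LineDeserializer and the open delegate (standing in for the injected file system) are hypothetical and are not taken from the REEF code base.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Local redeclaration of the two overloads from the description, for illustration only.
public interface IFileDeSerializer<T>
{
    T Deserialize(string fileFolder);       // all files in a local folder
    T Deserialize(ISet<string> filePaths);  // individual (possibly remote) file paths
}

// Hypothetical implementation that yields lines lazily instead of
// downloading every file up front.
public sealed class LineDeserializer : IFileDeSerializer<IEnumerable<string>>
{
    // Hypothetical stand-in for the injected file system's open call.
    private readonly Func<string, Stream> _open;

    public LineDeserializer(Func<string, Stream> open)
    {
        _open = open;
    }

    public IEnumerable<string> Deserialize(string fileFolder)
    {
        // Treat every file in the local folder as an input path.
        return Deserialize(new HashSet<string>(Directory.GetFiles(fileFolder)));
    }

    public IEnumerable<string> Deserialize(ISet<string> filePaths)
    {
        foreach (var path in filePaths)
        {
            // StreamReader disposes the underlying stream when it is disposed.
            using (var reader = new StreamReader(_open(path)))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    yield return line; // handed to the caller as soon as it is read
                }
            }
        }
    }
}
```

In this sketch the delegate passed to the constructor decides whether a path is opened locally or remotely, so the caller only chooses which overload to use.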
