apex-dev mailing list archives

From Priyanka Gugale <priya...@datatorrent.com>
Subject Re: HDFS File Reader Module
Date Tue, 23 Feb 2016 13:15:28 GMT
I haven't created a branch yet; I will share it with you as soon as I add
the code for the module.
I would certainly be happy to help :)

-Priyanka

On Tue, Feb 23, 2016 at 6:26 PM, Yogi Devendra <yogidevendra@apache.org>
wrote:

> Priyanka,
>
> Thanks for the update. I will consider these ports during the design phase
> of my proposal for HDFS file copy module.
>
> I believe you are planning to add this to Apex Malhar. Please post any link
> / private branch (if any) where I can have a look at the first cut.
>
> I will ask for your help if I come across any questions, uncertainties etc.
>
> ~ Yogi
>
> On 23 February 2016 at 17:59, Priyanka Gugale <priyanka@datatorrent.com>
> wrote:
>
> > I am planning to have the following ports on this module:
> >
> > Ports
> > Input port: None
> >
> > Output ports:
> >
> >    1. FileMetadata
> >    2. BlockMetadata
> >    3. Block bytes
> >
> > -Priyanka
> >
> > On Tue, Feb 23, 2016 at 2:16 PM, Yogi Devendra <yogidevendra@apache.org>
> > wrote:
> >
> > > Priyanka,
> > >
> > > Can you please share details about what would be the output ports from
> > > this module?
> > >
> > > I am thinking of an HDFS File Copy Module that can be used in conjunction
> > > with this module to copy files from HDFS to HDFS.
> > >
> > > ~ Yogi
> > >
> > > On 18 February 2016 at 10:29, Mohit Jotwani <mohit@datatorrent.com>
> > > wrote:
> > >
> > > > +1 to add this.
> > > >
> > > > Regards,
> > > > Mohit
> > > > On 17 Feb 2016 23:30, "Pramod Immaneni" <pramod@datatorrent.com>
> > > > wrote:
> > > >
> > > > > +1 to add this module
> > > > >
> > > > > On Wed, Feb 17, 2016 at 9:21 AM, Priyanka Gugale
> > > > > <priyanka@datatorrent.com> wrote:
> > > > >
> > > > > > We need partitions for parallel reads, but how would a reader
> > > > > > partition know which offset of the file to read from? Normally
> > > > > > FileSplitter creates this metadata (let's call each piece a
> > > > > > reader task) and forwards the tasks to the next operator, the
> > > > > > block reader. The block reader receives one of the tasks and
> > > > > > reads from the specified offset in the file. If FileSplitter is
> > > > > > absent, one reader partition has to consume a file entirely,
> > > > > > which means we can't read a single file in parallel. I hope this
> > > > > > answers your question.
> > > > > >
> > > > > > The advantage of this module is that it gives us a reusable
> > > > > > component made up of operators that are frequently used together
> > > > > > for file reading.
> > > > > >
> > > > > > -Priyanka
> > > > > >
> > > > > > On Wed, Feb 17, 2016 at 11:31 AM, Yogi Devendra
> > > > > > <yogidevendra@apache.org> wrote:
> > > > > >
> > > > > > > Let me rephrase Ram's question to make it clear:
> > > > > > >
> > > > > > > For an application developer using Malhar:
> > > > > > > What are the advantages / disadvantages of using the proposed
> > > > > > > HDFS File input Module as compared to directly using the
> > > > > > > FileSplitter, BlockReader Operators available in Malhar?
> > > > > > >
> > > > > > > ~ Yogi
> > > > > > >
> > > > > > > On 16 February 2016 at 21:56, Munagala Ramanath
> > > > > > > <ram@datatorrent.com> wrote:
> > > > > > >
> > > > > > > > Can parallel read not be achieved by partitioning?
> > > > > > > >
> > > > > > > > Ram
> > > > > > > >
> > > > > > > > On Tue, Feb 16, 2016 at 1:01 AM, Priyanka Gugale
> > > > > > > > <priyanka@datatorrent.com> wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > It is a common use case to read big files on HDFS in a
> > > > > > > > > parallel fashion, i.e. many reader threads are used to
> > > > > > > > > read the file in parallel. We can achieve this on top of
> > > > > > > > > Apex using the following Malhar operators:
> > > > > > > > >
> > > > > > > > > 1. AbstractFileSplitter
> > > > > > > > > 2. AbstractBlockReader
> > > > > > > > >
> > > > > > > > > where FileSplitter, as per file metadata, creates small
> > > > > > > > > reader tasks (to read the file in parts). Those reader
> > > > > > > > > tasks are run by BlockReaders in parallel to read the
> > > > > > > > > file.
> > > > > > > > >
> > > > > > > > > As these operators are generally used together to read
> > > > > > > > > files, I propose we create a module called HDFSFileReader
> > > > > > > > > for this.
> > > > > > > > >
> > > > > > > > > Please provide your suggestions on the same.
> > > > > > > > >
> > > > > > > > > -Priyanka
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
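
The division of work described in the quoted thread (a splitter that turns
file metadata into per-block "reader tasks", and block readers that each read
from a given offset) can be sketched roughly as below. This is only an
illustration in plain Java and does not use the Malhar operators themselves:
the names BlockSplitSketch, ReaderTask and splitIntoTasks, the example path,
and the fixed block size are all assumptions made for the example, not part
of AbstractFileSplitter or AbstractBlockReader.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSplitSketch {

    /** Hypothetical stand-in for the "reader task" metadata mentioned in the thread. */
    static final class ReaderTask {
        final String filePath;
        final long offset; // byte offset at which a reader should start
        final long length; // number of bytes this reader is responsible for

        ReaderTask(String filePath, long offset, long length) {
            this.filePath = filePath;
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Splits a file of the given length into consecutive tasks of at most
     * blockSize bytes, so that independent readers can each take one task.
     */
    static List<ReaderTask> splitIntoTasks(String filePath, long fileLength, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<ReaderTask> tasks = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // The last task may be shorter than blockSize.
            long length = Math.min(blockSize, fileLength - offset);
            tasks.add(new ReaderTask(filePath, offset, length));
        }
        return tasks;
    }

    public static void main(String[] args) {
        // Assumed example values: a 300-byte file read in 128-byte blocks.
        for (ReaderTask t : splitIntoTasks("/data/big.log", 300L, 128L)) {
            System.out.println(t.filePath + " offset=" + t.offset + " length=" + t.length);
        }
    }
}
```

With the assumed values in main, a 300-byte file and 128-byte blocks give
three tasks at offsets 0, 128 and 256, the last one 44 bytes long. Without
such a split, a single reader would have to consume the whole file, which is
the limitation raised in the thread.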
