apex-dev mailing list archives

From Sandesh Hegde <sand...@datatorrent.com>
Subject Re: HDFS File Reader Module
Date Wed, 02 Mar 2016 18:50:10 GMT
My vote is to have a separate namespace for modules.

Is it time to introduce
org.apache.apex.module.io.fs?

On Wed, Mar 2, 2016 at 3:25 AM Priyanka Gugale <priyanka@datatorrent.com>
wrote:

> I am planning to put this module in the malhar-library project, in
> package: com.datatorrent.lib.io.fs
> Let me know if this is acceptable.
>
> -Priyanka
>
> On Tue, Feb 23, 2016 at 6:45 PM, Priyanka Gugale <priyanka@datatorrent.com>
> wrote:
>
> > I haven't created a branch yet; I will share it with you as soon as I
> > add the code for the module.
> > I would surely be happy to help :)
> >
> > -Priyanka
> >
> > On Tue, Feb 23, 2016 at 6:26 PM, Yogi Devendra <yogidevendra@apache.org>
> > wrote:
> >
> >> Priyanka,
> >>
> >> Thanks for the update. I will consider these ports during the design phase
> >> of my proposal for HDFS file copy module.
> >>
> >> I believe you are planning to add this to Apex Malhar. Please post any link
> >> / private branch (if any) where I can have a look at the first cut.
> >>
> >> I will ask for your help if I come across any questions, uncertainties etc.
> >>
> >> ~ Yogi
> >>
> >> On 23 February 2016 at 17:59, Priyanka Gugale <priyanka@datatorrent.com>
> >> wrote:
> >>
> >> > I am planning to have the following ports on this module:
> >> >
> >> > Ports
> >> > Input port: None
> >> >
> >> > Output ports:
> >> >
> >> >    1. FileMetadata
> >> >    2. BlockMetadata
> >> >    3. Block bytes
> >> >
> >> > -Priyanka
> >> >
> >> > On Tue, Feb 23, 2016 at 2:16 PM, Yogi Devendra <yogidevendra@apache.org>
> >> > wrote:
> >> >
> >> > > Priyanka,
> >> > >
> >> > > Can you please share details about what would be the output ports from
> >> > > this module?
> >> > >
> >> > > I am thinking of HDFS File Copy Module which can be used in conjunction
> >> > > with this module to copy files from HDFS to HDFS.
> >> > >
> >> > > ~ Yogi
> >> > >
> >> > > On 18 February 2016 at 10:29, Mohit Jotwani <mohit@datatorrent.com>
> >> > > wrote:
> >> > >
> >> > > > +1 to add this.
> >> > > >
> >> > > > Regards,
> >> > > > Mohit
> >> > > > On 17 Feb 2016 23:30, "Pramod Immaneni" <pramod@datatorrent.com>
> >> > > > wrote:
> >> > > >
> >> > > > > +1 to add this module
> >> > > > >
> >> > > > > On Wed, Feb 17, 2016 at 9:21 AM, Priyanka Gugale <priyanka@datatorrent.com>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > We need partitions for parallel reads, but how would a reader
> >> > > > > > partition know which offset of the file it should read from?
> >> > > > > > Normally FileSplitter creates this metadata (let's call each
> >> > > > > > piece a reader task) and forwards it to the next operator, the
> >> > > > > > block reader. Each block reader receives one of the tasks and
> >> > > > > > reads from the specified offset in the file. If FileSplitter is
> >> > > > > > absent, one reader partition has to consume one file entirely,
> >> > > > > > which means we can't have parallel reading over one file. I hope
> >> > > > > > this answers your question.
> >> > > > > >
> >> > > > > > The advantage of this module is having a reusable component made
> >> > > > > > up of operators that are frequently used together to read files.
> >> > > > > >
> >> > > > > > -Priyanka
> >> > > > > >
> >> > > > > > On Wed, Feb 17, 2016 at 11:31 AM, Yogi Devendra <yogidevendra@apache.org>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Let me rephrase Ram's question to make it clear:
> >> > > > > > >
> >> > > > > > > For an application developer using Malhar:
> >> > > > > > > What are the advantages / disadvantages of using the proposed
> >> > > > > > > HDFS File input Module as compared to directly using
> >> > > > > > > FileSplitter, BlockReader Operators available in Malhar?
> >> > > > > > >
> >> > > > > > > ~ Yogi
> >> > > > > > >
> >> > > > > > > On 16 February 2016 at 21:56, Munagala Ramanath <ram@datatorrent.com>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Can parallel read not be achieved by partitioning?
> >> > > > > > > >
> >> > > > > > > > Ram
> >> > > > > > > >
> >> > > > > > > > On Tue, Feb 16, 2016 at 1:01 AM, Priyanka Gugale <priyanka@datatorrent.com>
> >> > > > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > Hi,
> >> > > > > > > > >
> >> > > > > > > > > It is a common use case to read big files on HDFS in
> >> > > > > > > > > parallel, i.e. many reader threads are used to read the
> >> > > > > > > > > file at the same time. We can achieve this on top of Apex
> >> > > > > > > > > using the following Malhar operators:
> >> > > > > > > > >
> >> > > > > > > > > 1. AbstractFileSplitter
> >> > > > > > > > > 2. AbstractBlockReader
> >> > > > > > > > >
> >> > > > > > > > > Here FileSplitter, as per the file metadata, creates small
> >> > > > > > > > > reader tasks (to read the file in parts). Those reader
> >> > > > > > > > > tasks are run by BlockReaders in parallel to read the file.
> >> > > > > > > > >
> >> > > > > > > > > As these operators are generally used together to read
> >> > > > > > > > > files, I propose we create a module called HDFSFileReader
> >> > > > > > > > > for this.
> >> > > > > > > > >
> >> > > > > > > > > Please provide your suggestions on the same.
> >> > > > > > > > >
> >> > > > > > > > > -Priyanka
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
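
The splitter / block-reader division of labor described in the thread can be sketched in plain Java. This is only an illustration, not Malhar code: `BlockSplitSketch`, `BlockTask`, `split`, `readBlock` and `readInParallel` are hypothetical names, a local file stands in for HDFS, and a thread pool stands in for partitioned reader operators. The point it tries to show is that each reader can work independently once it is handed an offset and a length.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BlockSplitSketch {

    /** Hypothetical "reader task": the byte range of one file that one reader handles. */
    static final class BlockTask {
        final String path;
        final long offset;
        final long length;

        BlockTask(String path, long offset, long length) {
            this.path = path;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Splitter role: turn file metadata (path, size) into block-sized reader tasks. */
    static List<BlockTask> split(String path, long fileSize, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<BlockTask> tasks = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += blockSize) {
            // The last block may be shorter than blockSize.
            tasks.add(new BlockTask(path, offset, Math.min(blockSize, fileSize - offset)));
        }
        return tasks;
    }

    /** Reader role: read exactly the byte range named by one task. */
    static byte[] readBlock(BlockTask task) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(task.path, "r")) {
            byte[] buffer = new byte[(int) task.length];
            file.seek(task.offset);
            file.readFully(buffer);
            return buffer;
        }
    }

    /** Runs all tasks on a pool of readers and reassembles the blocks in offset order. */
    static byte[] readInParallel(String path, long blockSize, int readers) throws Exception {
        List<BlockTask> tasks = split(path, new File(path).length(), blockSize);
        ExecutorService pool = Executors.newFixedThreadPool(readers);
        try {
            List<Future<byte[]>> pending = new ArrayList<>();
            for (BlockTask task : tasks) {
                Callable<byte[]> job = () -> readBlock(task);
                pending.add(pool.submit(job));
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (Future<byte[]> block : pending) {
                out.write(block.get()); // futures are in task order, so offsets stay sorted
            }
            return out.toByteArray();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: BlockSplitSketch <local-file>");
            return;
        }
        byte[] data = readInParallel(args[0], 1024 * 1024, 4);
        System.out.println("read " + data.length + " bytes");
    }
}
```

Without the `split` step, a reader has no offset to start from and must consume a whole file by itself, which is the limitation raised in the thread for partitioning alone.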
