apex-dev mailing list archives

From Priyanka Gugale <priya...@datatorrent.com>
Subject Re: HDFS File Reader Module
Date Wed, 17 Feb 2016 17:21:08 GMT
We need partitions for a parallel read, but how would a reader partition
know which offset of the file it should read from? Normally FileSplitter
creates this metadata (let's call each piece a reader task) and forwards it
to the next operator, which is the block reader. A block reader receives one
of the tasks and reads from the specified offset in the file. If
FileSplitter is absent, one reader partition has to consume a file entirely,
which means we can't read a single file in parallel. I hope this answers
your question.
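
To make the reader task idea concrete, here is a minimal sketch in plain
Java. It does not use the actual Malhar classes: ReaderTaskSketch,
ReaderTask, and split are hypothetical names, and the fixed block size is an
assumption made only for illustration. It shows how a splitter could turn
one file's length into (offset, length) tasks that independent readers can
pick up.

```java
import java.util.ArrayList;
import java.util.List;

public class ReaderTaskSketch {

    /** Hypothetical reader task: the byte range of a file that one reader handles. */
    public static final class ReaderTask {
        public final String filePath;
        public final long offset;
        public final long length;

        public ReaderTask(String filePath, long offset, long length) {
            this.filePath = filePath;
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return filePath + " [offset=" + offset + ", length=" + length + "]";
        }
    }

    /**
     * Cuts a file of fileLength bytes into tasks of at most blockSize bytes.
     * Only metadata is produced here; no file is opened or read.
     */
    public static List<ReaderTask> split(String filePath, long fileLength, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<ReaderTask> tasks = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // The last task may be shorter than blockSize.
            long length = Math.min(blockSize, fileLength - offset);
            tasks.add(new ReaderTask(filePath, offset, length));
        }
        return tasks;
    }

    public static void main(String[] args) {
        // Each printed task could be handed to a different reader partition.
        for (ReaderTask task : split("/data/big.log", 250L, 100L)) {
            System.out.println(task);
        }
    }
}
```

Without a step like split, a reader partition has no offset to start from,
so the smallest unit of work it can take is a whole file.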

The advantage of this module is that it gives us a reusable component made
up of operators that are frequently used together for file reading.

-Priyanka

On Wed, Feb 17, 2016 at 11:31 AM, Yogi Devendra <yogidevendra@apache.org>
wrote:

> Let me rephrase Ram's question to make it clear:
>
> For an application developer using Malhar:
> What are the advantages / disadvantages of using the proposed HDFS File
> input Module as compared to directly using FileSplitter, BlockReader
> Operators available in Malhar?
>
> ~ Yogi
>
> On 16 February 2016 at 21:56, Munagala Ramanath <ram@datatorrent.com>
> wrote:
>
> > Can parallel read not be achieved by partitioning?
> >
> > Ram
> >
> > On Tue, Feb 16, 2016 at 1:01 AM, Priyanka Gugale
> > <priyanka@datatorrent.com> wrote:
> >
> > > Hi,
> > >
> > > It is a common use case to read big files on HDFS in parallel, i.e.
> > > many reader threads are used to read the file at the same time. We
> > > can achieve this on top of Apex using the following Malhar operators:
> > >
> > > 1. AbstractFileSplitter
> > > 2. AbstractBlockReader
> > >
> > > where FileSplitter, as per file metadata, creates small reader tasks
> > > (to read the file in parts). Those reader tasks are run by
> > > BlockReaders in parallel to read the file.
> > >
> > > As these operators are generally used together to read files, I
> > > propose we create a module called HDFSFileReader for this.
> > >
> > > Please share your suggestions on this.
> > >
> > > -Priyanka
> > >
> >
>
