apex-dev mailing list archives

From Yogi Devendra <devendra.vyavahare@gmail.com>
Subject Re: Reading large HDFS files record by record
Date Fri, 29 Apr 2016 13:13:52 GMT
Yes. Reading a single file in parallel will be supported, similar to the
FileSplitter + BlockReader combination.
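
For illustration, the wiring would look roughly like this (a minimal
sketch assuming Malhar's FileSplitterInput and FSSliceReader operators;
the input path and stream names are placeholders):

    import org.apache.hadoop.conf.Configuration;

    import com.datatorrent.api.DAG;
    import com.datatorrent.api.StreamingApplication;
    import com.datatorrent.lib.io.block.FSSliceReader;
    import com.datatorrent.lib.io.fs.FileSplitterInput;

    public class ParallelFileReadApp implements StreamingApplication
    {
      @Override
      public void populateDAG(DAG dag, Configuration conf)
      {
        // FileSplitter scans the input and emits block metadata; a single
        // large file is split into many blocks.
        FileSplitterInput splitter =
            dag.addOperator("FileSplitter", new FileSplitterInput());
        splitter.getScanner().setFiles("/user/input/large-file.dat");

        // Block reader partitions consume blocks independently, which is
        // what lets one file be read in parallel.
        FSSliceReader reader =
            dag.addOperator("BlockReader", new FSSliceReader());

        dag.addStream("blockMetadata",
            splitter.blocksMetadataOutput, reader.blocksMetadataInput);
        // reader.messages would feed a downstream parser or writer here.
      }
    }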

~ Yogi

On 29 April 2016 at 15:58, Sandeep Deshmukh <sandeep@datatorrent.com> wrote:

> +1
>
> Will this support reading a single file in parallel?
> On 29-Apr-2016 3:27 pm, "Mohit Jotwani" <mohit@datatorrent.com> wrote:
>
> > +1
> >
> > Regards,
> > Mohit
> >
> > On Thu, Apr 28, 2016 at 4:29 PM, Yogi Devendra <
> > devendra.vyavahare@gmail.com
> > > wrote:
> >
> > > Hi,
> > >
> > > My use case involves reading from HDFS and emitting each record as a
> > > separate tuple. A record can be either a fixed-length record or a
> > > separator-based record (such as newline-delimited). The expected
> > > output is a byte[] for each record.
> > >
> > > I am planning to solve this as follows:
> > > - A new operator which extends BlockReader.
> > > - It will have a configuration option to select the mode: FIXED_LENGTH
> > > or SEPARATOR_BASED.
> > > - Use the appropriate ReaderContext based on the mode (a sketch
> > > follows at the end of this thread).
> > >
> > > The reason for having a different operator than BlockReader is that
> > > the output port signature differs from BlockReader's. This new
> > > operator can be used in conjunction with FileSplitter.
> > >
> > > Any feedback?
> > >
> > > ~ Yogi
> > >
> >
>
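
As a rough sketch of the proposed operator: the class name, mode enum,
and setters below are hypothetical; the ReaderContext implementations
are Malhar's, and a real version would expose a byte[] output port as
described above rather than reusing the parent's Slice port:

    import org.apache.hadoop.fs.FSDataInputStream;

    import com.datatorrent.api.Context;
    import com.datatorrent.lib.io.block.FSSliceReader;
    import com.datatorrent.lib.io.block.ReaderContext;

    /**
     * Sketch of a record reader that picks its ReaderContext from a
     * configured mode. Because it extends the block reader, it can be
     * fed by FileSplitter and read blocks of one file in parallel.
     */
    public class RecordBlockReader extends FSSliceReader
    {
      public enum Mode
      {
        FIXED_LENGTH,    // every record is exactly recordLength bytes
        SEPARATOR_BASED  // records are delimited, e.g. by newline
      }

      private Mode mode = Mode.SEPARATOR_BASED;
      private int recordLength = 128; // used only in FIXED_LENGTH mode

      @Override
      public void setup(Context.OperatorContext context)
      {
        // Choose the ReaderContext matching the configured mode.
        if (mode == Mode.FIXED_LENGTH) {
          ReaderContext.FixedBytesReaderContext<FSDataInputStream> fixed =
              new ReaderContext.FixedBytesReaderContext<>();
          fixed.setLength(recordLength);
          readerContext = fixed;
        } else {
          // The read-ahead variant handles records that span block
          // boundaries.
          readerContext = new ReaderContext.ReadAheadLineReaderContext<>();
        }
        super.setup(context);
      }

      public void setMode(Mode mode)
      {
        this.mode = mode;
      }

      public void setRecordLength(int recordLength)
      {
        this.recordLength = recordLength;
      }
    }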
