apex-dev mailing list archives

From Tushar Gosavi <tus...@datatorrent.com>
Subject Re: Adding ParquetReaderOperator in Malhar
Date Thu, 17 Mar 2016 11:45:08 GMT
Does it make sense to use InputSplit in FileSplitterInput to generate the
file split information, and InputFormat in BlockReader to read the records?
That way we could read the variety of formats already supported by Hadoop
in Apex. Parquet has an InputFormat and InputSplit defined.
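The suggestion above (let the splitter delegate split generation to a pluggable format) can be sketched as a small delegation pattern. This is a minimal, self-contained sketch; `InputSplitLike`, `InputFormatLike`, and `FixedSizeFormat` are illustrative stand-ins, not the actual Hadoop or Apex classes:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-ins for Hadoop's InputSplit/InputFormat; the real
// classes live in org.apache.hadoop.mapreduce and carry far more state.
interface InputSplitLike {
    long getStart();
    long getLength();
}

interface InputFormatLike {
    List<InputSplitLike> getSplits(long fileLength);
}

// A format that cuts a file into fixed-size ranges, roughly the way
// FileInputFormat splits along HDFS block boundaries.
class FixedSizeFormat implements InputFormatLike {
    private final long splitSize;

    FixedSizeFormat(long splitSize) { this.splitSize = splitSize; }

    public List<InputSplitLike> getSplits(long fileLength) {
        List<InputSplitLike> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            final long s = start;
            final long len = Math.min(splitSize, fileLength - start);
            splits.add(new InputSplitLike() {
                public long getStart() { return s; }
                public long getLength() { return len; }
            });
        }
        return splits;
    }
}

// A splitter that delegates split computation to a pluggable format
// instead of hard-coding its own chunking logic, so any format the
// ecosystem already defines (Parquet included) could slot in.
public class FormatDrivenSplitter {
    public static void main(String[] args) {
        InputFormatLike format = new FixedSizeFormat(128L * 1024 * 1024);
        List<InputSplitLike> splits = format.getSplits(300L * 1024 * 1024);
        for (InputSplitLike split : splits) {
            System.out.println(split.getStart() + "+" + split.getLength());
        }
        // A 300 MB file with 128 MB splits yields two full splits and a
        // 44 MB remainder.
    }
}
```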

- Tushar.


On Mon, Mar 14, 2016 at 5:54 PM, Pradeep Dalvi <
pradeep.dalvi@datatorrent.com> wrote:

> +1
>
> On Mon, Mar 14, 2016 at 5:19 PM, Chinmay Kolhatkar <chinmay@apache.org>
> wrote:
>
> > +1.
> >
> > On Mon, Mar 14, 2016 at 2:55 PM, Devendra Tagare <
> > devendrat@datatorrent.com>
> > wrote:
> >
> > > Hi,
> > >
> > > Using parquet.block.size = 128/256 MB on the writer side will ensure
> > > that the column chunks are not split across blocks for a large file.
> > >
> > > The reader can then read the individual row groups iteratively.
> > >
> > > The FileSplitter would then split the files at the given size into
> > > separate chunks that can be handled downstream.
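Writer-side row-group sizing like this is typically set on the Hadoop Configuration before the writer is constructed. A hedged fragment, assuming parquet-mr's `parquet.block.size` property (the exact writer/builder API varies by parquet-mr version):

```java
import org.apache.hadoop.conf.Configuration;

// Sketch only: ask the Parquet writer for 256 MB row groups so each
// HDFS block (at a matching block size) holds whole row groups.
Configuration conf = new Configuration();
conf.setLong("parquet.block.size", 256L * 1024 * 1024);
```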
> > >
> > > Dev
> > >
> > > On Mon, Mar 14, 2016 at 12:31 PM, Shubham Pathak <
> > shubham@datatorrent.com>
> > > wrote:
> > >
> > > > @Tushar,
> > > >
> > > > A parquet file looks like this:
> > > >
> > > > 4-byte magic number "PAR1"
> > > > <Column 1 Chunk 1 + Column Metadata>
> > > > <Column 2 Chunk 1 + Column Metadata>
> > > > ...
> > > > <Column N Chunk 1 + Column Metadata>
> > > > <Column 1 Chunk 2 + Column Metadata>
> > > > <Column 2 Chunk 2 + Column Metadata>
> > > > ...
> > > > <Column N Chunk 2 + Column Metadata>
> > > > ...
> > > > <Column 1 Chunk M + Column Metadata>
> > > > <Column 2 Chunk M + Column Metadata>
> > > > ...
> > > > <Column N Chunk M + Column Metadata>
> > > > File Metadata
> > > > 4-byte length in bytes of file metadata
> > > > 4-byte magic number "PAR1"
> > > >
> > > > Parquet being a binary columnar storage format, readers are expected
> > > > to first read the file metadata to find all the column chunks they
> > > > are interested in. The column chunks should then be read sequentially.
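The read-the-footer-first behavior described above follows from the trailer layout in the listing: the last 8 bytes are a 4-byte little-endian metadata length followed by the magic, so a reader seeks to the end before touching any column chunk. A self-contained toy demonstration of locating the metadata this way (the byte layout is simplified; only the trailer arithmetic matches a real Parquet file):

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class FooterDemo {
    static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    // Build a toy byte layout with the same trailer as a Parquet file:
    // "PAR1", data, metadata, 4-byte little-endian metadata length, "PAR1".
    static byte[] toyFile(byte[] data, byte[] metadata) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(MAGIC, 0, MAGIC.length);
        out.write(data, 0, data.length);
        out.write(metadata, 0, metadata.length);
        byte[] len = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN)
            .putInt(metadata.length).array();
        out.write(len, 0, len.length);
        out.write(MAGIC, 0, MAGIC.length);
        return out.toByteArray();
    }

    // Find where the file metadata starts by reading from the end, the
    // way a reader seeks the footer before reading any column chunk.
    static int metadataOffset(byte[] file) {
        int metadataLength = ByteBuffer.wrap(file, file.length - 8, 4)
            .order(ByteOrder.LITTLE_ENDIAN).getInt();
        return file.length - 8 - metadataLength;
    }

    public static void main(String[] args) {
        byte[] file = toyFile(new byte[100], new byte[] {1, 2, 3, 4, 5});
        // 4 + 100 + 5 + 4 + 4 = 117 bytes; metadata begins at offset 104.
        System.out.println("metadata starts at " + metadataOffset(file));
    }
}
```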
> > > >
> > > >
> > > >
> > > > On Mon, Mar 14, 2016 at 11:44 AM, Yogi Devendra <
> > yogidevendra@apache.org
> > > >
> > > > wrote:
> > > >
> > > > > +1 for Parquet reader.
> > > > >
> > > > > ~ Yogi
> > > > >
> > > > > On 14 March 2016 at 11:41, Yogi Devendra <yogidevendra@apache.org>
> > > > wrote:
> > > > >
> > > > > > Shubham,
> > > > > >
> > > > > > I feel that instead of having an operator, it should be a plugin
> > > > > > to the input operator.
> > > > > >
> > > > > > That way, if someone has some other input operator for a
> > > > > > particular file system (extending AbstractFileInputOperator), he
> > > > > > should be able to read Parquet files from that file system using
> > > > > > this plugin.
> > > > > >
> > > > > > ~ Yogi
> > > > > >
> > > > > > On 14 March 2016 at 11:31, Tushar Gosavi <tushar@datatorrent.com
> >
> > > > wrote:
> > > > > >
> > > > > >> +1
> > > > > >>
> > > > > >> Does Parquet support partitioned reads from a single file? If
> > > > > >> yes, then maybe we can also add support in FileSplitterInput
> > > > > >> and BlockReader to read a single file in parallel.
> > > > > >>
> > > > > >> - Tushar.
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On Mon, Mar 14, 2016 at 11:23 AM, Devendra Tagare <
> > > > > >> devendrat@datatorrent.com
> > > > > >> > wrote:
> > > > > >>
> > > > > >> > + 1
> > > > > >> >
> > > > > >> > ~Dev
> > > > > >> >
> > > > > >> > On Mon, Mar 14, 2016 at 11:12 AM, Shubham Pathak <
> > > > > >> shubham@datatorrent.com>
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> > > Hello Community,
> > > > > >> > >
> > > > > >> > > I am working on developing a ParquetReaderOperator which
> > > > > >> > > will allow Apex users to read Parquet files.
> > > > > >> > >
> > > > > >> > > Apache Parquet is a columnar storage format available to
> > > > > >> > > any project in the Hadoop ecosystem, regardless of the
> > > > > >> > > choice of data processing framework, data model or
> > > > > >> > > programming language.
> > > > > >> > > For more information: Apache Parquet
> > > > > >> > > <https://parquet.apache.org/documentation/latest/>
> > > > > >> > >
> > > > > >> > > Proposed design :
> > > > > >> > >
> > > > > >> > >    1. Develop AbstractParquetFileReaderOperator that extends
> > > > > >> > >    from AbstractFileInputOperator.
> > > > > >> > >    2. Override openFile() method to instantiate a
> > > > > >> > >    ParquetReader ( reader provided by the parquet-mr
> > > > > >> > >    <https://github.com/Parquet/parquet-mr> project that
> > > > > >> > >    reads Parquet records from a file ) with GroupReadSupport
> > > > > >> > >    ( records would be read as Group ).
> > > > > >> > >    3. Override readEntity() method to read the records and
> > > > > >> > >    call convertGroup() method. Derived classes to override
> > > > > >> > >    convertGroup() method to convert Group to any form
> > > > > >> > >    required by downstream operators.
> > > > > >> > >    4. Provide a concrete implementation,
> > > > > >> > >    ParquetFilePOJOReader operator that extends from
> > > > > >> > >    AbstractParquetFileReaderOperator and overrides
> > > > > >> > >    convertGroup() method to convert a given Group to a POJO.
> > > > > >> > > Parquet schema and directory path would be inputs to the
> > > > > >> > > base operator. For ParquetFilePOJOReader, the POJO class
> > > > > >> > > would also be required.
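The abstract/concrete split proposed for readEntity() and convertGroup() can be shown in a runnable sketch. This is a hedged sketch of the class shape only: the real base class would extend Malhar's AbstractFileInputOperator and read parquet-mr Group records, while here a plain Map stands in for Group so the pattern runs on its own:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed base operator. A Map<String, Object> stands in
// for parquet-mr's Group; the operator lifecycle is omitted.
abstract class AbstractParquetLikeReader<T> {
    // Mirrors readEntity(): fetch one record, then convert it.
    T readEntity(Map<String, Object> group) {
        return convertGroup(group);
    }

    // Derived classes decide the output form (step 3 of the proposal).
    abstract T convertGroup(Map<String, Object> group);
}

public class PojoReaderSketch {
    static class User {
        String name;
        int age;
    }

    // Mirrors ParquetFilePOJOReader: convert a record to a POJO
    // (step 4 of the proposal).
    static class UserReader extends AbstractParquetLikeReader<User> {
        User convertGroup(Map<String, Object> group) {
            User u = new User();
            u.name = (String) group.get("name");
            u.age = (Integer) group.get("age");
            return u;
        }
    }

    public static void main(String[] args) {
        Map<String, Object> group = new HashMap<>();
        group.put("name", "shubham");
        group.put("age", 30);
        User u = new UserReader().readEntity(group);
        System.out.println(u.name + " " + u.age);
    }
}
```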
> > > > > >> > >
> > > > > >> > > Please feel free to let me know your thoughts on this.
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Shubham
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Pradeep A. Dalvi
>
> Software Engineer
> DataTorrent (India)
>
