apex-dev mailing list archives

From Mohit Jotwani <mo...@datatorrent.com>
Subject Re: Adding ParquetReaderOperator in Malhar
Date Mon, 14 Mar 2016 08:27:53 GMT
+1

Regards,
Mohit

On Mon, Mar 14, 2016 at 12:31 PM, Shubham Pathak <shubham@datatorrent.com>
wrote:

> @Tushar,
>
> A parquet file looks like this:
>
> 4-byte magic number "PAR1"
> <Column 1 Chunk 1 + Column Metadata>
> <Column 2 Chunk 1 + Column Metadata>
> ...
> <Column N Chunk 1 + Column Metadata>
> <Column 1 Chunk 2 + Column Metadata>
> <Column 2 Chunk 2 + Column Metadata>
> ...
> <Column N Chunk 2 + Column Metadata>
> ...
> <Column 1 Chunk M + Column Metadata>
> <Column 2 Chunk M + Column Metadata>
> ...
> <Column N Chunk M + Column Metadata>
> File Metadata
> 4-byte length in bytes of file metadata
> 4-byte magic number "PAR1"
>
> Parquet being a binary columnar storage format, readers are expected
> to first read the file metadata to find all the column chunks they are
> interested in. The column chunks should then be read sequentially.
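
The layout quoted above implies that a reader starts from the end of the file: check the trailing "PAR1", read the 4-byte metadata length stored just before it (little-endian per the Parquet format specification), and seek back by that many bytes. What follows is a minimal sketch of that framing logic only, using plain java.nio rather than parquet-mr. The class name ParquetFooterLocator and the helper fakeFile() are hypothetical, and the payload bytes are zero-filled placeholders, not real column chunks or Thrift metadata.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: finds where the file metadata sits in an in-memory Parquet file. */
final class ParquetFooterLocator {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /** Returns {metadataOffset, metadataLength} for the given file bytes. */
    static long[] locateFooter(byte[] file) {
        int n = file.length;
        // Smallest possible framing: leading magic + length field + trailing magic.
        if (n < 12) {
            throw new IllegalArgumentException("too short to be a Parquet file");
        }
        if (!Arrays.equals(Arrays.copyOfRange(file, 0, 4), MAGIC)
                || !Arrays.equals(Arrays.copyOfRange(file, n - 4, n), MAGIC)) {
            throw new IllegalArgumentException("missing PAR1 magic number");
        }
        // The 4 bytes before the trailing magic hold the metadata length.
        int length = ByteBuffer.wrap(file, n - 8, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
        long offset = (long) n - 8 - length;
        if (length < 0 || offset < 4) {
            throw new IllegalArgumentException("corrupt metadata length: " + length);
        }
        return new long[] {offset, length};
    }

    /** Builds a file with the same framing; all payload bytes are zero-filled placeholders. */
    static byte[] fakeFile(int columnChunkBytes, int metadataBytes) {
        ByteBuffer buf = ByteBuffer.allocate(4 + columnChunkBytes + metadataBytes + 8)
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.put(MAGIC);
        buf.put(new byte[columnChunkBytes]); // stands in for the column chunks
        buf.put(new byte[metadataBytes]);    // stands in for the file metadata
        buf.putInt(metadataBytes);
        buf.put(MAGIC);
        return buf.array();
    }

    public static void main(String[] args) {
        long[] range = locateFooter(fakeFile(100, 20));
        System.out.println("metadata offset=" + range[0] + ", length=" + range[1]);
    }
}
```

In a real reader the bytes in that range would then be decoded as file metadata to find the column chunks of interest; parquet-mr already does this internally, so the sketch is only meant to make the layout concrete.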
>
>
>
> On Mon, Mar 14, 2016 at 11:44 AM, Yogi Devendra <yogidevendra@apache.org>
> wrote:
>
> > +1 for Parquet reader.
> >
> > ~ Yogi
> >
> > On 14 March 2016 at 11:41, Yogi Devendra <yogidevendra@apache.org>
> > wrote:
> >
> > > Shubham,
> > >
> > > I feel that, instead of being a separate operator, it should be a
> > > plugin to the input operator.
> > >
> > > That way, if someone has some other input operator for a particular
> > > file system (extending AbstractFileInputOperator), they should be able
> > > to read Parquet files from that file system using this plugin.
> > >
> > > ~ Yogi
> > >
> > > On 14 March 2016 at 11:31, Tushar Gosavi <tushar@datatorrent.com>
> > > wrote:
> > >
> > >> +1
> > >>
> > >> Does Parquet support partitioned reads from a single file? If yes, then
> > >> maybe we can also add support in FileSplitterInput and BlockReader to
> > >> read a single file in parallel.
> > >>
> > >> - Tushar.
> > >>
> > >>
> > >>
> > >> On Mon, Mar 14, 2016 at 11:23 AM, Devendra Tagare
> > >> <devendrat@datatorrent.com> wrote:
> > >>
> > >> > + 1
> > >> >
> > >> > ~Dev
> > >> >
> > >> > On Mon, Mar 14, 2016 at 11:12 AM, Shubham Pathak
> > >> > <shubham@datatorrent.com> wrote:
> > >> >
> > >> > > Hello Community,
> > >> > >
> > >> > > I am developing a ParquetReaderOperator which will allow Apex
> > >> > > users to read Parquet files.
> > >> > >
> > >> > > Apache Parquet is a columnar storage format available to any project
> > >> > > in the Hadoop ecosystem, regardless of the choice of data processing
> > >> > > framework, data model or programming language.
> > >> > > For more information: Apache Parquet
> > >> > > <https://parquet.apache.org/documentation/latest/>
> > >> > >
> > >> > > Proposed design:
> > >> > >
> > >> > >    1. Develop AbstractParquetFileReaderOperator, which extends
> > >> > >    AbstractFileInputOperator.
> > >> > >    2. Override the openFile() method to instantiate a ParquetReader
> > >> > >    (the reader provided by the parquet-mr
> > >> > >    <https://github.com/Parquet/parquet-mr> project that reads Parquet
> > >> > >    records from a file) with GroupReadSupport (records would be read
> > >> > >    as Group).
> > >> > >    3. Override the readEntity() method to read the records and call
> > >> > >    the convertGroup() method. Derived classes override convertGroup()
> > >> > >    to convert a Group to any form required by downstream operators.
> > >> > >    4. Provide a concrete implementation, the ParquetFilePOJOReader
> > >> > >    operator, which extends AbstractParquetFileReaderOperator and
> > >> > >    overrides convertGroup() to convert a given Group to a POJO.
> > >> > >
> > >> > > The Parquet schema and directory path would be inputs to the base
> > >> > > operator. For ParquetFilePOJOReader, the POJO class would also be
> > >> > > required.
> > >> > >
> > >> > > Please feel free to let me know your thoughts on this.
> > >> > >
> > >> > > Thanks,
> > >> > > Shubham
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
