apex-dev mailing list archives

From Pradeep Dalvi <pradeep.da...@datatorrent.com>
Subject Re: Adding ParquetReaderOperator in Malhar
Date Mon, 14 Mar 2016 12:24:30 GMT
+1

On Mon, Mar 14, 2016 at 5:19 PM, Chinmay Kolhatkar <chinmay@apache.org>
wrote:

> +1.
>
> On Mon, Mar 14, 2016 at 2:55 PM, Devendra Tagare <devendrat@datatorrent.com>
> wrote:
>
> > Hi,
> >
> > Setting parquet.block.size to 128 MB or 256 MB on the writer side will
> > ensure that the column chunks are not split across blocks for a large file.
> >
> > The reader can then read the individual row groups iteratively.
> >
> > The FileSplitter would then split the files at the given size into separate
> > chunks that can be handled downstream.
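
To make the fixed-size splitting above concrete, here is a minimal sketch of the arithmetic only. The class and method names are hypothetical and are not the Malhar FileSplitter API; the sketch computes byte ranges and does not inspect Parquet row-group boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Malhar: computes {offset, length} byte ranges.
class BlockRangeSketch {

    /** Splits a file of fileLength bytes into consecutive ranges of at most blockSize bytes. */
    static List<long[]> splitIntoBlocks(long fileLength, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<long[]> blocks = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // The last range may be shorter than blockSize.
            long length = Math.min(blockSize, fileLength - offset);
            blocks.add(new long[] {offset, length});
        }
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed example sizes: a 300 MB file split at 128 MB.
        for (long[] block : splitIntoBlocks(300 * mb, 128 * mb)) {
            System.out.println("offset=" + block[0] + " length=" + block[1]);
        }
    }
}
```

If the writer's row-group size matches the split size, each range computed this way would be expected to cover roughly one row group, which is the point of the setting suggested above.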
> >
> > Dev
> >
> > On Mon, Mar 14, 2016 at 12:31 PM, Shubham Pathak <shubham@datatorrent.com>
> > wrote:
> >
> > > @Tushar,
> > >
> > > A parquet file looks like this:
> > >
> > > 4-byte magic number "PAR1"
> > > <Column 1 Chunk 1 + Column Metadata>
> > > <Column 2 Chunk 1 + Column Metadata>
> > > ...
> > > <Column N Chunk 1 + Column Metadata>
> > > <Column 1 Chunk 2 + Column Metadata>
> > > <Column 2 Chunk 2 + Column Metadata>
> > > ...
> > > <Column N Chunk 2 + Column Metadata>
> > > ...
> > > <Column 1 Chunk M + Column Metadata>
> > > <Column 2 Chunk M + Column Metadata>
> > > ...
> > > <Column N Chunk M + Column Metadata>
> > > File Metadata
> > > 4-byte length in bytes of file metadata
> > > 4-byte magic number "PAR1"
> > >
> > > Parquet being a binary columnar storage format, readers are expected
> > > to first read the file metadata to find all the column chunks they are
> > > interested in. The column chunks should then be read sequentially.
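
The footer-first read implied by this layout can be sketched with the Java standard library alone. This is an illustration under stated assumptions, not parquet-mr code: the class and method names are hypothetical, and the sketch only locates the file metadata (it does not decode it). It relies on the footer length being stored as a 4-byte little-endian integer just before the trailing magic number.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: finds where the file metadata starts and how long it is.
class ParquetFooterLocator {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /** Returns {metadataOffset, metadataLength} for the file at the given path. */
    static long[] locateFooter(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long size = file.length();
            // Smallest possible layout: leading magic + length field + trailing magic.
            if (size < MAGIC.length + 4 + MAGIC.length) {
                throw new IOException("File too small to be Parquet: " + size + " bytes");
            }
            byte[] head = new byte[MAGIC.length];
            file.seek(0);
            file.readFully(head);

            // Last 8 bytes: 4-byte metadata length followed by the magic number.
            byte[] tail = new byte[4 + MAGIC.length];
            file.seek(size - tail.length);
            file.readFully(tail);

            if (!Arrays.equals(head, MAGIC)
                    || !Arrays.equals(Arrays.copyOfRange(tail, 4, 8), MAGIC)) {
                throw new IOException("Missing PAR1 magic number");
            }
            int length = ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
            long offset = size - tail.length - length;
            if (length < 0 || offset < MAGIC.length) {
                throw new IOException("Corrupt metadata length: " + length);
            }
            return new long[] {offset, length};
        }
    }

    public static void main(String[] args) throws IOException {
        long[] footer = locateFooter(args[0]);
        System.out.println("metadata offset=" + footer[0] + " length=" + footer[1]);
    }
}
```

A real reader would go on to decode the metadata at that offset to find the column chunk positions; in practice the parquet-mr reader handles all of this.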
> > >
> > >
> > >
> > > On Mon, Mar 14, 2016 at 11:44 AM, Yogi Devendra <yogidevendra@apache.org>
> > > wrote:
> > >
> > > > +1 for Parquet reader.
> > > >
> > > > ~ Yogi
> > > >
> > > > On 14 March 2016 at 11:41, Yogi Devendra <yogidevendra@apache.org> wrote:
> > > >
> > > > > Shubham,
> > > > >
> > > > > > I feel that, instead of being a separate operator, this should be a
> > > > > > plugin to the input operator.
> > > > > >
> > > > > > That way, anyone who has another input operator for a particular
> > > > > > file system (extending AbstractFileInputOperator) would be able to
> > > > > > read Parquet files from that file system using this plugin.
> > > > >
> > > > > ~ Yogi
> > > > >
> > > > > > On 14 March 2016 at 11:31, Tushar Gosavi <tushar@datatorrent.com> wrote:
> > > > >
> > > > >> +1
> > > > >>
> > > > > >> Does Parquet support partitioned reads from a single file? If yes,
> > > > > >> then maybe we can also add support in FileSplitterInput and
> > > > > >> BlockReader to read a single file in parallel.
> > > > >>
> > > > >> - Tushar.
> > > > >>
> > > > >>
> > > > >>
> > > > > >> On Mon, Mar 14, 2016 at 11:23 AM, Devendra Tagare <devendrat@datatorrent.com> wrote:
> > > > >>
> > > > >> > + 1
> > > > >> >
> > > > >> > ~Dev
> > > > >> >
> > > > > >> > On Mon, Mar 14, 2016 at 11:12 AM, Shubham Pathak <shubham@datatorrent.com> wrote:
> > > > >> >
> > > > >> > > Hello Community,
> > > > >> > >
> > > > > >> > > I am developing a ParquetReaderOperator that will allow Apex
> > > > > >> > > users to read Parquet files.
> > > > >> > >
> > > > > >> > > Apache Parquet is a columnar storage format available to any
> > > > > >> > > project in the Hadoop ecosystem, regardless of the choice of data
> > > > > >> > > processing framework, data model, or programming language.
> > > > > >> > > For more information: Apache Parquet
> > > > > >> > > <https://parquet.apache.org/documentation/latest/>
> > > > >> > >
> > > > > >> > > Proposed design:
> > > > > >> > >
> > > > > >> > >    1. Develop AbstractParquetFileReaderOperator, which extends
> > > > > >> > >    AbstractFileInputOperator.
> > > > > >> > >    2. Override the openFile() method to instantiate a ParquetReader
> > > > > >> > >    (the reader provided by the parquet-mr
> > > > > >> > >    <https://github.com/Parquet/parquet-mr> project that reads
> > > > > >> > >    Parquet records from a file) with GroupReadSupport (records
> > > > > >> > >    would be read as Group).
> > > > > >> > >    3. Override the readEntity() method to read the records and call
> > > > > >> > >    the convertGroup() method. Derived classes override
> > > > > >> > >    convertGroup() to convert a Group to any form required by
> > > > > >> > >    downstream operators.
> > > > > >> > >    4. Provide a concrete implementation, the ParquetFilePOJOReader
> > > > > >> > >    operator, which extends AbstractParquetFileReaderOperator and
> > > > > >> > >    overrides convertGroup() to convert a given Group to a POJO.
> > > > >> > >
> > > > > >> > > The Parquet schema and directory path would be inputs to the base
> > > > > >> > > operator. For ParquetFilePOJOReader, the POJO class would also be
> > > > > >> > > required.
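
The shape of the proposed design (a base class whose readEntity() delegates to an overridable convertGroup()) can be sketched without any Apex or Parquet dependencies. Everything here is a stand-in chosen for illustration: the class names are hypothetical, a Map<String, Object> substitutes for the parquet-mr Group type, and an in-memory list substitutes for the file that openFile() would open. It is not the proposed operator itself.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the abstract operator: reads records and delegates conversion.
abstract class AbstractRecordReaderSketch<T> {
    private Iterator<Map<String, Object>> records;

    /** Plays the role of openFile(): here the "file" is just an in-memory list. */
    void open(List<Map<String, Object>> source) {
        records = source.iterator();
    }

    /** Plays the role of readEntity(): returns null once the input is exhausted. */
    T readEntity() {
        if (records == null || !records.hasNext()) {
            return null;
        }
        return convertGroup(records.next());
    }

    /** Derived classes decide what form downstream consumers receive. */
    protected abstract T convertGroup(Map<String, Object> group);
}

// Hypothetical POJO used only for this example.
class UserSketch {
    final String name;
    final int age;

    UserSketch(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

// Stand-in for the concrete POJO reader: maps fields onto a POJO.
class UserPojoReaderSketch extends AbstractRecordReaderSketch<UserSketch> {
    @Override
    protected UserSketch convertGroup(Map<String, Object> group) {
        return new UserSketch((String) group.get("name"), (Integer) group.get("age"));
    }

    public static void main(String[] args) {
        Map<String, Object> group = new LinkedHashMap<>();
        group.put("name", "example");
        group.put("age", 30);
        List<Map<String, Object>> source = new ArrayList<>();
        source.add(group);

        UserPojoReaderSketch reader = new UserPojoReaderSketch();
        reader.open(source);
        for (UserSketch u = reader.readEntity(); u != null; u = reader.readEntity()) {
            System.out.println(u.name + " " + u.age);
        }
    }
}
```

Keeping the conversion in a single overridable method means a different output form only needs a new subclass, which matches the split between the abstract and POJO operators described above.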
> > > > >> > >
> > > > >> > > Please feel free to let me know your thoughts on this.
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > > Shubham
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>



-- 
Pradeep A. Dalvi

Software Engineer
DataTorrent (India)
