apex-dev mailing list archives

From Sandeep Deshmukh <sand...@datatorrent.com>
Subject Re: Adding ParquetReaderOperator in Malhar
Date Tue, 22 Mar 2016 09:50:19 GMT
Good suggestion, Tushar, but it would require some time to achieve. Hence,
I would suggest that we go ahead with the basic Parquet support for now
and take Tushar's suggestion as a long-term strategy for redesigning the
FileSplitter and BlockReader combination in line with the MR approach.



Regards,
Sandeep

On Thu, Mar 17, 2016 at 5:15 PM, Tushar Gosavi <tushar@datatorrent.com>
wrote:

> Does it make sense to use InputSplit in FileSplitterInput to generate the
> file split information, and InputFormat in BlockReader to read the records?
> This way we could read the variety of formats already supported by Hadoop
> in Apex. Parquet has an InputFormat and InputSplit defined.
>
> - Tushar.
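The split-then-read contract Tushar is proposing to reuse can be sketched in plain Java. This is a minimal stand-in for Hadoop's InputFormat idea, not Hadoop's or Malhar's actual API: one object knows both how to split an input (the FileSplitterInput side) and how to read records out of a single split (the BlockReader side). All class and method names here are illustrative.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative mirror of the InputFormat contract: getSplits() for the
// splitter, createReader() for the block reader. Not a real Hadoop API.
interface SimpleInputFormat<T> {
    List<int[]> getSplits(String data, int splitSize);  // {offset, length} pairs
    Iterator<T> createReader(String data, int[] split); // records within one split
}

// Toy format with fixed-width 4-character records, so any split size that
// is a multiple of 4 never cuts a record in half.
final class FixedWidthFormat implements SimpleInputFormat<String> {
    public List<int[]> getSplits(String data, int splitSize) {
        List<int[]> out = new ArrayList<>();
        for (int off = 0; off < data.length(); off += splitSize) {
            out.add(new int[] {off, Math.min(splitSize, data.length() - off)});
        }
        return out;
    }

    public Iterator<String> createReader(String data, int[] split) {
        String chunk = data.substring(split[0], split[0] + split[1]);
        List<String> records = new ArrayList<>();
        for (int i = 0; i + 4 <= chunk.length(); i += 4) {
            records.add(chunk.substring(i, i + 4));
        }
        return records.iterator();
    }
}
```

Because Parquet ships its own InputFormat/InputSplit, plugging them into this shape is what would let Apex read any Hadoop-supported format through one code path.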
>
>
> On Mon, Mar 14, 2016 at 5:54 PM, Pradeep Dalvi <
> pradeep.dalvi@datatorrent.com> wrote:
>
> > +1
> >
> > On Mon, Mar 14, 2016 at 5:19 PM, Chinmay Kolhatkar <chinmay@apache.org>
> > wrote:
> >
> > > +1.
> > >
> > > On Mon, Mar 14, 2016 at 2:55 PM, Devendra Tagare <
> > > devendrat@datatorrent.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > Using parquet.block.size = 128/256 MB on the writer side will ensure
> > > > that the column chunks are not striped across blocks for a large file.
> > > >
> > > > The reader can then read the individual row groups iteratively.
> > > >
> > > > The FileSplitter would then split the files at the given size into
> > > > separate chunks that can be handled downstream.
> > > >
> > > > Dev
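The FileSplitter step Dev describes is just fixed-size chunking: if the writer's parquet.block.size matches the split size (e.g. 128 MB), every row group lands wholly inside one chunk. A minimal sketch of that split computation, with hypothetical names rather than Malhar's actual FileSplitter API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative split calculator: cut a file of the given length into
// fixed-size (offset, length) chunks that a downstream BlockReader could
// consume independently. Names are hypothetical, not Malhar's API.
final class BlockSplitter {
    static final class Split {
        final long offset, length;
        Split(long offset, long length) { this.offset = offset; this.length = length; }
    }

    static List<Split> splits(long fileLength, long blockSize) {
        List<Split> out = new ArrayList<>();
        for (long off = 0; off < fileLength; off += blockSize) {
            // Last chunk may be shorter than blockSize.
            out.add(new Split(off, Math.min(blockSize, fileLength - off)));
        }
        return out;
    }
}
```

For a 300 MB file with 128 MB blocks this yields three chunks, the last one 44 MB; each can hold only whole row groups when the writer used the same block size.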
> > > >
> > > > On Mon, Mar 14, 2016 at 12:31 PM, Shubham Pathak <
> > > > shubham@datatorrent.com> wrote:
> > > >
> > > > > @Tushar,
> > > > >
> > > > > A Parquet file looks like this:
> > > > >
> > > > > 4-byte magic number "PAR1"
> > > > > <Column 1 Chunk 1 + Column Metadata>
> > > > > <Column 2 Chunk 1 + Column Metadata>
> > > > > ...
> > > > > <Column N Chunk 1 + Column Metadata>
> > > > > <Column 1 Chunk 2 + Column Metadata>
> > > > > <Column 2 Chunk 2 + Column Metadata>
> > > > > ...
> > > > > <Column N Chunk 2 + Column Metadata>
> > > > > ...
> > > > > <Column 1 Chunk M + Column Metadata>
> > > > > <Column 2 Chunk M + Column Metadata>
> > > > > ...
> > > > > <Column N Chunk M + Column Metadata>
> > > > > File Metadata
> > > > > 4-byte length in bytes of file metadata
> > > > > 4-byte magic number "PAR1"
> > > > >
> > > > > Parquet being a binary columnar storage format, readers are
> > > > > expected to first read the file metadata to find all the column
> > > > > chunks they are interested in. The column chunks should then be
> > > > > read sequentially.
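The trailer layout above (file metadata, then a 4-byte little-endian metadata length, then the "PAR1" magic) is what lets a reader seek straight to the footer. A hand-rolled sketch of that check, purely to illustrate the layout; a real reader would use parquet-mr's footer APIs rather than parse bytes itself:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustrative parser for the Parquet file trailer described above:
// ... <file metadata> <4-byte little-endian metadata length> "PAR1"
final class ParquetTrailer {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    // Given the raw bytes of a Parquet file, return the length in bytes of
    // the file metadata block, or -1 if either magic number is missing.
    static int footerLength(byte[] file) {
        if (file.length < 12) return -1; // leading magic + length + trailing magic
        for (int i = 0; i < 4; i++) {
            if (file[i] != MAGIC[i]) return -1;                   // leading "PAR1"
            if (file[file.length - 4 + i] != MAGIC[i]) return -1; // trailing "PAR1"
        }
        // The 4 bytes before the trailing magic hold the metadata length.
        return ByteBuffer.wrap(file, file.length - 8, 4)
                         .order(ByteOrder.LITTLE_ENDIAN)
                         .getInt();
    }
}
```

Reading the length from the tail is why a reader can locate every column chunk with two seeks, before touching any data pages.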
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Mar 14, 2016 at 11:44 AM, Yogi Devendra <
> > > > > yogidevendra@apache.org> wrote:
> > > > >
> > > > > > +1 for Parquet reader.
> > > > > >
> > > > > > ~ Yogi
> > > > > >
> > > > > > On 14 March 2016 at 11:41, Yogi Devendra <yogidevendra@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > Shubham,
> > > > > > >
> > > > > > > I feel that instead of having an operator, it should be a
> > > > > > > plugin to the input operator.
> > > > > > >
> > > > > > > That way, if someone has some other input operator for a
> > > > > > > particular file system (extending AbstractFileInputOperator),
> > > > > > > he should be able to read Parquet files from that file system
> > > > > > > using this plugin.
> > > > > > >
> > > > > > > ~ Yogi
> > > > > > >
> > > > > > > On 14 March 2016 at 11:31, Tushar Gosavi <tushar@datatorrent.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> +1
> > > > > > >>
> > > > > > >> Does Parquet support partitioned reads from a single file? If
> > > > > > >> yes, then maybe we can also add support in FileSplitterInput
> > > > > > >> and BlockReader to read a single file in parallel.
> > > > > > >>
> > > > > > >> - Tushar.
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> On Mon, Mar 14, 2016 at 11:23 AM, Devendra Tagare <
> > > > > > >> devendrat@datatorrent.com> wrote:
> > > > > > >>
> > > > > > >> > + 1
> > > > > > >> >
> > > > > > >> > ~Dev
> > > > > > >> >
> > > > > > >> > On Mon, Mar 14, 2016 at 11:12 AM, Shubham Pathak <
> > > > > > >> > shubham@datatorrent.com> wrote:
> > > > > > >> >
> > > > > > >> > > Hello Community,
> > > > > > >> > >
> > > > > > >> > > I am working on developing a ParquetReaderOperator which
> > > > > > >> > > will allow Apex users to read Parquet files.
> > > > > > >> > >
> > > > > > >> > > Apache Parquet is a columnar storage format available to
> > > > > > >> > > any project in the Hadoop ecosystem, regardless of the
> > > > > > >> > > choice of data processing framework, data model or
> > > > > > >> > > programming language.
> > > > > > >> > > For more information: Apache Parquet
> > > > > > >> > > <https://parquet.apache.org/documentation/latest/>
> > > > > > >> > >
> > > > > > >> > > Proposed design:
> > > > > > >> > >
> > > > > > >> > >    1. Develop AbstractParquetFileReaderOperator that
> > > > > > >> > >    extends from AbstractFileInputOperator.
> > > > > > >> > >    2. Override the openFile() method to instantiate a
> > > > > > >> > >    ParquetReader (a reader provided by the parquet-mr
> > > > > > >> > >    <https://github.com/Parquet/parquet-mr> project that
> > > > > > >> > >    reads Parquet records from a file) with GroupReadSupport
> > > > > > >> > >    (records would be read as Group).
> > > > > > >> > >    3. Override the readEntity() method to read the records
> > > > > > >> > >    and call the convertGroup() method. Derived classes
> > > > > > >> > >    override convertGroup() to convert a Group to any form
> > > > > > >> > >    required by downstream operators.
> > > > > > >> > >    4. Provide a concrete implementation, the
> > > > > > >> > >    ParquetFilePOJOReader operator, that extends from
> > > > > > >> > >    AbstractParquetFileReaderOperator and overrides
> > > > > > >> > >    convertGroup() to convert a given Group to a POJO.
> > > > > > >> > >
> > > > > > >> > > The Parquet schema and directory path would be inputs to
> > > > > > >> > > the base operator. For ParquetFilePOJOReader, the POJO
> > > > > > >> > > class would also be required.
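The hook structure in steps 3 and 4 of the proposal can be sketched as follows. This is a sketch only: the real operator would extend Malhar's AbstractFileInputOperator and pull parquet-mr Groups through a ParquetReader with GroupReadSupport, whereas here a plain Map stands in for Group, the file-handling plumbing is omitted, and the User POJO is a made-up example.

```java
import java.util.Map;

// Sketch of the abstract reader: readEntity() fetches a record and
// delegates conversion to the subclass's convertGroup() hook.
abstract class AbstractParquetReaderSketch<T> {
    // In the real operator this would read the next Group from the
    // ParquetReader opened in openFile(); here the record is passed in.
    public T readEntity(Map<String, Object> group) {
        return convertGroup(group);
    }

    // Derived classes convert a Group to whatever downstream operators need.
    protected abstract T convertGroup(Map<String, Object> group);
}

// Sketch of the concrete POJO reader: converts each record to a POJO.
final class ParquetPojoReaderSketch
        extends AbstractParquetReaderSketch<ParquetPojoReaderSketch.User> {
    // Hypothetical POJO class supplied by the user of the operator.
    static final class User {
        String name;
        int age;
    }

    @Override
    protected User convertGroup(Map<String, Object> group) {
        User u = new User();
        u.name = (String) group.get("name");
        u.age = (Integer) group.get("age");
        return u;
    }
}
```

Keeping convertGroup() as the only abstract method means each output shape (POJO, map, byte array) needs just one small subclass, while file discovery and record iteration stay in the base operator.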
> > > > > > >> > >
> > > > > > >> > > Please feel free to let me know your thoughts on this.
> > > > > > >> > >
> > > > > > >> > > Thanks,
> > > > > > >> > > Shubham
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Pradeep A. Dalvi
> >
> > Software Engineer
> > DataTorrent (India)
> >
>
