apex-dev mailing list archives

From Shubham Pathak <shub...@datatorrent.com>
Subject Adding ParquetReaderOperator in Malhar
Date Mon, 14 Mar 2016 05:42:43 GMT
Hello Community,

I am working on a ParquetReaderOperator that will allow Apex users to read
Parquet files.

Apache Parquet is a columnar storage format available to any project in the
Hadoop ecosystem, regardless of the choice of data processing framework,
data model or programming language.
For more information: Apache Parquet

Proposed design:

   1. Develop an AbstractParquetFileReaderOperator that extends
   AbstractFileInputOperator.
   2. Override the openFile() method to instantiate a ParquetReader (the
   reader provided by the parquet-mr <https://github.com/Parquet/parquet-mr>
   project, which reads Parquet records from a file) with GroupReadSupport
   (records would be read as Group).
   3. Override the readEntity() method to read the records and call the
   convertGroup() method. Derived classes override convertGroup() to
   convert a Group to whatever form downstream operators require.
   4. Provide a concrete implementation, the ParquetFilePOJOReader operator,
   which extends AbstractParquetFileReaderOperator and overrides
   convertGroup() to convert a given Group to a POJO.

The Parquet schema and directory path would be inputs to the base operator.
For ParquetFilePOJOReader, the POJO class would also be required.
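
To make the class layout concrete, here is a rough, self-contained sketch of
the abstract-reader / convertGroup() split. It does not use the Malhar or
parquet-mr APIs: RecordGroup, AbstractGroupReader, PojoGroupReader, Person and
ReaderSketch are hypothetical stand-ins for Group,
AbstractParquetFileReaderOperator and ParquetFilePOJOReader, and the
reflection-based field mapping is only one possible way to build the POJO.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Parquet's Group: a flat field-name -> value map.
class RecordGroup {
    final Map<String, Object> fields = new LinkedHashMap<>();

    RecordGroup with(String name, Object value) {
        fields.put(name, value);
        return this;
    }
}

// Hypothetical stand-in for the proposed AbstractParquetFileReaderOperator.
abstract class AbstractGroupReader<T> {
    private Iterator<RecordGroup> source;

    // Plays the role of openFile(): binds the reader to a source of records.
    void open(List<RecordGroup> records) {
        source = records.iterator();
    }

    // Plays the role of readEntity(): returns the next converted record,
    // or null once the input is exhausted.
    T readEntity() {
        if (source == null || !source.hasNext()) {
            return null;
        }
        return convertGroup(source.next());
    }

    // Derived classes decide what downstream operators receive.
    protected abstract T convertGroup(RecordGroup group);
}

// Hypothetical stand-in for ParquetFilePOJOReader: needs the POJO class as input.
class PojoGroupReader<T> extends AbstractGroupReader<T> {
    private final Class<T> pojoClass;

    PojoGroupReader(Class<T> pojoClass) {
        this.pojoClass = pojoClass;
    }

    @Override
    protected T convertGroup(RecordGroup group) {
        try {
            Constructor<T> ctor = pojoClass.getDeclaredConstructor();
            ctor.setAccessible(true);
            T pojo = ctor.newInstance();
            // Copy each record field into the POJO field with the same name.
            for (Map.Entry<String, Object> e : group.fields.entrySet()) {
                Field f = pojoClass.getDeclaredField(e.getKey());
                f.setAccessible(true);
                f.set(pojo, e.getValue());
            }
            return pojo;
        } catch (ReflectiveOperationException ex) {
            throw new IllegalStateException(
                "Cannot map record to " + pojoClass.getName(), ex);
        }
    }
}

// Example POJO (hypothetical).
class Person {
    String name;
    int age;
}

class ReaderSketch {
    public static void main(String[] args) {
        PojoGroupReader<Person> reader = new PojoGroupReader<>(Person.class);
        reader.open(Arrays.asList(
            new RecordGroup().with("name", "Ada").with("age", 36)));
        Person p = reader.readEntity();
        System.out.println(p.name + " " + p.age);
    }
}
```

In the actual operator, open() would correspond to openFile() creating a
ParquetReader with GroupReadSupport, and the records would come from the
Parquet file rather than an in-memory list.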

Please feel free to let me know your thoughts on this.

