apex-dev mailing list archives

From Shubham Pathak <shub...@datatorrent.com>
Subject Parquet Writer Operator
Date Mon, 25 Apr 2016 18:08:31 GMT
Hello Community,

Apache Parquet <https://parquet.apache.org/documentation/latest/> is a
column-oriented binary file format designed to be extremely efficient and
interoperable across the Hadoop ecosystem. It has integrations with most of
the Hadoop processing frameworks (Impala, Hive, Pig, Spark, etc.) and
serialization models (Thrift, Avro, Protobuf), making it easy to use in ETL
and processing pipelines.

Having an operator to write data to Parquet files would certainly be a good
addition to the Malhar library.

The underlying implementation
<http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/>
for writing data as Parquet requires a subclass of
parquet.hadoop.api.WriteSupport that knows how to take an in-memory object
and write Parquet primitives through parquet.io.api.RecordConsumer.
Currently, there are several WriteSupport implementations, including
ThriftWriteSupport, AvroWriteSupport, and ProtoWriteSupport. These
WriteSupport implementations are then wrapped as ParquetWriter objects for
writing.
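
For illustration, here is a minimal sketch of what a custom WriteSupport
could look like. The Event POJO, its schema, and the pre-rename parquet.*
package names are only assumptions for the example, not existing code:

import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;

import parquet.hadoop.api.WriteSupport;
import parquet.io.api.Binary;
import parquet.io.api.RecordConsumer;
import parquet.schema.MessageType;
import parquet.schema.MessageTypeParser;

// Hypothetical WriteSupport for a simple Event(id, name) POJO.
public class EventWriteSupport extends WriteSupport<Event>
{
  private static final MessageType SCHEMA = MessageTypeParser.parseMessageType(
      "message Event { required int64 id; required binary name (UTF8); }");

  private RecordConsumer recordConsumer;

  @Override
  public WriteContext init(Configuration configuration)
  {
    return new WriteContext(SCHEMA, new HashMap<String, String>());
  }

  @Override
  public void prepareForWrite(RecordConsumer recordConsumer)
  {
    this.recordConsumer = recordConsumer;
  }

  @Override
  public void write(Event event)
  {
    // Translate one in-memory object into Parquet primitives.
    recordConsumer.startMessage();
    recordConsumer.startField("id", 0);
    recordConsumer.addLong(event.getId());
    recordConsumer.endField("id", 0);
    recordConsumer.startField("name", 1);
    recordConsumer.addBinary(Binary.fromString(event.getName()));
    recordConsumer.endField("name", 1);
    recordConsumer.endMessage();
  }
}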

Parquet writers do not expose a handle to the underlying output stream. In
order to write data to a Parquet file, all the records (that belong to the
file) must be buffered in memory. These records are then compressed and
later flushed to the file.
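
As a usage sketch (reusing the hypothetical EventWriteSupport above and
assuming the simple two-argument ParquetWriter convenience constructor),
records handed to write() are buffered and only materialized on close():

import org.apache.hadoop.fs.Path;

import parquet.hadoop.ParquetWriter;

// write() buffers and encodes records in memory; close() compresses the
// remaining row group and writes the file footer.
ParquetWriter<Event> writer =
    new ParquetWriter<Event>(new Path("/tmp/events.parquet"), new EventWriteSupport());
for (Event e : events) {  // events: the in-memory records for this file
  writer.write(e);
}
writer.close();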

To start with, we could support the following features in the operator (a
configuration sketch follows the list below):

   - *Ability to provide a WriteSupport implementation*: The user should be
   able to use existing implementations of parquet.hadoop.api.WriteSupport
   or provide his/her own implementation.
   - *Ability to configure Page Size*: Refers to the amount of uncompressed
   data for a single column that is read before it is compressed as a unit
   and buffered in memory to be written out as a "page". Default value: 1 MB.
   - *Ability to configure Parquet Block Size*: Refers to the amount of
   compressed data that should be buffered in memory before a row group is
   written out to disk. Larger block sizes require more memory to buffer the
   data; 128 MB or 256 MB is recommended.
   - *Flushing files periodically*: The operator would have to flush files
   periodically to a specified directory as per the configured block size.
   This could be time based, number-of-events based, or size based.
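
For the size settings, the operator could pass page size and block size
straight through to the underlying writer. A rough sketch using the
parquet-mr constructor that takes these sizes (again reusing the
hypothetical EventWriteSupport; the values are just the default and
recommendation mentioned above):

import org.apache.hadoop.fs.Path;

import parquet.hadoop.ParquetWriter;
import parquet.hadoop.metadata.CompressionCodecName;

int pageSize  = 1024 * 1024;        // 1 MB of uncompressed column data per page
int blockSize = 128 * 1024 * 1024;  // 128 MB buffered before a row group is written
ParquetWriter<Event> writer = new ParquetWriter<Event>(
    new Path("/tmp/events.parquet"), new EventWriteSupport(),
    CompressionCodecName.SNAPPY, blockSize, pageSize);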

To implement the operator, here is one approach (a rough sketch follows
below):

   1. Extend the existing AbstractFileOutputOperator.
   2. Provide methods to add WriteSupport implementations.
   3. In the process method, hold the data in memory until we reach a
   configured size, and then flush the contents to a file during endWindow().
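
To make the buffering idea concrete, here is a rough, simplified sketch. It
does not yet extend AbstractFileOutputOperator (so it sidesteps the recovery
question below), and the Apex base-class/package names and the flush
threshold are assumptions that would need adjusting:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;

import parquet.hadoop.ParquetWriter;
import parquet.hadoop.api.WriteSupport;

import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.common.util.BaseOperator;

// Hypothetical sketch: buffer tuples until the configured threshold and
// roll a new Parquet file from endWindow().
public class ParquetFileOutputOperator<T> extends BaseOperator
{
  private String outputDirectory;                  // target directory, e.g. on HDFS
  private int flushThresholdTuples = 10000;        // buffered tuples before a file is cut
  private transient WriteSupport<T> writeSupport;  // user-supplied implementation
  private final List<T> buffer = new ArrayList<T>();
  private long fileIndex = 0;

  public final transient DefaultInputPort<T> input = new DefaultInputPort<T>()
  {
    @Override
    public void process(T tuple)
    {
      buffer.add(tuple);  // hold tuples in memory until a flush is due
    }
  };

  @Override
  public void endWindow()
  {
    if (buffer.size() < flushThresholdTuples) {
      return;  // keep accumulating across windows
    }
    Path file = new Path(outputDirectory, "part-" + (fileIndex++) + ".parquet");
    try {
      ParquetWriter<T> writer = new ParquetWriter<T>(file, writeSupport);
      for (T tuple : buffer) {
        writer.write(tuple);
      }
      writer.close();  // compresses the buffered row groups and writes the footer
      buffer.clear();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void setWriteSupport(WriteSupport<T> writeSupport)
  {
    this.writeSupport = writeSupport;
  }

  public void setOutputDirectory(String outputDirectory)
  {
    this.outputDirectory = outputDirectory;
  }
}

Since operators are serialized when the DAG is launched, the WriteSupport
would in practice probably have to be configured by class name rather than
held as a transient field. Extending AbstractFileOutputOperator instead
would be the way to reuse its rolling and recovery machinery, which is
exactly the open question below.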

Please send across your thoughts on this. I would also like to know whether
we would be able to leverage the recovery mechanisms provided by
AbstractFileOutputOperator with this approach.


Thanks,
Shubham
