apex-users mailing list archives

From Sunil Parmar <spar...@threatmetrix.com>
Subject Partitioned Parquet File output operator
Date Thu, 19 May 2016 16:51:22 GMT
All,
We have a use case to build a data ingestion app that reads data from Kafka, transforms it, and
writes it to HDFS as Parquet files. I am trying to implement a Parquet file output operator
that supports partitions (defined by input fields). I would appreciate the community's input
on the following.

Staging data
Parquet stores data organized by column instead of by record. Because it keeps data in
contiguous chunks by column, appending new records to a dataset requires either rewriting substantial
portions of an existing file or buffering records to create a new file (data compaction).
So while Parquet may have storage and query benefits, it may not make sense to write it directly
from a record stream.
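
To make the buffering option concrete, here is a minimal sketch of a staging buffer, assuming records are grouped by a partition key and released in whole batches once a record-count threshold is reached. The class name, method names, and threshold are hypothetical, and the actual Parquet write is left out; each released batch would be written as one new file rather than appended to an existing one.

```python
from collections import defaultdict


class StagingBuffer:
    """Hypothetical helper: buffers records per partition, releases whole batches.

    Each released batch is meant to become one new, immutable file,
    so no existing file is ever appended to.
    """

    def __init__(self, max_records):
        self.max_records = max_records
        self._pending = defaultdict(list)

    def add(self, partition_key, record):
        """Buffer a record; return the full batch once the threshold is hit, else None."""
        batch = self._pending[partition_key]
        batch.append(record)
        if len(batch) >= self.max_records:
            # Hand the batch over and start a fresh buffer for this partition.
            return self._pending.pop(partition_key)
        return None

    def flush(self):
        """Release all partial batches, e.g. at a window or checkpoint boundary."""
        batches = dict(self._pending)
        self._pending.clear()
        return batches
```

A count threshold is only one possible trigger; a size or time based one would follow the same shape.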

Partition and sorting strategy implementation
Our use case is for immutable, read-only data sets. We plan to use Impala to access the data
once it's built.
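
As an illustration of partitions defined by input fields, the sketch below derives a target directory from a record using the `field=value` directory layout that Impala reads for partitioned tables. The function name, base directory, and field names are assumptions made for the example, not part of any existing operator.

```python
import posixpath


def partition_path(base_dir, record, partition_fields):
    """Build a field=value directory path for a record (hypothetical helper).

    The order of partition_fields determines the directory nesting.
    """
    parts = ["{}={}".format(name, record[name]) for name in partition_fields]
    return posixpath.join(base_dir, *parts)
```

For example, `partition_path("/data/events", {"year": 2016, "month": 5}, ["year", "month"])` gives `/data/events/year=2016/month=5`.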

Frameworks
I'm leaning towards using kite-sdk ( http://kitesdk.org/ ) as it supports APIs for staging,
complex data types, and partitioning.

  *   General thoughts and ideas about the approach.
  *   If any of you have faced similar issues or done something like this, please share your
thoughts, obstacles, and code samples if possible.
  *   Apex devs, if something like this is already planned in Malhar, please let us know.

Thanks,
Sunil
