hawq-dev mailing list archives

From Roman Shaposhnik <ro...@shaposhnik.org>
Subject Re: Support orc format
Date Tue, 21 Jun 2016 00:38:03 GMT
On Fri, Jun 17, 2016 at 3:02 AM, Ming Li <mli@pivotal.io> wrote:
> Hi Guys,
>
> ORC (Optimized Row Columnar) is a very popular open source format adopted
> by some major components in the Hadoop ecosystem. It is also used by a lot of
> users. The advantages of supporting ORC storage in HAWQ are twofold:
> first, it makes HAWQ more Hadoop-native, so it interacts with other
> components more easily; second, ORC stores some metadata for query
> optimization, so it might outperform the two native formats
> (i.e., AO and Parquet) once it is available.
>
> Since there are lots of popular formats available in the HDFS community, and
> more advanced formats are emerging frequently, it is a good option for HAWQ
> to design a general framework that supports pluggable C/C++ formats such as
> ORC, as well as native formats such as AO and Parquet. In designing this
> framework, we also need to support data stored in different file systems:
> HDFS, local disk, Amazon S3, etc. Thus, it is better to offer a framework
> that supports both pluggable formats and pluggable file systems.
>
> We are proposing to support ORC in JIRA (
> https://issues.apache.org/jira/browse/HAWQ-786). Please see the design spec
> in the JIRA.
>
> Your comments are appreciated!

This sounds reasonable, but I'd like to understand the trade-offs between
supporting something like ORC in PXF vs. implementing it natively in C/C++.

Is there any hard data (performance, etc.) that you could share to illuminate
the trade-offs between these two approaches?

Thanks,
Roman.
