hawq-dev mailing list archives

From Goden Yao <goden...@apache.org>
Subject Re: Support orc format
Date Tue, 21 Jun 2016 17:39:03 GMT
This is not a native-vs.-external comparison.
The design doc attached in HAWQ-786
<https://issues.apache.org/jira/browse/HAWQ-786>, as some community
responses in the JIRA point out, mixes up an External Table data access
framework with file format support.

If the JIRA is merely about using ORC as a native file format, given its
popularity in the Hadoop community and the potential to replace Parquet as
the default for its benefits and advantages, then this JIRA should focus
on the native file format part and on how to integrate the C library from
the Apache ORC project.
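
As a rough illustration (not the actual HAWQ-786 design), a pluggable-format
framework in C++ might expose a reader interface plus a factory keyed by
format name; an ORC implementation would then wrap the Apache ORC C/C++
library behind that interface. All class and function names below are
hypothetical, and the ORC reader is a stand-in stub:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical row batch handed back to the executor: column-major
// values, simplified to strings for this sketch.
struct RowBatch {
    std::vector<std::vector<std::string>> columns;
    std::size_t numRows = 0;
};

// Pluggable format reader interface: each native format (AO, Parquet,
// ORC, ...) would provide its own implementation.
class FormatReader {
public:
    virtual ~FormatReader() = default;
    virtual bool next(RowBatch& batch) = 0;  // fill batch; false at EOF
    virtual std::string formatName() const = 0;
};

// Stub ORC reader: a real one would delegate to the Apache ORC C++
// library; here it just fabricates two batches for illustration.
class OrcFormatReader : public FormatReader {
    int batchesLeft_ = 2;
public:
    bool next(RowBatch& batch) override {
        if (batchesLeft_-- <= 0) return false;
        batch.columns = {{"a", "b"}, {"1", "2"}};
        batch.numRows = 2;
        return true;
    }
    std::string formatName() const override { return "orc"; }
};

// Factory keyed by format name, as a pluggable framework might expose.
std::unique_ptr<FormatReader> makeReader(const std::string& fmt) {
    if (fmt == "orc") return std::make_unique<OrcFormatReader>();
    return nullptr;  // unknown format
}
```

The same shape extends to pluggable file systems: a parallel `FileSystem`
abstraction would sit beneath the reader so that HDFS, local disk, or S3
can back any format.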

To answer Roman's questions, I think we first need to understand the user
scenario for external tables (with ORC format). These users:
1) already have ORC files landed in HDFS (or stored as Hive tables);
2) want to query them from HAWQ, so they may get a performance gain from
HAWQ's MPP architecture instead of MR jobs;
3) want to avoid data duplication, meaning they don't want to load the data
into HAWQ's native format (so it doesn't matter which native format HAWQ
uses to store the table).

Given that, I think this is worth further discussion under the theme of
improving external data source access/query performance.

Thanks
-Goden



On Mon, Jun 20, 2016 at 5:55 PM Lei Chang <lei_chang@apache.org> wrote:

> On Tue, Jun 21, 2016 at 8:38 AM, Roman Shaposhnik <roman@shaposhnik.org>
> wrote:
>
> > On Fri, Jun 17, 2016 at 3:02 AM, Ming Li <mli@pivotal.io> wrote:
> > > Hi Guys,
> > >
> > > ORC (Optimized Row Columnar) is a very popular open source format
> > > adopted by some major components in the Hadoop ecosystem, and it is
> > > also used by a lot of users. The advantages of supporting ORC storage
> > > in HAWQ are twofold: firstly, it makes HAWQ more Hadoop-native,
> > > interacting with other components more easily; secondly, ORC stores
> > > some meta info for query optimization, so it might potentially
> > > outperform the two native formats (i.e., AO, Parquet) once available.
> > >
> > > Since there are lots of popular formats in the HDFS community, and
> > > more advanced formats are emerging frequently, it is a good option
> > > for HAWQ to design a general framework that supports pluggable C/C++
> > > formats such as ORC, as well as native formats such as AO and
> > > Parquet. In designing this framework, we also need to support data
> > > stored in different file systems: HDFS, local disk, Amazon S3, etc.
> > > Thus, it is better to offer a framework that supports pluggable
> > > formats and pluggable file systems.
> > >
> > > We are proposing ORC support in JIRA (
> > > https://issues.apache.org/jira/browse/HAWQ-786). Please see the
> > > design spec in the JIRA.
> > >
> > > Your comments are appreciated!
> >
> > This sounds reasonable, but I'd like to understand the trade-offs
> > between supporting something like ORC in PXF vs. implementing it
> > natively in C/C++.
> >
> > Is there any hard performance/etc. data that you could share to
> > illuminate the tradeoffs between these two approaches?
> >
>
> Implementing it natively in C/C++ will get at least comparable performance
> to the current native AO and Parquet formats.
>
> And we know that AO and Parquet are faster than PXF, so we are expecting
> better performance here.
>
> Cheers
> Lei
>
