hawq-dev mailing list archives

From Shivram Mani <shivram.m...@gmail.com>
Subject Re: Support orc format
Date Thu, 23 Jun 2016 01:39:02 GMT
Yes. Two separate Jiras would be more apt.

FDW comes from the SQL/MED standard, which is now part of SQL. The conflict
here is that the standard defines 'Foreign Table', which overlaps with HAWQ's
'External Table' (along with the protocols introduced for it, such as PXF and
gpfdist).

I don't think we should discuss the framework any further in this thread, as
the subject is strictly ORC support.

On Wed, Jun 22, 2016 at 5:02 PM, Lei Chang <lei_chang@apache.org> wrote:

> On Thu, Jun 23, 2016 at 6:17 AM, Ting(Goden) Yao <tyao@pivotal.io> wrote:
>
> > 1) the framework is not designed by HAWQ community - it was from Postgres
> >
>
> That is not correct. FDW is part of the SQL standard, and we are trying to
> follow the standard. Postgres has an implementation, and HAWQ can also have
> one; from the spec, you can see that it is designed for HAWQ by us.
>
>
> > 2) the JIRA itself is titled as "ORC as native format" which has nothing to
> > do with this framework
> >
>
> Thanks for pointing this out. The title is somewhat confusing; I will
> change it to a more general one, or split the work into two umbrella
> JIRAs.
>
>
> >
> > We should not try to lump multiple features and ideas into one JIRA
> >
> >
> > On Wed, Jun 22, 2016 at 12:28 AM Lei Chang <lei_chang@apache.org> wrote:
> >
> > > On Wed, Jun 22, 2016 at 1:39 AM, Goden Yao <godenyao@apache.org> wrote:
> > >
> > > > This is not comparable as native vs. external.
> > > > The design doc attached in HAWQ-786
> > > > <https://issues.apache.org/jira/browse/HAWQ-786>, like some community
> > > > responses in the JIRA, mixes up an External Table data access framework
> > > > with file format support.
> > > >
> > > > If the JIRA is merely about using ORC as a native file format, since we
> > > > see its popularity in the Hadoop community and potentially want to
> > > > replace parquet with ORC as the default for its benefits and advantages,
> > > > this JIRA should focus on the native file format part and how to
> > > > integrate with the C library from the Apache ORC project.
> > > >
> > >
> > >
> > > As described in the JIRA, the framework is designed as a general
> > > framework.
> > >
> > > It can also potentially be used for external data. There is an example
> > > showing the usage.
> > >
> > >
> > > >
> > > > To answer Roman's questions, I think we first need to understand the
> > > > user scenario for external tables (with ORC format), in which users:
> > > > 1) already have ORC files landed in HDFS (or stored as Hive tables)
> > > > 2) want to query from HAWQ, so they may get a performance gain from the
> > > > MPP architecture provided by HAWQ, instead of MR jobs.
> > > > 3) want to avoid data duplication, which means they don't want to load
> > > > data into HAWQ native format (so it doesn't matter what native format
> > > > HAWQ uses to store the table)
> > > >
> > > > Given that, I think it's worth a further discussion in the theme of
> > > > improving external data source access/query performance.
> > > >
> > > > Thanks
> > > > -Goden
> > > >
> > > >
> > > >
> > > > On Mon, Jun 20, 2016 at 5:55 PM Lei Chang <lei_chang@apache.org> wrote:
> > > >
> > > > > On Tue, Jun 21, 2016 at 8:38 AM, Roman Shaposhnik <roman@shaposhnik.org> wrote:
> > > > >
> > > > > > On Fri, Jun 17, 2016 at 3:02 AM, Ming Li <mli@pivotal.io> wrote:
> > > > > > > Hi Guys,
> > > > > > >
> > > > > > > ORC (Optimized Row Columnar) is a very popular open source format
> > > > > > > adopted in some major components in the Hadoop ecosystem. It is also
> > > > > > > used by a lot of users. The advantages of supporting ORC storage in
> > > > > > > HAWQ are twofold: first, it makes HAWQ more Hadoop native, so it
> > > > > > > interacts with other components more easily; second, ORC stores some
> > > > > > > meta info for query optimization, so it might potentially outperform
> > > > > > > the two native formats (i.e., AO, Parquet) if it is available.
> > > > > > >
> > > > > > > Since there are lots of popular formats available in the HDFS
> > > > > > > community, and more advanced formats are emerging frequently, it is a
> > > > > > > good option for HAWQ to design a general framework that supports
> > > > > > > pluggable C/C++ formats such as ORC, as well as native formats such as
> > > > > > > AO and Parquet. In designing this framework, we also need to support
> > > > > > > data stored in different file systems: HDFS, local disk, Amazon S3,
> > > > > > > etc. Thus, it is better to offer a framework that supports pluggable
> > > > > > > formats and pluggable file systems.
> > > > > > >
> > > > > > > We are proposing to support ORC in JIRA
> > > > > > > (https://issues.apache.org/jira/browse/HAWQ-786). Please see the
> > > > > > > design spec in the JIRA.
> > > > > > >
> > > > > > > Your comments are appreciated!
> > > > > >
> > > > > > This sounds reasonable, but I'd like to understand the trade-offs
> > > > > > between supporting something like ORC in PXF vs. implementing it
> > > > > > natively in C/C++.
> > > > > >
> > > > > > Is there any hard performance/etc. data that you could share to
> > > > > > illuminate the trade-offs between these two approaches?
> > > > > >
> > > > >
> > > > > Implementing it natively in C/C++ will get at least comparable
> > > > > performance to the current native AO and Parquet formats.
> > > > >
> > > > > And we know that AO and Parquet are faster than PXF, so we are
> > > > > expecting better performance here.
> > > > >
> > > > > Cheers
> > > > > Lei
> > > > >
> > > >
> > >
> >
>



-- 
shivram mani
