hawq-dev mailing list archives

From Leon Zhang <leonca...@gmail.com>
Subject Re: Question About PXF
Date Tue, 03 Nov 2015 07:20:31 GMT
Thanks for your reply.

In our tests, HAWQ's managed tables are extremely fast. When we run the
same queries through PXF (Hive ORC) at the same data size, for example
1G/10G of data generated from TPC-DS, the running time of each query
increases hugely. It seems all I/O traffic goes through pxf-service.
As the data grows, it becomes a bottleneck.

Intuitively, I think mixed use of Hive and HAWQ is very appealing. We
would like to hear any advice on how to improve it, especially on how
to scale HAWQ with external data sources.


On Mon, Nov 2, 2015 6:55 PM "Ting(Goden) Yao" <tyao@pivotal.io> wrote:

> Thanks for your interest in HAWQ, Leon.
> Can you be more specific regarding what you mean by "bottleneck"? Any
> database system could have one or more bottlenecks, depending on your
> data flow patterns, query plan and execution, etc.
> In terms of PXF, it's a Java-based framework that allows HAWQ to access data
> files stored on external storage or in locations that are not directly
> managed by the HAWQ system.
> For Hive ORC tables, first of all, PXF uses Hive APIs to access any file
> format supported by Hive, so it doesn't matter if it's ORC or RC or Parquet
> format you have in Hive. (PXF does provide a few *optimized* profiles to
> access certain formats though, see:
> http://hawq.docs.pivotal.io/docs-hawq/topics/PivotalExtensionFrameworkPXF.html
> )
> The overall performance is determined by 1) Hive's API performance and 2) PXF's
> data retrieval, filtering, aggregation, and sending of results back to HAWQ.
> HAWQ has no control over 1), but we can certainly discuss 2) if you see any
> performance issues or improvements we can work on.
> -Goden
> On Mon, Nov 2, 2015 at 8:46 AM Leon Zhang <leoncamel@gmail.com> wrote:
> > Hi, HAWQ dev,
> >
> >
> > I am new to HAWQ, so I have some questions about the design of PXF. As far
> > as I know, PXF-service is a Tomcat service, which serves external data
> > sources to the HAWQ master in a RESTful way.
> >
> > My question is, will PXF-service become the bottleneck? Especially for the
> > case of Hive ORC tables?
> >
> > Thanks.
> >
