hawq-dev mailing list archives

From Jimmy Da <jd...@cornell.edu>
Subject Re: Question About PXF
Date Tue, 03 Nov 2015 18:16:48 GMT
Leon,

Have you tried the HiveRC profile described here:
http://hawq.docs.pivotal.io/docs-hawq/topics/PXFInstallationandAdministration.html#built-inprofiles
We added some customization there to minimize the marshalling of Java objects.
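For reference, wiring a HAWQ external table to that profile looks roughly like the sketch below. The host, port, and table/column names are placeholders, and the DELIMITER parameter follows the pattern shown in the PXF docs linked above, so double-check it against your version:

```sql
-- Sketch only: "namenode", the port, and the Hive table name are placeholders.
-- HiveRC streams rows back as delimited text, so the delimiter is declared
-- twice: once in the LOCATION URI for PXF, once in the FORMAT clause for HAWQ.
CREATE EXTERNAL TABLE sales_hiverc (id int, total float8)
LOCATION ('pxf://namenode:51200/default.sales_rc?PROFILE=HiveRC&DELIMITER=\x01')
FORMAT 'TEXT' (DELIMITER E'\x01');
```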

Comparing PXF with managed HAWQ tables may not be a fair match, considering
HAWQ is hitting everything on its home turf. A more interesting comparison
would be against Hive's own performance, since PXF uses the same Java
packages that Hive does.

So in short, if we are comparing:
(Hive execution engine + Java file readers) vs (HAWQ execution engine +
PXF/Java file readers)

we would expect to see the performance gain on the execution side of things.

Jimmy Da

On Mon, Nov 2, 2015 at 11:20 PM, Leon Zhang <leoncamel@gmail.com> wrote:

> Thanks for your reply.
>
> In our tests, we can see that HAWQ's managed tables are extremely fast. By
> comparison with PXF (Hive ORC) at the same data size, for example 1G/10G
> data generated from TPC-DS, we see a huge increase in the running
> time of each query. It seems all IO traffic goes through the
> pxf-service, so as the data grows, it becomes a bottleneck.
>
> Intuitively, I think mixed usage of Hive and HAWQ is very attractive. We
> would like to hear some advice about how to improve it, especially
> ways to scale HAWQ with external data sources.
>
>
> Thanks.
>
>
> On Mon, Nov 2, 2015 at 6:55 PM, Ting(Goden) Yao <tyao@pivotal.io> wrote:
>
> > Thanks for your interest in HAWQ, Leon.
> >
> > Can you be more specific about what you mean by "bottleneck"? Any
> > database system can have one or more bottlenecks, depending on your
> > data flow patterns, query plan and execution, etc.
> >
> > In terms of PXF, it's a Java-based framework that allows HAWQ to access
> > data files stored on external storage or in locations that are not
> > directly managed by the HAWQ system.
> >
> > For Hive ORC tables, first of all, PXF uses Hive APIs to access any file
> > format supported by Hive, so it doesn't matter whether you have ORC, RC,
> > or Parquet format in Hive. (PXF does provide a few *optimized* profiles
> > for accessing certain formats, though; see:
> > http://hawq.docs.pivotal.io/docs-hawq/topics/PivotalExtensionFrameworkPXF.html
> > )
> >
> > The overall performance is determined by 1) Hive's API performance and
> > 2) PXF's data retrieval, filtering, aggregation, and sending of results
> > back to HAWQ. HAWQ has no control over 1), but we can certainly discuss
> > 2) if you see any performance issues or improvements we can work on.
> >
> > -Goden
> >
> >
> > On Mon, Nov 2, 2015 at 8:46 AM Leon Zhang <leoncamel@gmail.com> wrote:
> >
> > > Hi, HAWQ dev,
> > >
> > >
> > > I am new to HAWQ, so I have some questions about the design of PXF. As
> > > far as I know, pxf-service is a Tomcat service, which serves external
> > > data sources to the HAWQ master in a RESTful way.
> > >
> > > My question is: will pxf-service become a bottleneck, especially in the
> > > case of Hive ORC tables?
> > >
> > > Thanks.
> > >
>
