hawq-dev mailing list archives

From Kavinder Dhaliwal <kdhali...@pivotal.io>
Subject Re: [Propose] More data skipping technology for IO intensive performance enhancement
Date Tue, 05 Jul 2016 16:37:21 GMT
This is an excellent idea that brings HAWQ up to speed with comparable
databases (Presto, Impala). In addition to taking advantage of the stats
available in file formats like ORC, HAWQ should also transition to
vectorized reading of files, as this also provides a performance boost. The
newer Apache ORC library only supports vectorized reading, so HAWQ should
adopt these new methods as well.

Kavinder

On Mon, Jul 4, 2016 at 2:32 AM, Lei Chang <lei_chang@apache.org> wrote:

> Good idea. I think it can potentially increase the performance of IO-bound
> workloads.
>
> Cheers
> Lei
>
>
> On Sat, Jul 2, 2016 at 11:19 PM, Ming Li <mli@pivotal.io> wrote:
>
> > Data skipping technology can avoid a great deal of unnecessary IO, so it
> > can greatly improve performance for IO-intensive queries. In addition to
> > eliminating scans of unneeded table partitions according to the partition
> > key range, I think more options are available now:
> >
> > (1) The Parquet / ORC formats introduce lightweight metadata such as
> > Min/Max/Bloom filter for each block. Such metadata can be exploited when
> > the predicate/filter info can be fetched before executing the scan.
> >
> > However, in HAWQ today all data in a Parquet table needs to be scanned
> > into memory before the predicate/filter is processed. We don't generate
> > the metadata when INSERTing into a Parquet table, and the scan executor
> > doesn't use it either. Maybe some scan APIs need to be refactored so that
> > we can get the predicate/filter info before executing the base relation
> > scan.
> >
> > (2) Based on the technology in (1), especially Bloom filters, more
> > optimizer techniques can be explored further. E.g. Impala implemented
> > runtime filtering
> > (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_runtime_filtering.html),
> > which can be used for
> > - dynamic partition pruning
> > - converting a join predicate to a base relation predicate
> >
> > It tells the executor to wait for a moment (the interval can be set in a
> > GUC) before executing the base relation scan. If the values of interest
> > (e.g. the column in the join predicate has only a very small set of
> > values) arrive in time, the executor can use them to filter the scan; if
> > they don't arrive in time, it scans without the filter, which doesn't
> > affect result correctness.
> >
> > Unlike the technology in (1), this one cannot be used in every case; it
> > only performs better in some cases. So it just adds more query plan
> > choices/paths, and the optimizer needs to calculate the cost based on
> > statistics and apply it only when it lowers the cost.
> >
> > All in all, maybe more similar technologies can be adopted for HAWQ now.
> > Let's start to think about performance-related technology; moreover, we
> > need to investigate how these technologies can be implemented in HAWQ.
> >
> > Any ideas or suggestions are welcome. Thanks.
> >
>
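To make point (1) of the quoted proposal concrete, here is a minimal sketch
of min/max block skipping in Python. It is not HAWQ, Parquet, or ORC code:
the Block class, make_block, and scan_greater_than are hypothetical names,
and the in-memory lists only stand in for on-disk blocks. The point it tries
to show is that the predicate has to be known before the scan starts, so
that whole blocks can be ruled out from their metadata alone.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """One storage block plus its min/max metadata (hypothetical layout)."""
    values: List[int]
    min_value: int
    max_value: int


def make_block(values: List[int]) -> Block:
    # The metadata is computed once at write time (the INSERT path).
    return Block(list(values), min(values), max(values))


def scan_greater_than(blocks: List[Block], threshold: int) -> Tuple[List[int], int]:
    """Evaluate `value > threshold`; return matching rows and blocks read."""
    rows: List[int] = []
    blocks_read = 0
    for block in blocks:
        # If even the largest value fails the predicate, no row in this
        # block can match, so the block is skipped without being read.
        if block.max_value <= threshold:
            continue
        blocks_read += 1
        rows.extend(v for v in block.values if v > threshold)
    return rows, blocks_read


blocks = [make_block([1, 5, 9]), make_block([10, 14, 18]), make_block([20, 25, 30])]
rows, blocks_read = scan_greater_than(blocks, 15)
print(rows, blocks_read)
```

With a threshold of 15 the first block is skipped purely from its max value.
A Bloom filter per block would play the same role for equality predicates,
with the usual caveat that it can report false positives but not false
negatives.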

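The wait-then-scan behaviour described in point (2) can be sketched the same
way. Again this is only an illustration under stated assumptions, not
Impala's or HAWQ's implementation: RuntimeFilter, scan_with_runtime_filter,
and join are hypothetical names, wait_seconds stands in for the GUC-controlled
interval, and an exact Python set stands in for a Bloom filter.

```python
import threading
from typing import Iterable, List, Optional, Set


class RuntimeFilter:
    """Join-key values published by the build side of a join (hypothetical)."""

    def __init__(self) -> None:
        self._ready = threading.Event()
        self._values: Optional[Set[int]] = None

    def publish(self, values: Iterable[int]) -> None:
        self._values = set(values)
        self._ready.set()

    def wait(self, timeout_seconds: float) -> Optional[Set[int]]:
        # Returns the key set if it arrived within the timeout, else None.
        if self._ready.wait(timeout_seconds):
            return self._values
        return None


def scan_with_runtime_filter(rows: Iterable[int],
                             runtime_filter: RuntimeFilter,
                             wait_seconds: float) -> List[int]:
    allowed = runtime_filter.wait(wait_seconds)
    if allowed is None:
        # Filter did not arrive in time: fall back to an unfiltered scan.
        return list(rows)
    return [r for r in rows if r in allowed]


def join(build_keys: Iterable[int], probe_rows: Iterable[int]) -> List[int]:
    # The join still applies its own predicate, so results stay correct
    # whether or not the scan was pre-filtered.
    keys = set(build_keys)
    return [r for r in probe_rows if r in keys]


fact_rows = [1, 2, 3, 4, 5, 6]
dim_keys = [2, 5]

on_time = RuntimeFilter()
on_time.publish(dim_keys)  # filter is ready before the scan starts
scanned_filtered = scan_with_runtime_filter(fact_rows, on_time, 0.01)

late = RuntimeFilter()  # never published, so the wait times out
scanned_unfiltered = scan_with_runtime_filter(fact_rows, late, 0.01)

print(scanned_filtered, scanned_unfiltered)
print(join(dim_keys, scanned_filtered) == join(dim_keys, scanned_unfiltered))
```

The filtered scan passes fewer rows up to the join, but the join output is
the same in both cases, which is why a late filter costs only the wait time.
That wait time is also the reason the optimizer would need a cost estimate
before choosing this path.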