hawq-dev mailing list archives

From Lili Ma <...@pivotal.io>
Subject Re: [Propose] More data skipping technology for IO intensive performance enhancement
Date Thu, 07 Jul 2016 02:16:35 GMT
How about we work out a draft design describing how to implement data
skipping technology for HAWQ?


Thanks
Lili

On Wed, Jul 6, 2016 at 7:23 PM, Gmail <xunzhangthu@gmail.com> wrote:

> BTW, could you create some related issues in JIRA?
>
> Thanks
> xunzhang
>
> Sent from my iPhone
>
> > On Jul 2, 2016, at 23:19, Ming Li <mli@pivotal.io> wrote:
> >
> > Data skipping technology can avoid a great deal of unnecessary IO, so it
> > can significantly improve performance for IO-intensive queries. Besides
> > skipping unnecessary table partitions according to the partition key
> > range, I think more options are available now:
> >
> > (1) The Parquet / ORC formats introduce lightweight metadata such as
> > Min/Max/Bloom filter for each block. Such metadata can be exploited when
> > the predicate/filter info can be fetched before executing the scan.
> >
> > However, in HAWQ today all data in a parquet table needs to be scanned
> > into memory before the predicate/filter is processed. We don't generate
> > the metadata when INSERTing into a parquet table, and the scan executor
> > doesn't utilize the metadata either. Maybe some scan APIs need to be
> > refactored so that we can get the predicate/filter info before executing
> > the base relation scan.
> >
> > (2) Based on the technology in (1), especially the Bloom filter, more
> > optimizer techniques can be explored further. E.g. Impala implemented
> > runtime filtering
> > (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_runtime_filtering.html),
> > which can be used for:
> > - dynamic partition pruning
> > - converting a join predicate to a base relation predicate
> >
> > It tells the executor to wait for a moment (the interval can be set in a
> > GUC) before executing the base relation scan. If the values of interest
> > arrive in time (e.g. when the column in the join predicate has only a
> > very small set of values), it can use these values to filter the scan.
> > If they don't arrive in time, it scans without the filter, which doesn't
> > affect result correctness.
> >
> > Unlike the technology in (1), this technology cannot be used in every
> > case; it only performs better in some cases. So it just adds some more
> > query plan choices/paths, and the optimizer needs to calculate the cost
> > based on statistics and apply it only when the cost goes down.
> >
> > All in all, maybe more similar technologies can be adopted for HAWQ now.
> > Let's start to think about performance-related technology; moreover, we
> > need to investigate how these technologies can be implemented in HAWQ.
> >
> > Any ideas or suggestions are welcome. Thanks.
>
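
To make option (1) in the quoted proposal concrete, here is a minimal Python sketch of min/max data skipping. It is only an illustration under stated assumptions: the names Block, make_block and scan_range are hypothetical and are not part of HAWQ, Parquet, or ORC. It shows per-block min/max metadata being recorded at write time and used to skip whole blocks once the predicate is known before the scan starts.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical storage block holding one column plus min/max metadata."""
    values: List[int]
    min_val: int
    max_val: int


def make_block(values) -> Block:
    # The metadata is computed once, at write (INSERT) time.
    values = list(values)
    return Block(values=values, min_val=min(values), max_val=max(values))


def scan_range(blocks: List[Block], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return (rows with lo <= v <= hi, number of blocks actually read)."""
    rows: List[int] = []
    blocks_read = 0
    for block in blocks:
        # Skip the block when its [min, max] range cannot overlap [lo, hi].
        if block.max_val < lo or block.min_val > hi:
            continue
        blocks_read += 1
        rows.extend(v for v in block.values if lo <= v <= hi)
    return rows, blocks_read


# Three blocks with disjoint value ranges; the predicate touches only one.
blocks = [make_block(range(0, 100)),
          make_block(range(100, 200)),
          make_block(range(200, 300))]
rows, blocks_read = scan_range(blocks, 120, 130)
```

The skipping only pays off when values are clustered so that block ranges rarely overlap the predicate range; on randomly distributed data most blocks would still be read.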

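The wait-then-scan behaviour described in option (2) can be sketched in a similar way. Again this is an assumption-laden illustration, not Impala's or HAWQ's implementation: RuntimeFilter and scan_with_runtime_filter are hypothetical names, timeout_s stands in for the GUC-controlled interval, and an exact key set is used where a real system would more likely ship a Bloom filter.

```python
import threading


class RuntimeFilter:
    """Hypothetical holder for join-key values published by the build side."""

    def __init__(self):
        self._ready = threading.Event()
        self._keys = None

    def publish(self, keys):
        # Called by the join's build side once its key set is known.
        self._keys = frozenset(keys)
        self._ready.set()

    def wait(self, timeout_s):
        # Return the key set if it arrived within timeout_s, otherwise None.
        return self._keys if self._ready.wait(timeout_s) else None


def scan_with_runtime_filter(rows, key_of, rt_filter, timeout_s):
    """Return (rows emitted by the scan, whether the filter was applied)."""
    keys = rt_filter.wait(timeout_s)
    if keys is None:
        # Filter did not arrive in time: fall back to an unfiltered scan.
        # The join above still discards non-matching rows, so results are
        # unchanged; only the amount of data passed up the plan differs.
        return list(rows), False
    return [r for r in rows if key_of(r) in keys], True


fact = [(i, i % 10) for i in range(100)]  # (row id, join key)

on_time = RuntimeFilter()
on_time.publish({3, 7})
filtered, used = scan_with_runtime_filter(fact, lambda r: r[1], on_time, 0.05)

late = RuntimeFilter()  # nothing is ever published to this filter
unfiltered, used_late = scan_with_runtime_filter(fact, lambda r: r[1], late, 0.01)
```

The timeout is the cost the proposal asks the optimizer to weigh: waiting delays every scan, so the plan only wins when the filter is selective enough to save more IO than the wait costs.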