hawq-dev mailing list archives

From Jemish Patel <jemi...@gmail.com>
Subject Re: [Propose] More data skipping technology for IO intensive performance enhancement
Date Tue, 05 Jul 2016 17:33:00 GMT
+1 Great idea.

Jemish


On Tue, Jul 5, 2016 at 10:16 AM Shivram Mani <shivram.mani@gmail.com> wrote:

> +1 on any form of predicate pushdown. The query planner/optimizer will have
> to account for the reduced cost of plans that read less data.
>
> On Tue, Jul 5, 2016 at 9:37 AM, Kavinder Dhaliwal <kdhaliwal@pivotal.io>
> wrote:
>
> > This is an excellent idea that brings HAWQ up to speed with comparable
> > databases (Presto, Impala). In addition to taking advantage of the stats
> > available in file formats like ORC, HAWQ should also transition to
> > vectorized reading of files, as this also provides a performance boost. The
> > newer Apache ORC library only supports vectorized reading, so HAWQ should
> > also adopt these new methods.
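
A minimal sketch of the vectorized read loop mentioned above, using the Apache
ORC C++ reader; the file path and batch size are illustrative only, and the
exact API should be checked against the ORC version in use:

    // Vectorized read: rows arrive in column-oriented batches rather than
    // one tuple at a time. File path and batch size are illustrative only.
    #include <orc/OrcFile.hh>
    #include <iostream>
    #include <memory>

    int main() {
      orc::ReaderOptions readerOpts;
      std::unique_ptr<orc::Reader> reader =
          orc::createReader(orc::readLocalFile("/tmp/example.orc"), readerOpts);

      orc::RowReaderOptions rowReaderOpts;
      std::unique_ptr<orc::RowReader> rowReader =
          reader->createRowReader(rowReaderOpts);

      // Each call to next() fills a whole column-oriented batch.
      std::unique_ptr<orc::ColumnVectorBatch> batch =
          rowReader->createRowBatch(1024);
      uint64_t totalRows = 0;
      while (rowReader->next(*batch)) {
        totalRows += batch->numElements;   // process one batch per iteration
      }
      std::cout << "rows read: " << totalRows << std::endl;
      return 0;
    }

Because per-row function-call overhead is amortized across each batch, this
style of reading is typically much cheaper than tuple-at-a-time scanning.
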
> >
> > Kavinder
> >
> > On Mon, Jul 4, 2016 at 2:32 AM, Lei Chang <lei_chang@apache.org> wrote:
> >
> > > Good idea. I think it can potentially increase the performance of
> > > IO-bound workloads.
> > >
> > > Cheers
> > > Lei
> > >
> > >
> > > On Sat, Jul 2, 2016 at 11:19 PM, Ming Li <mli@pivotal.io> wrote:
> > >
> > > > Data skipping technology can avoid a great deal of unnecessary IO, so it
> > > > can greatly enhance performance for IO-intensive queries. Besides
> > > > eliminating scans of unnecessary table partitions based on the partition
> > > > key range, I think more options are available now:
> > > >
> > > > (1) The Parquet / ORC formats carry lightweight metadata such as Min/Max
> > > > statistics and Bloom filters for each block; this metadata can be
> > > > exploited when predicate/filter info can be fetched before executing the
> > > > scan.
> > > >
> > > > However, in HAWQ today all Parquet data needs to be scanned into memory
> > > > before predicates/filters are processed. We don't generate this metadata
> > > > when INSERTing into a Parquet table, and the scan executor doesn't
> > > > utilize it either. Some scan APIs may need to be refactored so that we
> > > > can get predicate/filter info before executing the base relation scan.
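
A minimal sketch of the block-skipping idea described above, assuming a
simplified per-block metadata struct (BlockMeta is hypothetical, not HAWQ's or
Parquet's actual structure):

    // Minimal sketch of min/max data skipping. BlockMeta is a hypothetical
    // stand-in for per-row-group statistics such as those carried by
    // Parquet/ORC; it is not HAWQ's actual metadata structure.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct BlockMeta {
      int64_t minValue;   // minimum of the predicate column within the block
      int64_t maxValue;   // maximum of the predicate column within the block
    };

    // Return true if a block may contain rows matching "col = value" and
    // therefore must be read; false means the block can be skipped with no IO.
    bool blockMayMatchEquals(const BlockMeta &meta, int64_t value) {
      return value >= meta.minValue && value <= meta.maxValue;
    }

    int main() {
      std::vector<BlockMeta> blocks = {{1, 100}, {101, 200}, {201, 300}};
      int64_t predicateValue = 150;   // predicate: col = 150
      for (size_t i = 0; i < blocks.size(); ++i) {
        if (blockMayMatchEquals(blocks[i], predicateValue)) {
          std::cout << "scan block " << i << std::endl;   // IO happens here
        } else {
          std::cout << "skip block " << i << std::endl;   // no IO at all
        }
      }
      return 0;
    }
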
> > > >
> > > > (2) Based on the technology in (1), especially the Bloom filter, more
> > > > optimizer techniques can be explored further. E.g. Impala implemented
> > > > Runtime filtering (
> > > > https://www.cloudera.com/documentation/enterprise/latest/topics/impala_runtime_filtering.html
> > > > ), which can be used for
> > > > - dynamic partition pruning
> > > > - converting a join predicate to a base relation predicate
> > > >
> > > > It tells the executor to wait for a moment (the interval can be set via
> > > > a GUC) before executing the base relation scan. If the interesting
> > > > values (e.g. when the column in the join predicate has only a very small
> > > > value set) arrive in time, they can be used to filter this scan; if they
> > > > don't arrive in time, the scan runs without this filter, which doesn't
> > > > impact result correctness.
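
A minimal sketch of this wait-then-filter behaviour, using a hypothetical
RuntimeFilter type and a GUC-style knob named runtime_filter_wait_ms (both
illustrative, not actual HAWQ names):

    // Wait briefly for a runtime filter from the join side; scan with it if it
    // arrives in time, otherwise scan without it. Names are hypothetical.
    #include <chrono>
    #include <future>
    #include <iostream>
    #include <unordered_set>
    #include <vector>

    using RuntimeFilter = std::unordered_set<int64_t>;  // stand-in for a Bloom filter

    void scanBaseRelation(const std::vector<int64_t> &rows,
                          const RuntimeFilter *filter) {
      for (int64_t v : rows) {
        if (filter && filter->count(v) == 0) continue;  // skipped by the filter
        std::cout << "emit row " << v << std::endl;     // would feed the join
      }
    }

    int main() {
      std::vector<int64_t> rows = {1, 2, 3, 42, 99};

      // The join side of the plan would fulfil this promise once it has
      // collected the distinct join-key values; simulated here.
      std::promise<RuntimeFilter> filterPromise;
      std::future<RuntimeFilter> filterFuture = filterPromise.get_future();
      filterPromise.set_value({42, 99});

      int runtime_filter_wait_ms = 100;   // hypothetical GUC-style knob
      RuntimeFilter filter;
      const RuntimeFilter *filterPtr = nullptr;
      if (filterFuture.wait_for(std::chrono::milliseconds(
              runtime_filter_wait_ms)) == std::future_status::ready) {
        filter = filterFuture.get();
        filterPtr = &filter;  // filter arrived in time: use it
      }                       // otherwise scan unfiltered; results are identical

      scanBaseRelation(rows, filterPtr);
      return 0;
    }
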
> > > >
> > > > Unlike the technology in (1), this technique cannot be used in every
> > > > case; it only outperforms in some cases. So it just adds some more query
> > > > plan choices/paths, and the optimizer needs to calculate the cost based
> > > > on statistics and apply it when the cost goes down.
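
A tiny sketch of that cost comparison, with purely illustrative cost formulas
and statistics:

    // Cost-based choice between the plan with and without the runtime filter.
    // All numbers and formulas are illustrative only.
    #include <iostream>

    struct PlanCost {
      double rowsScanned;
      double perRowCost;
      double setupCost;
      double total() const { return setupCost + rowsScanned * perRowCost; }
    };

    int main() {
      double tableRows = 1e8;
      double filterSelectivity = 0.01;  // estimated from join-key statistics

      PlanCost plainScan{tableRows, 1.0, 0.0};
      // setup cost covers waiting for and building the runtime filter
      PlanCost filteredScan{tableRows * filterSelectivity, 1.0, 5e5};

      bool useFilter = filteredScan.total() < plainScan.total();
      std::cout << (useFilter ? "choose runtime-filter plan"
                              : "choose plain scan") << std::endl;
      return 0;
    }
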
> > > >
> > > > All in all, maybe more technologies like these are adoptable for HAWQ
> > > > now. Let's start to think about performance-related technology;
> > > > moreover, we need to investigate how these technologies can be
> > > > implemented in HAWQ.
> > > >
> > > > Any ideas or suggestions are welcome. Thanks.
> > > >
> > >
> >
>
>
>
> --
> shivram mani
>
