drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hanifi GUNES <hanifigu...@gmail.com>
Subject Re: Parquet pushdown filtering
Date Thu, 03 Dec 2015 19:22:33 GMT
Regarding your point  #1. I guess Daniel struggled with this limitation as
well. I merged few of his patches which addressed empty batch(no data)
handling in various places during execution. That said, however, we still
could not have time to develop a solid way to handle empty batches with no
schema.

*- Scan batches don't allow empty batches.  This means if a
particular filter filters out *all* rows, we get an exception.*
Looks to me, you are referring to no data rather than no schema here. I
would expect graceful execution in this case. Do you mind sharing a simple
reproduction?


-Hanifi

2015-12-03 10:56 GMT-08:00 Julien Le Dem <julien@dremio.com>:

> Hey Adam,
> If you have questions about the Parquet side of things, I'm happy to chat.
> Julien
>
> On Tue, Dec 1, 2015 at 10:20 PM, Parth Chandra <parthc@apache.org> wrote:
>
> > Parquet metadata has the rowCount for every rowGroup which is also the
> > value count for every column in the rowGroup. Isn't that what you need?
> >
> > On Tue, Dec 1, 2015 at 10:10 PM, Adam Gilmore <dragoncurve@gmail.com>
> > wrote:
> >
> > > Hi guys,
> > >
> > > I'm trying to (re)implement pushdown filtering for Parquet with the new
> > > Parquet metadata caching implementation.
> > >
> > > I've run into a couple of challenges:
> > >
> > >    1. Scan batches don't allow empty batches.  This means if a
> particular
> > >    filter filters out *all* rows, we get an exception.  I haven't read
> > the
> > >    full comments on the relevant JIRA items, but it seems odd that we
> > can't
> > >    query an empty JSON file, for example.  This is a bit of a blocker
> to
> > >    implement the pushdown filtering properly.
> > >    2. The Parquet metadata doesn't include all the relevant metadata.
> > >    Specifically, count of values is not included, therefore the default
> > >    Parquet statistics filter has issues because it compares the count
> of
> > >    values with count of nulls to work out if it can drop it.  This
> isn't
> > >    necessarily a blocker, but it feels ugly simulating there's "1" row
> > in a
> > >    block (just to get around the null comparison).
> > >
> > > Also, it feels a bit ugly rehydrating the standard Parquet metadata
> > objects
> > > manually.  I'm not sure I understand why we created our own objects for
> > the
> > > Parquet metadata as opposed to simply writing a custom serializer for
> > those
> > > objects which we store.
> > >
> > > Thoughts would be great - I'd love to get a patch out for this.
> > >
> >
>
>
>
> --
> Julien
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message