drill-dev mailing list archives

From Adam Gilmore <dragoncu...@gmail.com>
Subject Re: Parquet pushdown filtering
Date Mon, 14 Dec 2015 11:33:56 GMT
Shall we say 10am my time, 4pm your time?

On Sunday, 13 December 2015, Julien Le Dem <julien@dremio.com> wrote:

> Tuesday morning in Australia, Monday afternoon in California sounds good to
> me.
>
> On Fri, Dec 11, 2015 at 11:42 AM, Parth Chandra <parthc@apache.org> wrote:
>
> > I'd like to attend as well. Any time that works for Julien/Jason works
> > for me.
> >
> > On Thu, Dec 10, 2015 at 6:15 PM, Adam Gilmore <dragoncurve@gmail.com>
> > wrote:
> >
> > > Could we say Monday or Tuesday next week?  I'm actually ahead of you
> > > guys by about 18 hours, so Monday morning my time would be Sunday
> > > afternoon/evening for you.  If that doesn't work, what about Tuesday
> > > morning my time - Monday afternoon/evening your time?
> > >
> > > On Fri, Dec 11, 2015 at 1:30 AM, Jason Altekruse
> > > <altekrusejason@gmail.com> wrote:
> > >
> > > > I can also join for this meeting, Julien and I are both on SF time.
> > > > Looks like you are about 5-6 hours behind us, so depending on if you
> > > > would prefer morning or afternoon we'll just be a little further into
> > > > our days.
> > > >
> > > > On Wed, Dec 9, 2015 at 7:16 PM, Adam Gilmore <dragoncurve@gmail.com>
> > > > wrote:
> > > >
> > > > > Sure - I'm in Australia so I'm not sure how the timezones will work
> > > > > for you guys, but I'm pretty flexible.  Where are you located?
> > > > >
> > > > > On Wed, Dec 9, 2015 at 5:48 AM, Julien Le Dem <julien@dremio.com>
> > > > > wrote:
> > > > >
> > > > > > Adam: do you want to schedule a hangout?
> > > > > >
> > > > > > On Tue, Dec 8, 2015 at 4:59 AM, Adam Gilmore <dragoncurve@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > That makes sense, yep.  The problem is I guess with my
> > > > > > > implementation.  I will iterate through all Parquet files and try
> > > > > > > to eliminate ones where the filter conflicts with the statistics.
> > > > > > > In instances where no files match the filter, I end up with an
> > > > > > > empty set of files for the Parquet scan to iterate through.  I
> > > > > > > suppose I could just pick the schema of the first file or
> > > > > > > something, but that seems like a pretty messy rule.
> > > > > > >
> > > > > > > Julien - I'd be happy to have a chat about this.  I've pretty much
> > > > > > > got the implementation down, but need to solve a few of these
> > > > > > > little issues.
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Dec 4, 2015 at 5:22 AM, Hanifi GUNES <hanifigunes@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Regarding your point #1: I guess Daniel struggled with this
> > > > > > > > limitation as well. I merged a few of his patches which addressed
> > > > > > > > empty batch (no data) handling in various places during
> > > > > > > > execution. That said, however, we still have not had time to
> > > > > > > > develop a solid way to handle empty batches with no schema.
> > > > > > > >
> > > > > > > > *- Scan batches don't allow empty batches.  This means if a
> > > > > > > > particular filter filters out *all* rows, we get an exception.*
> > > > > > > > It looks to me like you are referring to no data rather than no
> > > > > > > > schema here. I would expect graceful execution in this case. Do
> > > > > > > > you mind sharing a simple reproduction?
> > > > > > > >
> > > > > > > >
> > > > > > > > -Hanifi
> > > > > > > >
> > > > > > > > 2015-12-03 10:56 GMT-08:00 Julien Le Dem <julien@dremio.com>:
> > > > > > > >
> > > > > > > > > Hey Adam,
> > > > > > > > > If you have questions about the Parquet side of things, I'm
> > > > > > > > > happy to chat.
> > > > > > > > > Julien
> > > > > > > > >
> > > > > > > > > On Tue, Dec 1, 2015 at 10:20 PM, Parth Chandra
> > > > > > > > > <parthc@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > > Parquet metadata has the rowCount for every rowGroup which is
> > > > > > > > > > also the value count for every column in the rowGroup. Isn't
> > > > > > > > > > that what you need?
> > > > > > > > > >
> > > > > > > > > > On Tue, Dec 1, 2015 at 10:10 PM, Adam Gilmore
> > > > > > > > > > <dragoncurve@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi guys,
> > > > > > > > > > >
> > > > > > > > > > > I'm trying to (re)implement pushdown filtering for Parquet
> > > > > > > > > > > with the new Parquet metadata caching implementation.
> > > > > > > > > > >
> > > > > > > > > > > I've run into a couple of challenges:
> > > > > > > > > > >
> > > > > > > > > > >    1. Scan batches don't allow empty batches.  This means if
> > > > > > > > > > >    a particular filter filters out *all* rows, we get an
> > > > > > > > > > >    exception.  I haven't read the full comments on the
> > > > > > > > > > >    relevant JIRA items, but it seems odd that we can't query
> > > > > > > > > > >    an empty JSON file, for example.  This is a bit of a
> > > > > > > > > > >    blocker to implement the pushdown filtering properly.
> > > > > > > > > > >    2. The Parquet metadata doesn't include all the relevant
> > > > > > > > > > >    metadata.  Specifically, count of values is not included,
> > > > > > > > > > >    therefore the default Parquet statistics filter has issues
> > > > > > > > > > >    because it compares the count of values with count of
> > > > > > > > > > >    nulls to work out if it can drop it.  This isn't
> > > > > > > > > > >    necessarily a blocker, but it feels ugly simulating
> > > > > > > > > > >    there's "1" row in a block (just to get around the null
> > > > > > > > > > >    comparison).
> > > > > > > > > > >
> > > > > > > > > > > Also, it feels a bit ugly rehydrating the standard Parquet
> > > > > > > > > > > metadata objects manually.  I'm not sure I understand why we
> > > > > > > > > > > created our own objects for the Parquet metadata as opposed
> > > > > > > > > > > to simply writing a custom serializer for those objects
> > > > > > > > > > > which we store.
> > > > > > > > > > >
> > > > > > > > > > > Thoughts would be great - I'd love to get a patch out for
> > > > > > > > > > > this.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Julien
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Julien
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Julien
>
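
For readers following the quoted discussion, the file-elimination step Adam describes (drop any Parquet file whose statistics conflict with the filter) might look roughly like the sketch below. It is only an illustration under stated assumptions: `ColumnStats`, `FileMeta`, `may_match_equals` and `prune_files` are hypothetical names, not Drill or Parquet APIs, and only an equality filter on an integer column is modeled. It also shows the case raised in the thread where every file is eliminated and the scan is left with an empty list.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ColumnStats:
    """Hypothetical min/max statistics for one column of one file."""
    min_value: int
    max_value: int


@dataclass
class FileMeta:
    """Hypothetical stand-in for cached per-file Parquet metadata."""
    path: str
    stats: Dict[str, ColumnStats] = field(default_factory=dict)


def may_match_equals(stats: ColumnStats, literal: int) -> bool:
    # An equality filter can only match if the literal lies in [min, max].
    return stats.min_value <= literal <= stats.max_value


def prune_files(files: List[FileMeta], column: str, literal: int) -> List[FileMeta]:
    """Keep only files whose statistics do not rule out `column = literal`."""
    kept = []
    for f in files:
        col_stats = f.stats.get(column)
        # Without statistics we cannot prove the file is irrelevant, so keep it.
        if col_stats is None or may_match_equals(col_stats, literal):
            kept.append(f)
    return kept


files = [
    FileMeta("a.parquet", {"id": ColumnStats(1, 10)}),
    FileMeta("b.parquet", {"id": ColumnStats(11, 20)}),
]

survivors = prune_files(files, "id", 15)     # only b.parquet can contain 15
no_survivors = prune_files(files, "id", 99)  # every file is eliminated
```

When `no_survivors` is empty there is no file left to take a schema from, which is the situation the thread is trying to resolve.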

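Parth's suggestion in the quoted thread, using the row group's rowCount as the value count when comparing against the null count, can be sketched as follows. This is a rough illustration, not the actual Parquet statistics filter: `RowGroupMeta`, `ColumnChunkStats` and `is_all_null` are hypothetical names, and the sketch assumes flat (non-repeated) columns, where each row contributes exactly one value per column.

```python
from dataclasses import dataclass


@dataclass
class RowGroupMeta:
    """Hypothetical row group metadata; only the row count is modeled."""
    row_count: int


@dataclass
class ColumnChunkStats:
    """Hypothetical column statistics; only the null count is modeled."""
    null_count: int


def is_all_null(row_group: RowGroupMeta, stats: ColumnChunkStats) -> bool:
    # Assumption: for a flat column the value count equals the row count,
    # so the column is entirely null when null_count equals row_count.
    value_count = row_group.row_count
    return stats.null_count == value_count


group = RowGroupMeta(row_count=100)
all_null = is_all_null(group, ColumnChunkStats(null_count=100))
some_values = is_all_null(group, ColumnChunkStats(null_count=3))
```

A row group whose column is entirely null can be skipped for a predicate such as `col = 5`, since a null never satisfies an equality comparison.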