drill-dev mailing list archives

From: Jacques Nadeau <jacq...@dremio.com>
Subject: Re: [DISCUSS] Remove required type
Date: Wed, 23 Mar 2016 15:36:45 GMT
I agree that we should focus on real benefits versus theories.

Reduction in code complexity is a real benefit. Performance benefits from
having required types are theoretical. Dot drill files don't exist, so they
should have little bearing on this conversation.

We rarely generate required data. Most tools never generate it. The reason
the question is about actual deployments is that they would be a real
factor to counterbalance the drive for code simplification, rather than
something theoretical. A theoretical future performance regression
shouldn't stop code improvement. If it did, we wouldn't make any progress.

What about your own internal benchmark tests? If removing required types
doesn't impact them, doesn't that mean this hasn't been a point of focus?
On Mar 22, 2016 8:36 PM, "Parth Chandra" <parthc@apache.org> wrote:

> I don't know if the main question is whether people have parquet (or
> other) files which have required fields or not. With something like a dot
> drill file, a user can supply schema or format for data that does not
> carry schema, and we can certainly use the same to indicate knowledge of
> nullability. The question is whether we can take advantage of knowing
> whether data is null or not to get better performance.
>
> Any argument that applies to taking advantage of non-nullability at the
> batch level applies to taking advantage of non-nullability at the schema
> level.
>
> I'm not entirely convinced that the reduction of code complexity will
> ultimately lead to a performance gain. Sure, it improves maintainability,
> but what specific improvements are you thinking of that will increase
> performance?
>
> If you can recommend some areas of improvement that become possible as a
> result of this change, then I would suggest we run some experiments
> before we make any change.
>
> It is a capital mistake to theorize before one has data, etc...
>
> A 15% performance drop is not something to be ignored, I would think.
>
> Parth
>
> On Tue, Mar 22, 2016 at 5:40 PM, Jacques Nadeau <jacques@dremio.com>
> wrote:
> >
> > Re Performance:
> >
> > I think the main question is what portion of people's data is actually
> > marked as non-nullable in Parquet files? (We already treat json, avro,
> > kudu, and hbase (except row key) as nullable. We do treat csv as
> > non-nullable (array), but I think these workloads start with conversion
> > to Parquet.) Early on, we typically benchmarked Drill using required
> > fields in Parquet. At the time, we actually hacked the Pig code to get
> > something to even generate this format. (I believe, to this day, Pig
> > only generates nullable fields in Parquet.) After some time, we
> > recognized that basically every tool was producing Parquet files that
> > were nullable and ultimately moved the benchmark infrastructure to
> > using nullable types to do a better job of representing real-world
> > workloads.
> >
> > Based on my (fuzzy) recollection, working with nullable types had a
> > 10-15% performance impact versus working on required types, so I think
> > there is a performance impact, but I think the population of users who
> > have non-nullable Parquet files is small. If I recall correctly, Impala
> > also creates nullable Parquet files. Not sure what Spark does. I
> > believe Hive has also made this change recently or is doing it
> > (deprecating non-nulls in their internals).
> >
> > If we move forward with this, I would expect there initially would be a
> > decrease in performance when data is held as non-nullable, given we
> > previously observed this. However, I believe the reduction in code
> > complexity would let us improve other things more quickly. Which leads
> > me to...
> >
> > Re: Why
> >
> > Drill suffers from code complexity. This hurts forward progress. One
> > example is the fact that we have to generate all nullable permutations
> > of functions. (For example, if we have three arguments, we have to
> > generate 8 separate functions to cover the combinations of argument
> > nullabilities.) This leads to vastly more reliance on compile-time
> > templating, which is a maintenance headache. Additionally, it makes the
> > runtime code generation more complicated and error prone. Testing is
> > also more expensive because we now have twice as many paths to both
> > validate and maintain. Realistically, we should try to move to more
> > columnar algorithms, which would provide a bigger lift than this
> > declared-schema nullability optimization. This is because many
> > workloads have rare nulls, so we can actually optimize better at the
> > batch level. Creating three code paths (nullable observed non-null,
> > nullable observed null, and non-null) makes this substantially more
> > complicated. We want to invest in this area, but the code complexity of
> > nullable versus required makes this task less likely to happen and
> > harder to do. So, in essence, I'm arguing that it is a small short-term
> > loss that leads to better code quality and ultimately faster
> > performance.
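
A minimal illustration of the permutation problem described above
(hypothetical Java, not Drill's actual generated code): with required and
nullable variants of each argument, a two-argument function already needs
2^2 = 4 generated implementations, while an all-nullable representation
needs one.

  public class NullabilityPermutations {

    // With required/nullable variants, a two-argument add needs four
    // generated implementations: (req, req), (req, nul), (nul, req),
    // (nul, nul). Two of them sketched here:
    static int addReqReq(int a, int b) {
      return a + b;
    }

    static Integer addNullableReq(Integer a, int b) {
      return a == null ? null : a + b;
    }

    // If everything is nullable, a single implementation covers all cases.
    static Integer addNullable(Integer a, Integer b) {
      if (a == null || b == null) {
        return null; // SQL semantics: any null input yields a null result
      }
      return a + b;
    }

    public static void main(String[] args) {
      System.out.println(addNullable(3, 4));    // 7
      System.out.println(addNullable(null, 4)); // null
    }
  }
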
> >
> > Do others have real-world observations on the frequency of required
> > fields in Parquet files?
> >
> > thanks,
> > Jacques
> >
> >
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Tue, Mar 22, 2016 at 3:08 PM, Parth Chandra <parthc@apache.org>
> > wrote:
> >
> > > I'm not entirely convinced that this would have no performance
> > > impact. Do we have any experiments?
> > >
> > >
> > > On Tue, Mar 22, 2016 at 1:36 PM, Jacques Nadeau <jacques@dremio.com>
> > > wrote:
> > >
> > > > My suggestion is we use explicit observation at the batch level. If
> > > > there are no nulls, we can optimize this batch. This would
> > > > ultimately improve over our current situation, where most parquet
> > > > and all json data is nullable so we don't optimize. I'd estimate
> > > > that the vast majority of Drill's workloads are marked nullable
> > > > whether they are or not. So what we're really suggesting is deleting
> > > > a bunch of code which is rarely in the execution path.
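
A minimal sketch of this batch-level observation (names are hypothetical,
not Drill's vector API): consult a per-batch null count once, then run a
fast loop with no per-value null checks when the count is zero.

  public class BatchNullObservation {

    // Sum a batch of ints, using a batch-level null count to pick the path.
    static long sumBatch(int[] values, boolean[] isNull, int nullCount) {
      long sum = 0;
      if (nullCount == 0) {
        // Fast path: the whole batch is non-null, skip per-value checks.
        for (int v : values) {
          sum += v;
        }
      } else {
        // General path: consult the validity flag for each value.
        for (int i = 0; i < values.length; i++) {
          if (!isNull[i]) {
            sum += values[i];
          }
        }
      }
      return sum;
    }

    public static void main(String[] args) {
      System.out.println(sumBatch(new int[] {1, 2, 3},
          new boolean[] {false, true, false}, 1)); // prints 4
    }
  }
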
> > > > On Mar 22, 2016 1:22 PM, "Aman Sinha" <amansinha@apache.org> wrote:
> > > >
> > > > > I was thinking about it more after sending the previous concerns.
> > > > > Agree, this is an execution side change... but some details need
> > > > > to be worked out. If the planner indicates to the executor that a
> > > > > column is non-nullable (e.g. a primary key), the run-time
> > > > > generated code is more efficient since it does not have to check
> > > > > the null bit. Are you thinking we would use the existing nullable
> > > > > vector and add some additional metadata (at a record batch level
> > > > > rather than record level) to indicate non-nullability?
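
One hypothetical shape for such metadata (a sketch, not Drill's actual
classes): keep the existing nullable vector and attach a per-batch null
count that generated code can test once per batch.

  public final class ColumnBatchMeta {
    private final int valueCount;
    private final int nullCount;

    public ColumnBatchMeta(int valueCount, int nullCount) {
      this.valueCount = valueCount;
      this.nullCount = nullCount;
    }

    // When true, generated code may skip the null-bit check for every
    // record in this batch, recovering the required-type fast path.
    public boolean hasNoNulls() {
      return nullCount == 0;
    }

    public int getValueCount() {
      return valueCount;
    }

    public int getNullCount() {
      return nullCount;
    }
  }
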
> > > > >
> > > > >
> > > > > On Tue, Mar 22, 2016 at 12:27 PM, Jacques Nadeau
> > > > > <jacques@dremio.com> wrote:
> > > > >
> > > > > > Hey Aman, I believe both Steven and I were suggesting removal
> > > > > > only from execution, not planning. It seems like your concerns
> > > > > > are all related to planning, and the real tradeoffs in execution
> > > > > > are nominal.
> > > > > > On Mar 22, 2016 9:03 AM, "Aman Sinha" <amansinha@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > While it is true that there is code complexity due to the
> > > > > > > required type, what would we be trading off? Some important
> > > > > > > considerations:
> > > > > > >   - We don't currently have null count statistics, which would
> > > > > > > need to be implemented for various data sources.
> > > > > > >   - Primary keys in the RDBMS sources (or rowkeys in hbase)
> > > > > > > are always non-null, and although today we may not be doing
> > > > > > > optimizations to leverage that, one could easily add a rule
> > > > > > > that converts WHERE primary_key IS NULL to a FALSE filter.
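
A toy version of that rewrite (illustrative only, not Drill's planner
code): a predicate testing IS NULL on a column known to be non-nullable
folds to constant FALSE, which lets the planner prune the filter entirely.

  public class IsNullFolding {

    // Fold "col IS NULL" to FALSE when the column is known non-nullable.
    static String foldIsNull(String column, boolean nullable) {
      return nullable ? column + " IS NULL" : "FALSE";
    }

    public static void main(String[] args) {
      // A primary key is non-null by definition, so the predicate folds.
      System.out.println(foldIsNull("primary_key", false)); // FALSE
      System.out.println(foldIsNull("comment", true)); // comment IS NULL
    }
  }
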
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Mar 22, 2016 at 7:31 AM, Dave Oshinsky
> > > > > > > <doshinsky@commvault.com> wrote:
> > > > > > >
> > > > > > > > Hi Jacques,
> > > > > > > > Marginally related to this, I made a small change in PR-372
> > > > > > > > (DRILL-4184) to support variable widths for decimal
> > > > > > > > quantities in Parquet. I found the (decimal) vectoring code
> > > > > > > > to be very difficult to understand (probably because it's
> > > > > > > > overly complex, but also because I'm new to Drill code in
> > > > > > > > general), so I made a small, surgical change in my pull
> > > > > > > > request to support keeping track of variable widths
> > > > > > > > (lengths) and null booleans within the existing fixed-width
> > > > > > > > decimal vectoring scheme. Can my changes be
> > > > > > > > reviewed/accepted, and then we can discuss how to fix this
> > > > > > > > properly long-term?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Dave Oshinsky
> > > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Jacques Nadeau [mailto:jacques@dremio.com]
> > > > > > > > Sent: Monday, March 21, 2016 11:43 PM
> > > > > > > > To: dev
> > > > > > > > Subject: Re: [DISCUSS] Remove required type
> > > > > > > >
> > > > > > > > Definitely in support of this. The required type is a huge
> > > > > > > > maintenance and code complexity nightmare that provides
> > > > > > > > little to no benefit. As you point out, we can do better
> > > > > > > > performance optimizations through null count observation
> > > > > > > > since most sources are nullable anyway.
> > > > > > > > On Mar 21, 2016 7:41 PM, "Steven Phillips"
> > > > > > > > <steven@dremio.com> wrote:
> > > > > > > >
> > > > > > > > > I have been thinking about this for a while now, and I
> > > > > > > > > feel it would be a good idea to remove the Required vector
> > > > > > > > > types from Drill, and only use the Nullable version of
> > > > > > > > > vectors. I think this will greatly simplify the code. It
> > > > > > > > > will also simplify the creation of UDFs. As is, if a
> > > > > > > > > function has custom null handling (i.e. INTERNAL), the
> > > > > > > > > function has to be separately implemented for each
> > > > > > > > > permutation of nullability of the inputs. But if Drill
> > > > > > > > > data types are always nullable, this wouldn't be a
> > > > > > > > > problem.
> > > > > > > > >
> > > > > > > > > I don't think there would be much impact on performance.
> > > > > > > > > In practice, I think the required type is used very
> > > > > > > > > rarely. And there are other ways we can optimize for when
> > > > > > > > > a column is known to have no nulls.
> > > > > > > > >
> > > > > > > > > Thoughts?
> > > > > > > > >
