drill-dev mailing list archives

From Julian Hyde <jh...@apache.org>
Subject Re: [DISCUSS] Remove required type
Date Wed, 23 Mar 2016 21:34:08 GMT
If the function is strict (i.e. produces a null result if any of the
arguments are null) then the user needs to provide only one
implementation: the one where all arguments are not null. Drill would
check for null arguments before calling it.
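
For illustration, a strict Drill function can be written once with
framework-handled nulls; the sketch below is recalled from memory, so treat
the exact annotation and holder names as approximate rather than
authoritative:

import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.IntHolder;

// Strict function: NULL_IF_NULL asks the framework to return NULL whenever
// any argument is NULL, so only the all-arguments-present case is written.
@FunctionTemplate(name = "add_example", scope = FunctionScope.SIMPLE,
    nulls = NullHandling.NULL_IF_NULL)
public class AddExample implements DrillSimpleFunc {
  @Param IntHolder left;
  @Param IntHolder right;
  @Output IntHolder out;

  public void setup() { }

  public void eval() {
    // Runs only when both inputs are non-null; the engine handled the rest.
    out.value = left.value + right.value;
  }
}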

Most of the SQL built-in functions are strict. Aren't many user
functions also?

On Wed, Mar 23, 2016 at 1:48 PM, Jinfeng Ni <jinfengni99@gmail.com> wrote:
> If the main complexity comes from UDFs, i.e., the user has to implement
> each permutation of nullability, I feel it might be possible to allow
> users to provide only one implementation (nullable inputs), with Drill
> changing its function resolution logic and run-time code-gen logic to
> make that work. That is, treat required input as nullable input for this
> UDF. This would restrict the performance impact to queries involving
> this UDF only, not to queries in general.
>
> Users also have a trade-off choice when they implement a UDF: multiple
> implementations mean better performance; a single implementation means a
> performance penalty.
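
A rough sketch of the kind of resolution fallback described above; every
name here is invented for illustration and this is not Drill's actual
resolver code:

import java.util.List;
import java.util.stream.Collectors;

// Hypothetical types and registry, named only for illustration.
class FunctionResolver {
  interface Registry { FunctionImpl lookup(String name, List<ArgType> args); }
  interface FunctionImpl { }
  interface ArgType { ArgType asNullable(); }

  private final Registry registry;

  FunctionResolver(Registry registry) { this.registry = registry; }

  FunctionImpl resolve(String name, List<ArgType> args) {
    // Prefer an implementation matching the exact argument nullability.
    FunctionImpl exact = registry.lookup(name, args);
    if (exact != null) {
      return exact;
    }
    // Fallback: treat required inputs as nullable and use the single
    // nullable implementation, accepting the per-row null check cost.
    List<ArgType> promoted = args.stream()
        .map(ArgType::asNullable)
        .collect(Collectors.toList());
    return registry.lookup(name, promoted);
  }
}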
>
>
>
> On Wed, Mar 23, 2016 at 11:58 AM, Parth Chandra <parthc@apache.org> wrote:
>> Hmm. I may not have expressed my thoughts clearly.
>> What I was suggesting was that 'non-null' data exists in all data sets. (I
>> have at least two data sets from users with Drill in production (sorry,
>> cannot share the data) that have required fields in parquet files). The
>> fields may not be marked as such in the metadata, or the data source may
>> not have any such metadata, but if we can identify the type as non-null,
>> then we can (and should) take advantage of it.
>> If we are already taking advantage of it, then we should not make any
>> changes without understanding the tradeoffs.
>> So in the spirit of understanding that, I'd like to ask two questions:
>> 1) Where specifically are you suggesting code complexity will decrease? You
>> mentioned UDFs. Where else do you see that the code is more complex?
>> 2) Do we have any experiments to show whether columnar processing benefits
>> from eliminating required fields?
>>
>>
>>
>> On Wed, Mar 23, 2016 at 8:36 AM, Jacques Nadeau <jacques@dremio.com> wrote:
>>
>>> I agree that we should focus on real benefits versus theories.
>>>
>>> Reduction in code complexity is a real benefit. Performance benefits from
>>> having required types are theoretical. Dot drill files don't exist, so they
>>> should have little bearing on this conversation.
>>>
>>> We rarely generate required data. Most tools never generate it. The reason
>>> the question is about actual deployments is that they would be a real factor to
>>> counterbalance the drive for code simplification rather than something
>>> theoretical. A theoretical future performance regression shouldn't stop
>>> code improvement. If it did, we wouldn't make any progress.
>>>
>>> What about your own internal benchmark tests? If removing required types
>>> doesn't impact them, doesn't that mean this hasn't been a point of focus?
>>> On Mar 22, 2016 8:36 PM, "Parth Chandra" <parthc@apache.org> wrote:
>>>
>>> > I don't know if the main question is whether people have parquet (or
>>> > other) files which have required fields or not. With something like a
>>> > dot drill file, a user can supply a schema or format for data that does
>>> > not carry a schema, and we can certainly use the same to indicate
>>> > knowledge of nullability. The question is whether we can take advantage
>>> > of knowing whether data is null or not to get better performance.
>>> >
>>> > Any argument that applies to taking advantage of non-nullability at the
>>> > batch level applies to taking advantage of non-nullability at the schema
>>> > level.
>>> >
>>> > I'm not entirely convinced that the reduction of code complexity will
>>> > ultimately lead to a performance gain. Sure, it improves maintainability,
>>> > but what specific improvements are you thinking of that will increase
>>> > performance?
>>> >
>>> > If you recommend some areas of improvement that become possible as a
>>> > result of this change, then I would suggest we run some experiments
>>> > before we make any change.
>>> >
>>> > It is a capital mistake to theorize before one has data, etc...
>>> >
>>> > A 15% performance drop is not something to be ignored, I would think.
>>> >
>>> > Parth
>>> >
>>> > On Tue, Mar 22, 2016 at 5:40 PM, Jacques Nadeau <jacques@dremio.com>
>>> > wrote:
>>> > >
>>> > > Re Performance:
>>> > >
>>> > > I think the main question is what portion of people's data is actually
>>> > > marked as non-nullable in Parquet files? (We already treat json, avro,
>>> > > kudu, and hbase (except row key) as nullable. We do treat csv as
>>> > > non-nullable (array), but I think these workloads start with conversion
>>> > > to Parquet.) Early on, we typically benchmarked Drill using required
>>> > > fields in Parquet. At the time, we actually hacked the Pig code to get
>>> > > something to even generate this format. (I believe, to this day, Pig
>>> > > only generates nullable fields in Parquet.) After some time, we
>>> > > recognized that basically every tool was producing Parquet files that
>>> > > were nullable and ultimately moved the benchmark infrastructure to
>>> > > using nullable types to do a better job of representing real-world
>>> > > workloads.
>>> > >
>>> > > Based on my (fuzzy) recollection, working with nullable types had a
>>> > > 10-15% performance impact versus working on required types, so I think
>>> > > there is a performance impact, but I think the population of users who
>>> > > have non-nullable Parquet files is small. If I recall, I believe Impala
>>> > > also creates nullable Parquet files. Not sure what Spark does. I
>>> > > believe Hive has also made this change recently or is doing it
>>> > > (deprecating non-nulls in their internals).
>>> > >
>>> > > If we move forward with this, I would expect there initially would be
>>> > > a decrease in performance when data is held as non-nullable, given we
>>> > > previously observed this. However, I believe the reduction in code
>>> > > complexity would lead us to improve other things more quickly. Which
>>> > > leads me to...
>>> > >
>>> > > Re: Why
>>> > >
>>> > > Drill suffers from code complexity. This hurts forward progress. One
>>> > > example is the fact that we have to generate all nullable permutations
>>> > > of functions. (For example, if we have three arguments, we have to
>>> > > generate 8 separate functions to work with the combinations of argument
>>> > > nullabilities.) This leads to vastly more reliance on compile-time
>>> > > templating, which is a maintenance headache. Additionally, it makes the
>>> > > runtime code generation more complicated and error prone. Testing is
>>> > > also more expensive because we now have twice as many paths to both
>>> > > validate and maintain. Realistically, we should try to move to more
>>> > > columnar algorithms, which would provide a bigger lift than this
>>> > > declared schema nullability optimization. This is because many
>>> > > workloads have rare nulls, so we can actually optimize better at the
>>> > > batch level. Creating three code paths (nullable observed non-null,
>>> > > nullable observed null, and non-null) makes this substantially more
>>> > > complicated. We want to invest in this area, but the code complexity of
>>> > > nullable versus required makes these tasks less likely to happen and
>>> > > harder. So, in essence, I'm arguing that it is a small short-term loss
>>> > > that leads to better code quality and ultimately faster performance.
>>> > >
>>> > > Do others have real-world observations on the frequency of required
>>> > > fields in Parquet files?
>>> > >
>>> > > thanks,
>>> > > Jacques
>>> > >
>>> > >
>>> > >
>>> > > --
>>> > > Jacques Nadeau
>>> > > CTO and Co-Founder, Dremio
>>> > >
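
To make the permutation point concrete: with custom (INTERNAL) null
handling, each nullable argument doubles the number of hand-written
variants, so three arguments already means 2^3 = 8 eval bodies. A minimal
sketch of just two variants of a one-argument function, using Drill's value
holders (holder field names recalled from memory):

import org.apache.drill.exec.expr.holders.IntHolder;
import org.apache.drill.exec.expr.holders.NullableIntHolder;

// Sketch only: the required-input variant reads the value directly...
class NegateRequired {
  public void eval(IntHolder in, IntHolder out) {
    out.value = -in.value;
  }
}

// ...while the nullable-input variant must also branch on isSet. One extra
// variant is needed per nullability combination of the inputs, hence 2^n.
class NegateNullable {
  public void eval(NullableIntHolder in, NullableIntHolder out) {
    if (in.isSet == 0) {
      out.isSet = 0;          // propagate the null
    } else {
      out.isSet = 1;
      out.value = -in.value;
    }
  }
}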
>>> > > On Tue, Mar 22, 2016 at 3:08 PM, Parth Chandra <parthc@apache.org>
>>> > > wrote:
>>> > >
>>> > > > I'm not entirely convinced that this would have no performance
>>> > > > impact. Do we have any experiments?
>>> > > >
>>> > > >
>>> > > > On Tue, Mar 22, 2016 at 1:36 PM, Jacques Nadeau <jacques@dremio.com>
>>> > > > wrote:
>>> > > >
>>> > > > > My suggestion is we use explicit observation at the batch level.
>>> > > > > If there are no nulls we can optimize this batch. This would
>>> > > > > ultimately improve over our current situation where most parquet
>>> > > > > and all json data is nullable so we don't optimize. I'd estimate
>>> > > > > that the vast majority of Drill's workloads are marked nullable
>>> > > > > whether they are or not. So what we're really suggesting is
>>> > > > > deleting a bunch of code which is rarely in the execution path.
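
A sketch of the batch-level observation described above, written over plain
arrays rather than Drill's vector classes to keep it self-contained: check
the batch's null count once, then choose a branch-free inner loop when the
batch contains no nulls:

class BatchSum {
  // Sum a batch of ints where validity[i] == 1 marks a non-null value.
  // nullCount is metadata observed once per record batch.
  static long sumBatch(int[] values, byte[] validity, int nullCount) {
    long sum = 0;
    if (nullCount == 0) {
      // Fast path: no per-value null check, friendlier to the CPU pipeline.
      for (int i = 0; i < values.length; i++) {
        sum += values[i];
      }
    } else {
      // Slow path: consult the validity bit for each value.
      for (int i = 0; i < values.length; i++) {
        if (validity[i] == 1) {
          sum += values[i];
        }
      }
    }
    return sum;
  }
}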
>>> > > > > On Mar 22, 2016 1:22 PM, "Aman Sinha" <amansinha@apache.org>
>>> > > > > wrote:
>>> > > > >
>>> > > > > > I was thinking about it more after sending the previous
>>> > > > > > concerns. Agree, this is an execution-side change... but some
>>> > > > > > details need to be worked out. If the planner indicates to the
>>> > > > > > executor that a column is non-nullable (e.g. a primary key), the
>>> > > > > > run-time generated code is more efficient since it does not have
>>> > > > > > to check the null bit. Are you thinking we would use the existing
>>> > > > > > nullable vector and add some additional metadata (at a record
>>> > > > > > batch level rather than record level) to indicate
>>> > > > > > non-nullability?
>>> > > > > >
>>> > > > > >
>>> > > > > > On Tue, Mar 22, 2016 at 12:27 PM, Jacques Nadeau <jacques@dremio.com>
>>> > > > > > wrote:
>>> > > > > >
>>> > > > > > > Hey Aman, I believe both Steven and I were suggesting removal
>>> > > > > > > only from execution, not planning. It seems like your concerns
>>> > > > > > > are all related to planning. It seems like the real tradeoffs
>>> > > > > > > in execution are nominal.
>>> > > > > > > On Mar 22, 2016 9:03 AM, "Aman Sinha" <amansinha@apache.org>
>>> > > > > > > wrote:
>>> > > > > > >
>>> > > > > > > > While it is true that there is code complexity due to the
>>> > > > > > > > required type, what would we be trading off? Some important
>>> > > > > > > > considerations:
>>> > > > > > > >   - We don't currently have null count statistics, which
>>> > > > > > > > would need to be implemented for various data sources
>>> > > > > > > >   - Primary keys in the RDBMS sources (or rowkeys in hbase)
>>> > > > > > > > are always non-null, and although today we may not be doing
>>> > > > > > > > optimizations to leverage that, one could easily add a rule
>>> > > > > > > > that converts WHERE primary_key IS NULL to a FALSE filter
>>> > > > > > > > (see the sketch below).
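
A sketch of that rewrite in Calcite terms (Drill plans through Calcite, but
this is not actual Drill planner code): when the planner knows the operand's
type is non-nullable, an IS NULL predicate can be folded to FALSE:

import org.apache.calcite.rex.RexBuilder;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;

class IsNullSimplifier {
  // Fold "operand IS NULL" to FALSE when the type system says the column
  // can never be null (e.g. a primary key); otherwise keep the predicate.
  static RexNode simplifyIsNull(RexBuilder builder, RexNode operand) {
    if (!operand.getType().isNullable()) {
      return builder.makeLiteral(false);
    }
    return builder.makeCall(SqlStdOperatorTable.IS_NULL, operand);
  }
}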
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > > On Tue, Mar 22, 2016 at 7:31 AM, Dave Oshinsky <doshinsky@commvault.com>
>>> > > > > > > > wrote:
>>> > > > > > > >
>>> > > > > > > > > Hi Jacques,
>>> > > > > > > > > Marginally related to this, I made a small change in
>>> > > > > > > > > PR-372 (DRILL-4184) to support variable widths for decimal
>>> > > > > > > > > quantities in Parquet. I found the (decimal) vectoring code
>>> > > > > > > > > to be very difficult to understand (probably because it's
>>> > > > > > > > > overly complex, but also because I'm new to Drill code in
>>> > > > > > > > > general), so I made a small, surgical change in my pull
>>> > > > > > > > > request to support keeping track of variable widths
>>> > > > > > > > > (lengths) and null booleans within the existing fixed-width
>>> > > > > > > > > decimal vectoring scheme. Can my changes be
>>> > > > > > > > > reviewed/accepted, and then we can discuss how to fix this
>>> > > > > > > > > properly long-term?
>>> > > > > > > > >
>>> > > > > > > > > Thanks,
>>> > > > > > > > > Dave Oshinsky
>>> > > > > > > > >
>>> > > > > > > > > -----Original Message-----
>>> > > > > > > > > From: Jacques Nadeau [mailto:jacques@dremio.com]
>>> > > > > > > > > Sent: Monday, March 21, 2016 11:43 PM
>>> > > > > > > > > To: dev
>>> > > > > > > > > Subject: Re: [DISCUSS] Remove required type
>>> > > > > > > > >
>>> > > > > > > > > Definitely in support of this. The required type is a huge
>>> > > > > > > > > maintenance and code complexity nightmare that provides
>>> > > > > > > > > little to no benefit. As you point out, we can do better
>>> > > > > > > > > performance optimizations through null count observation
>>> > > > > > > > > since most sources are nullable anyway.
>>> > > > > > > > > On Mar 21, 2016 7:41 PM, "Steven Phillips" <steven@dremio.com>
>>> > > > > > > > > wrote:
>>> > > > > > > > >
>>> > > > > > > > > > I have been thinking about this for a while now, and I
>>> > > > > > > > > > feel it would be a good idea to remove the Required
>>> > > > > > > > > > vector types from Drill, and only use the Nullable
>>> > > > > > > > > > version of vectors. I think this will greatly simplify
>>> > > > > > > > > > the code.
>>> > > > > > > > > > It will also simplify the creation of UDFs. As is, if a
>>> > > > > > > > > > function has custom null handling (i.e. INTERNAL), the
>>> > > > > > > > > > function has to be separately implemented for each
>>> > > > > > > > > > permutation of nullability of the inputs. But if Drill
>>> > > > > > > > > > data types are always nullable, this wouldn't be a
>>> > > > > > > > > > problem.
>>> > > > > > > > > >
>>> > > > > > > > > > I don't think there would be much impact on performance.
>>> > > > > > > > > > In practice, I think the required type is used very
>>> > > > > > > > > > rarely. And there are other ways we can optimize for
>>> > > > > > > > > > when a column is known to have no nulls.
>>> > > > > > > > > >
>>> > > > > > > > > > Thoughts?
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > >
