drill-dev mailing list archives

From Hanifi GUNES ...@apache.org>
Subject Re: [DISCUSS] Remove required type
Date Tue, 22 Mar 2016 21:58:03 GMT
My major concern here too would be possible performance implications. That
being said, I can see ways to speed up execution by relying on vector
density (for instance, count), but I am not sure how batch density would
work. Perhaps an example would shed some more light.
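
One possible reading of the batch-level suggestion quoted below, purely as a
hypothetical sketch: NullableIntBatch, its fields and its methods are made-up
names for illustration, not Drill's actual vector classes. The null count is
observed once per batch, and the per-row null check is skipped when it is zero.

```java
import java.util.BitSet;

/** Hypothetical nullable INT batch: values plus one validity bit per row. */
final class NullableIntBatch {
    final int[] values;
    final BitSet valid;   // bit i set => row i is non-null
    final int nullCount;  // observed once, when the batch is built

    NullableIntBatch(int[] values, BitSet valid) {
        this.values = values;
        this.valid = valid;
        int nulls = 0;
        for (int i = 0; i < values.length; i++) {
            if (!valid.get(i)) {
                nulls++;
            }
        }
        this.nullCount = nulls;
    }

    /** SUM that ignores nulls; the null bit is checked only if the batch has nulls. */
    long sum() {
        long total = 0;
        if (nullCount == 0) {
            // Dense batch: same tight loop a required vector would get.
            for (int v : values) {
                total += v;
            }
        } else {
            for (int i = 0; i < values.length; i++) {
                if (valid.get(i)) {
                    total += values[i];
                }
            }
        }
        return total;
    }

    /** COUNT(col) answered from the density information alone. */
    int count() {
        return values.length - nullCount;
    }

    public static void main(String[] args) {
        BitSet allValid = new BitSet();
        allValid.set(0, 3);
        NullableIntBatch dense = new NullableIntBatch(new int[] {1, 2, 3}, allValid);
        System.out.println(dense.sum() + " " + dense.count());
    }
}
```

The assumption here is that counting nulls once per batch is cheap compared
with the work done on the batch afterwards; whether that holds in practice
would need measuring.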

Why don't we think about some "good bad cases" to evaluate the performance
impact? I wonder to what degree performance would degrade (if at all) when
going from required to optional.
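
A minimal sketch of what such cases might look like, under assumed names
(RequiredVsOptionalCases and its methods are hypothetical, and plain arrays
stand in for Drill's vectors): the same data is summed through a required-style
loop and through an optional-style loop, with no nulls, about half nulls, and
all nulls. The timing is naive and only indicative; a proper harness such as
JMH would be needed for real numbers.

```java
import java.util.Random;

final class RequiredVsOptionalCases {
    /** Required-style loop: no null check at all. */
    static long sumRequired(int[] values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** Optional-style loop: checks a validity flag for every row. */
    static long sumOptional(int[] values, boolean[] valid) {
        long total = 0;
        for (int i = 0; i < values.length; i++) {
            if (valid[i]) {
                total += values[i];
            }
        }
        return total;
    }

    /** Validity flags with roughly the given fraction of nulls, from a fixed seed. */
    static boolean[] validity(int rows, double nullFraction, long seed) {
        Random rnd = new Random(seed);
        boolean[] valid = new boolean[rows];
        for (int i = 0; i < rows; i++) {
            valid[i] = rnd.nextDouble() >= nullFraction;
        }
        return valid;
    }

    public static void main(String[] args) {
        int rows = 1 << 20;
        int[] values = new int[rows];
        for (int i = 0; i < rows; i++) {
            values[i] = i % 7;
        }

        long start = System.nanoTime();
        long required = sumRequired(values);
        System.out.println("required: sum=" + required
                + " ns=" + (System.nanoTime() - start));

        // 0.0 = declared optional but no nulls; 0.5 = unpredictable branch; 1.0 = all null.
        for (double fraction : new double[] {0.0, 0.5, 1.0}) {
            boolean[] valid = validity(rows, fraction, 42L);
            start = System.nanoTime();
            long optional = sumOptional(values, valid);
            System.out.println("optional, nulls~" + fraction + ": sum=" + optional
                    + " ns=" + (System.nanoTime() - start));
        }
    }
}
```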

Also, a big chunk of the code to handle required is already in. Is there any
particular reason to remove it?


-Hanifi


2016-03-22 13:36 GMT-07:00 Jacques Nadeau <jacques@dremio.com>:

> My suggestion is we use explicit observation at the batch level. If there
> are no nulls we can optimize this batch. This would ultimately improve over
> our current situation where most parquet and all json data is nullable so
> we don't optimize. I'd estimate that the vast majority of Drill's workloads
> are marked nullable whether they are or not. So what we're really
> suggesting is deleting a bunch of code which is rarely in the execution
> path.
> On Mar 22, 2016 1:22 PM, "Aman Sinha" <amansinha@apache.org> wrote:
>
> > I was thinking about it more after sending the previous concerns.  Agree,
> > this is an execution side change...but some details need to be worked out.
> > If the planner indicates to the executor that a column is non-nullable (e.g.
> > a primary key), the run-time generated code is more efficient since it
> > does not have to check the null bit.  Are you thinking we would use the
> > existing nullable vector and add some additional metadata (at a record
> > batch level rather than record level) to indicate non-nullability?
> >
> >
> > On Tue, Mar 22, 2016 at 12:27 PM, Jacques Nadeau <jacques@dremio.com>
> > wrote:
> >
> > > Hey Aman, I believe both Steven and I were suggesting removal only
> > > from execution, not planning. It seems like your concerns are all related
> > > to planning. It seems like the real tradeoffs in execution are nominal.
> > > On Mar 22, 2016 9:03 AM, "Aman Sinha" <amansinha@apache.org> wrote:
> > >
> > > > While it is true that there is code complexity due to the required type,
> > > > what would we be trading off?  Some important considerations:
> > > >   - We don't currently have null count statistics which would need to be
> > > >     implemented for various data sources
> > > >   - Primary keys in the RDBMS sources (or rowkeys in hbase) are always
> > > >     non-null, and although today we may not be doing optimizations to
> > > >     leverage that, one could easily add a rule that converts
> > > >     WHERE primary_key IS NULL to a FALSE filter.
> > > >
> > > >
> > > > On Tue, Mar 22, 2016 at 7:31 AM, Dave Oshinsky <doshinsky@commvault.com>
> > > > wrote:
> > > >
> > > > > Hi Jacques,
> > > > > Marginally related to this, I made a small change in PR-372 (DRILL-4184)
> > > > > to support variable widths for decimal quantities in Parquet.  I found the
> > > > > (decimal) vectoring code to be very difficult to understand (probably
> > > > > because it's overly complex, but also because I'm new to Drill code in
> > > > > general), so I made a small, surgical change in my pull request to support
> > > > > keeping track of variable widths (lengths) and null booleans within the
> > > > > existing fixed width decimal vectoring scheme.  Can my changes be
> > > > > reviewed/accepted, and then we discuss how to fix properly long-term?
> > > > >
> > > > > Thanks,
> > > > > Dave Oshinsky
> > > > >
> > > > > -----Original Message-----
> > > > > From: Jacques Nadeau [mailto:jacques@dremio.com]
> > > > > Sent: Monday, March 21, 2016 11:43 PM
> > > > > To: dev
> > > > > Subject: Re: [DISCUSS] Remove required type
> > > > >
> > > > > Definitely in support of this. The required type is a huge maintenance and
> > > > > code complexity nightmare that provides little to no benefit. As you point
> > > > > out, we can do better performance optimizations through null count
> > > > > observation since most sources are nullable anyway.
> > > > > On Mar 21, 2016 7:41 PM, "Steven Phillips" <steven@dremio.com> wrote:
> > > > >
> > > > > > I have been thinking about this for a while now, and I feel it would
> > > > > > be a good idea to remove the Required vector types from Drill, and
> > > > > > only use the Nullable version of vectors. I think this will greatly
> > > > > > simplify the code.
> > > > > > It will also simplify the creation of UDFs. As is, if a function has
> > > > > > custom null handling (i.e. INTERNAL), the function has to be
> > > > > > separately implemented for each permutation of nullability of the
> > > > > > inputs. But if drill data types are always nullable, this wouldn't be a
> > > > > > problem.
> > > > > >
> > > > > > I don't think there would be much impact on performance. In practice,
> > > > > > I think the required type is used very rarely. And there are other
> > > > > > ways we can optimize for when a column is known to have no nulls.
> > > > > >
> > > > > > Thoughts?
> > > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
