drill-dev mailing list archives

From Jacques Nadeau <jacq...@dremio.com>
Subject Re: [DISCUSS] Remove required type
Date Tue, 22 Mar 2016 20:36:37 GMT
My suggestion is we use explicit observation at the batch level. If there
are no nulls we can optimize this batch. This would ultimately improve over
our current situation, where most Parquet and all JSON data is nullable, so
we don't optimize. I'd estimate that the vast majority of Drill's workloads
are marked nullable whether they are or not. So what we're really
suggesting is deleting a bunch of code which is rarely in the execution
path.
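
A minimal sketch of what batch-level observation could look like, assuming a simplified stand-in for a nullable vector. `NullableIntBatch` and `BatchSum` are hypothetical names for illustration and are not Drill's actual vector classes. The null count is observed once per batch, and the consumer takes a loop without null checks only when that count is zero.

```java
// Hypothetical, simplified stand-in for a nullable int vector (not Drill's real classes).
final class NullableIntBatch {
    final int[] values;
    final boolean[] isSet;   // true where the value is non-null
    final int nullCount;     // observed once, when the batch is built

    NullableIntBatch(int[] values, boolean[] isSet) {
        if (values.length != isSet.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        this.values = values;
        this.isSet = isSet;
        int nulls = 0;
        for (boolean set : isSet) {
            if (!set) {
                nulls++;
            }
        }
        this.nullCount = nulls;
    }

    boolean hasNoNulls() {
        return nullCount == 0;
    }
}

final class BatchSum {
    // Sums the non-null values; skips per-value null checks when the batch has no nulls.
    static long sum(NullableIntBatch batch) {
        long total = 0;
        if (batch.hasNoNulls()) {
            // Fast path: the batch-level observation says every value is set.
            for (int v : batch.values) {
                total += v;
            }
        } else {
            // General path: consult the null flag for each value.
            for (int i = 0; i < batch.values.length; i++) {
                if (batch.isSet[i]) {
                    total += batch.values[i];
                }
            }
        }
        return total;
    }
}
```

The vector type stays nullable in both paths; only the per-batch observation decides which loop runs.
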
On Mar 22, 2016 1:22 PM, "Aman Sinha" <amansinha@apache.org> wrote:

> I was thinking about it more after sending the previous concerns. Agree,
> this is an execution side change...but some details need to be worked out.
> If the planner indicates to the executor that a column is non-nullable (e.g.,
> a primary key), the run-time generated code is more efficient since it
> does not have to check the null bit. Are you thinking we would use the
> existing nullable vector and add some additional metadata (at a record
> batch level rather than record level) to indicate non-nullability?
>
>
> On Tue, Mar 22, 2016 at 12:27 PM, Jacques Nadeau <jacques@dremio.com>
> wrote:
>
> > Hey Aman, I believe both Steven and I were suggesting removal only
> > from execution, not planning. It seems like your concerns are all related
> > to planning. It seems like the real tradeoffs in execution are nominal.
> > On Mar 22, 2016 9:03 AM, "Aman Sinha" <amansinha@apache.org> wrote:
> >
> > > While it is true that there is code complexity due to the required type,
> > > what would we be trading off? Some important considerations:
> > >   - We don't currently have null count statistics, which would need to be
> > >     implemented for various data sources
> > >   - Primary keys in the RDBMS sources (or rowkeys in HBase) are always
> > >     non-null, and although today we may not be doing optimizations to
> > >     leverage that, one could easily add a rule that converts
> > >     WHERE primary_key IS NULL to a FALSE filter.
> > >
> > >
> > > On Tue, Mar 22, 2016 at 7:31 AM, Dave Oshinsky <doshinsky@commvault.com>
> > > wrote:
> > >
> > > > Hi Jacques,
> > > > Marginally related to this, I made a small change in PR-372 (DRILL-4184)
> > > > to support variable widths for decimal quantities in Parquet. I found the
> > > > (decimal) vectoring code to be very difficult to understand (probably
> > > > because it's overly complex, but also because I'm new to Drill code in
> > > > general), so I made a small, surgical change in my pull request to support
> > > > keeping track of variable widths (lengths) and null booleans within the
> > > > existing fixed width decimal vectoring scheme. Can my changes be
> > > > reviewed/accepted, and then we discuss how to fix properly long-term?
> > > >
> > > > Thanks,
> > > > Dave Oshinsky
> > > >
> > > > -----Original Message-----
> > > > From: Jacques Nadeau [mailto:jacques@dremio.com]
> > > > Sent: Monday, March 21, 2016 11:43 PM
> > > > To: dev
> > > > Subject: Re: [DISCUSS] Remove required type
> > > >
> > > > Definitely in support of this. The required type is a huge maintenance and
> > > > code complexity nightmare that provides little to no benefit. As you point
> > > > out, we can do better performance optimizations through null count
> > > > observation since most sources are nullable anyway.
> > > > On Mar 21, 2016 7:41 PM, "Steven Phillips" <steven@dremio.com> wrote:
> > > >
> > > > > I have been thinking about this for a while now, and I feel it would
> > > > > be a good idea to remove the Required vector types from Drill, and
> > > > > only use the Nullable version of vectors. I think this will greatly
> > > > > simplify the code.
> > > > > It will also simplify the creation of UDFs. As is, if a function has
> > > > > custom null handling (i.e. INTERNAL), the function has to be
> > > > > separately implemented for each permutation of nullability of the
> > > > > inputs. But if Drill data types are always nullable, this wouldn't be a
> > > > > problem.
> > > > >
> > > > > I don't think there would be much impact on performance. In practice,
> > > > > I think the required type is used very rarely. And there are other
> > > > > ways we can optimize for when a column is known to have no nulls.
> > > > >
> > > > > Thoughts?
> > > > >
> > > >
> > > >
> > > >
> > >
> >
>
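
To illustrate the UDF point raised in the quoted thread: with custom null handling, a two-argument function over required and nullable types needs one implementation per nullability permutation (required/required, required/nullable, nullable/required, nullable/nullable). If every input is nullable, a single implementation is enough. The sketch below assumes a simplified holder type; `NullableInt` and `AddIgnoringNulls` are hypothetical names and not Drill's UDF API.

```java
// Simplified stand-in for a nullable value holder; hypothetical, for illustration only.
final class NullableInt {
    final boolean isSet;
    final int value;

    private NullableInt(boolean isSet, int value) {
        this.isSet = isSet;
        this.value = value;
    }

    static NullableInt of(int v) {
        return new NullableInt(true, v);
    }

    static NullableInt ofNull() {
        return new NullableInt(false, 0);
    }
}

final class AddIgnoringNulls {
    // Custom null handling: a null input is treated as absent;
    // the result is null only when both inputs are null.
    static NullableInt eval(NullableInt left, NullableInt right) {
        if (!left.isSet && !right.isSet) {
            return NullableInt.ofNull();
        }
        int l = left.isSet ? left.value : 0;
        int r = right.isSet ? right.value : 0;
        return NullableInt.of(l + r);
    }
}
```

One `eval` covers all four input combinations because the nullability lives in the value rather than in the type.
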

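
The planning-side rule mentioned in the quoted thread (turning `WHERE primary_key IS NULL` into a FALSE filter) can be sketched roughly as follows. This is only an illustration of the idea, assuming the planner knows which columns are non-nullable; `IsNullSimplifier` is a hypothetical name and this is not the planner's actual rule API.

```java
final class IsNullSimplifier {
    // Returns Boolean.FALSE when `column IS NULL` can never be true because the
    // column is known to be non-nullable (e.g. a primary key). Returns null when
    // nothing is known and the predicate must still be evaluated at run time.
    static Boolean simplifyIsNull(String column, String[] nonNullableColumns) {
        for (String known : nonNullableColumns) {
            if (known.equals(column)) {
                return Boolean.FALSE;
            }
        }
        return null;
    }
}
```

This depends only on planning-time metadata, so it does not require a required vector type at execution time.
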