Subject: Re: [DISCUSS] Remove required type
From: Jacques Nadeau
To: dev@drill.apache.org
Date: Thu, 24 Mar 2016 08:48:07 -0700

My numbers show a declarative approach is unnecessary in
execution.

>> Having the right tools would help...

Declarative is great in planning and should continue to exist. The right
tools will continue to exist.

It seems like a number of people here are worried about the performance of
future features. I'm also focused on performance. Cleaning up mistakes is
the way we're going to get to the next level of performance. It is clear
from my numbers that a columnar, observation-based approach would be a huge
win across virtually all current workloads.

I think there is a second dynamic here: this type of change, much like a
few others proposed right now, is not a trivial change: there are huge
benefits to what we're proposing, but it is possible that some workloads
won't be as good. That seems like a step-function change (and some would
call it a breaking change). I'm going to start a new thread on the creation
of a v2 branch.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Mar 24, 2016 at 8:38 AM, Aman Sinha wrote:

> With regard to the following:
>
> *>> The only time we use the "required" path is if the underlying data
> guarantees that all the data will be non-null. I believe that path is
> rarely used, poorly tested and provides only a small gain in performance
> when used.*
>
> The main reason this code path is less used is because currently there is
> no declarative way of specifying the required type. Going forward, at
> least 2 features (probably several more) would require a declarative
> approach:
>
> 1. INSERT INTO: I recall discussions from last year where we wanted to
>    keep the merged schema in some metadata file. This would allow an
>    insert row to be quickly rejected if its schema did not match the
>    merged schema.
> 2. Sort physical property of a column in files, in order to do merge-join
>    or streaming aggregate without re-sorting the data. This physical
>    property would also be declared in the metadata file.
>
> Once this functionality is added (I am not sure of the timeline, but
> likely in a few months) we could leverage the same declarative way for
> NOT NULL attributes of the underlying data.
>
> For data warehouse offloads (a major use case of Drill), we need to make
> the ForeignKey-PrimaryKey joins (assume both are guaranteed to be
> non-null for this scenario) as fast as possible to compete with the
> RDBMSs. Having the right tools would help...
>
> On Wed, Mar 23, 2016 at 2:55 PM, Jacques Nadeau wrote:
>
> > There seems to be a lot of confusion on this thread.
> >
> > We have a large amount of code that separates physical representations
> > of data that can possibly be null versus data that can't be null. We
> > have a rigid concept in MajorType of whether data is nullable or
> > required. If we change from one to the other, that is a schema change
> > inside of Drill (and is treated much the same as changing from Integer
> > to Map). As we compile expression trees, we have to constantly manage
> > whether or not items are null or not null. We also don't cast between
> > the two. So UDFs, Vector classes, code generation, schema management,
> > and schema change are all much more complicated because of this fact.
> > I proposed this complexity initially, but looking at the continued
> > cost and nominal benefit, I think it was a mistake.
> >
> > The only time we use the "required" path is if the underlying data
> > guarantees that all the data will be non-null. I believe that path is
> > rarely used, poorly tested and provides only a small gain in
> > performance when used. In essence, it creates a permutation nightmare
> > (just like us having too many minor types) with marginal benefit.
> >
> > The proposal here is to correct that mistake.
> >
> > **Separately**, Drill should take better advantage of observed
> > not-null data.
> >
> > >> You may not generate not-null data, but a lot of data is not-null.
> >
> > Yes! You are 100% correct.
Drill often chews through large amounts of data
> > that is annotated as nullable but has no nulls. For example, we run
> > benchmarks on TPCH data. The TPCH dataset doesn't have nulls. However,
> > we store the data as nullable (to be consistent with how virtually all
> > systems generate the data). As such, *Drill uses the nullable path*
> > for the entirety of execution. This is a great opportunity for
> > performance improvements. However, it is orthogonal to whether we
> > remove the code path above **since it doesn't use it**. Ultimately we
> > should allow the execution engine to decide the operation path
> > **rather than having a schema-level concept** that creates more code
> > combinations and schema change.
> >
> > My additional perspective is that having the mistake cruft above means
> > that doing the right thing of using observed nulls instead of
> > annotated nulls is substantially harder to implement and reduces the
> > likelihood that it will be implemented.
> >
> > With regards to columnar benefits for calculations (which I again
> > argue is actually orthogonal to the initial proposal), I put together
> > an ideal-condition test. In reality, we have more indirection, and I'd
> > actually expect a larger benefit moving to columnar null evaluation
> > than in this test. (For example: (1) everybody still runs with bounds
> > checking, which introduces an additional check for each null bit, and
> > (2) we always read memory values in addition to null bits before
> > inspecting the null bits.) As you can see below, having a columnar
> > approach means that performance varies little depending on
> > nullability. Optimizing for the columnar no-nulls case provides 5-6%
> > additional performance, which seems like a late optimization compared
> > to where we should be focused: moving to columnar execution.
> >
> > Benchmark                                          Mode  Cnt     Score     Error  Units
> > ColumnarComparisons.a_plus_b_columnar              avgt  200  2059.743 ±   9.625  ns/op
> > ColumnarComparisons.a_plus_b_non_null              avgt  200  1934.380 ±  10.279  ns/op
> > ColumnarComparisons.a_plus_b_current_drill         avgt  200  6737.569 ± 396.452  ns/op
> > ColumnarComparisons.a_plus_b_plus_c_columnar       avgt  200  2565.702 ±  12.139  ns/op
> > ColumnarComparisons.a_plus_b_plus_c_non_null       avgt  200  2437.322 ±  12.875  ns/op
> > ColumnarComparisons.a_plus_b_plus_c_current_drill  avgt  200  9010.913 ± 475.392  ns/op
> >
> > This comes out as:
> >
> > columnar a+b     0.5  ns/record
> > current a+b      1.7  ns/record
> > no-null a+b      0.5  ns/record
> > columnar a+b+c   0.6  ns/record
> > current a+b+c    2.25 ns/record
> > no-null a+b+c    0.6  ns/record
> >
> > Relative differences:
> > columnar versus current (a+b):   3.2x
> > columnar versus current (a+b+c): 3.5x
> > columnar no-nulls versus eval null (a+b):   1.06x
> > columnar no-nulls versus eval null (a+b+c): 1.05x
> >
> > Code here: https://gist.github.com/jacques-n/70fa5afdeadba28ea398
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Wed, Mar 23, 2016 at 11:58 AM, Parth Chandra wrote:
> >
> > > Hmm. I may not have expressed my thoughts clearly.
> > > What I was suggesting was that 'non-null' data exists in all data
> > > sets. (I have at least two data sets from users with Drill in
> > > production (sorry, cannot share the data) that have required fields
> > > in parquet files.) The fields may not be marked as such in the
> > > metadata, or the data source may not have any such metadata, but if
> > > we can identify the type as non-null, then we can (and should) take
> > > advantage of it.
> > > If we are already taking advantage of it, then we should not make
> > > any changes without understanding the tradeoffs.
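The benchmark linked above is JMH-based Java. As a rough, self-contained illustration of the two evaluation strategies it compares (plain arrays here, not Drill's actual vector classes, and much simpler than the gist), "current" per-record code branches on each null bit, while a columnar approach computes values and validity in separate, branch-free passes:

```java
// A minimal sketch, assuming flat long columns with a boolean validity array.
// These names (NullEvalSketch, addRowWise, addColumnar) are illustrative only.
public class NullEvalSketch {

  // Row-wise: check both null bits before touching the values (branch per record).
  static long[] addRowWise(long[] a, boolean[] aSet, long[] b, boolean[] bSet,
                           boolean[] outSet) {
    long[] out = new long[a.length];
    for (int i = 0; i < a.length; i++) {
      if (aSet[i] && bSet[i]) {
        out[i] = a[i] + b[i];
        outSet[i] = true;
      } else {
        outSet[i] = false;          // result is null
      }
    }
    return out;
  }

  // Columnar: add every value unconditionally, then combine validity in a
  // separate pass. Null slots hold garbage but are masked by outSet, so the
  // hot loop stays branch-free regardless of how many nulls are present.
  static long[] addColumnar(long[] a, boolean[] aSet, long[] b, boolean[] bSet,
                            boolean[] outSet) {
    long[] out = new long[a.length];
    for (int i = 0; i < a.length; i++) {
      out[i] = a[i] + b[i];
    }
    for (int i = 0; i < a.length; i++) {
      outSet[i] = aSet[i] && bSet[i];
    }
    return out;
  }
}
```

Both versions produce the same valid results; the difference is that the columnar loops are easy for the JIT to unroll and vectorize, which is consistent with the roughly 3x gap measured above.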
> > > So in the spirit of understanding that, I'd like to ask two
> > > questions:
> > > 1) Where specifically are you suggesting code complexity will
> > > decrease? You mentioned UDFs. Where else do you see the code being
> > > more complex?
> > > 2) Do we have any experiments to show whether columnar processing
> > > benefits from eliminating required fields?
> > >
> > > On Wed, Mar 23, 2016 at 8:36 AM, Jacques Nadeau wrote:
> > >
> > > > I agree that we should focus on real benefits versus theories.
> > > >
> > > > Reduction in code complexity is a real benefit. The performance
> > > > benefit from having required types is theoretical. Dot drill files
> > > > don't exist, so they should have little bearing on this
> > > > conversation.
> > > >
> > > > We rarely generate required data. Most tools never generate it.
> > > > The reason the question is about actual deployments is that they
> > > > would be a real factor to counterbalance the drive for code
> > > > simplification, rather than something theoretical. A theoretical
> > > > future performance regression shouldn't stop code improvement. If
> > > > it did, we wouldn't make any progress.
> > > >
> > > > What about your own internal benchmark tests? If removing required
> > > > types doesn't impact them, doesn't that mean this hasn't been a
> > > > point of focus?
> > > >
> > > > On Mar 22, 2016 8:36 PM, "Parth Chandra" wrote:
> > > >
> > > > > I don't know if the main question is whether people have parquet
> > > > > (or other) files which have required fields or not. With
> > > > > something like a dot drill file, a user can supply a schema or
> > > > > format for data that does not carry schema, and we can certainly
> > > > > use the same to indicate knowledge of nullability. The question
> > > > > is whether we can take advantage of knowing whether data is null
> > > > > or not to get better performance.
> > > > >
> > > > > Any argument that applies to taking advantage of
> > > > > non-nullability at the batch level applies to taking advantage
> > > > > of non-nullability at the schema level.
> > > > >
> > > > > I'm not entirely convinced that the reduction of code
> > > > > complexity ultimately leads to a performance gain. Sure, it
> > > > > improves maintainability, but what specific improvements are
> > > > > you thinking of that will increase performance?
> > > > >
> > > > > If you recommend some areas of improvement that become possible
> > > > > as a result of this change, then I would suggest we run some
> > > > > experiments before we make any change.
> > > > >
> > > > > It is a capital mistake to theorize before one has data, etc...
> > > > >
> > > > > A 15% performance drop is not something to be ignored, I would
> > > > > think.
> > > > >
> > > > > Parth
> > > > >
> > > > > On Tue, Mar 22, 2016 at 5:40 PM, Jacques Nadeau <jacques@dremio.com>
> > > > > wrote:
> > > > >
> > > > > > Re Performance:
> > > > > >
> > > > > > I think the main question is what portion of people's data is
> > > > > > actually marked as non-nullable in Parquet files? (We already
> > > > > > treat json, avro, kudu, and hbase (except the row key) as
> > > > > > nullable. We do treat csv as non-nullable (array), but I
> > > > > > think those workloads start with conversion to Parquet.)
> > > > > > Early on, we typically benchmarked Drill using required
> > > > > > fields in Parquet. At the time, we actually hacked the Pig
> > > > > > code to get something to even generate this format. (I
> > > > > > believe, to this day, Pig only generates nullable fields in
> > > > > > Parquet.)
After some time, we recognized that
> > > > > > basically every tool was producing Parquet files that were
> > > > > > nullable, and we ultimately moved the benchmark
> > > > > > infrastructure to using nullable types to do a better job of
> > > > > > representing real-world workloads.
> > > > > >
> > > > > > Based on my (fuzzy) recollection, working with nullable types
> > > > > > had a 10-15% performance impact versus working on required
> > > > > > types, so I think there is a performance impact, but I think
> > > > > > the population of users who have non-nullable Parquet files
> > > > > > is small. If I recall, I believe Impala also creates nullable
> > > > > > Parquet files. Not sure what Spark does. I believe Hive has
> > > > > > also made this change recently or is doing it (deprecating
> > > > > > non-nulls in their internals).
> > > > > >
> > > > > > If we move forward with this, I would expect there would
> > > > > > initially be a decrease in performance when data is held as
> > > > > > non-nullable, given we previously observed this. However, I
> > > > > > believe the reduction in code complexity would lead us to
> > > > > > improve other things more quickly. Which leads me to...
> > > > > >
> > > > > > Re: Why
> > > > > >
> > > > > > Drill suffers from code complexity. This hurts forward
> > > > > > progress. One example is the fact that we have to generate
> > > > > > all nullable permutations of functions. (For example, if we
> > > > > > have three arguments, we have to generate 8 separate
> > > > > > functions to work with the combinations of argument
> > > > > > nullabilities.) This leads to vastly more reliance on
> > > > > > compile-time templating, which is a maintenance headache.
> > > > > > Additionally, it makes the runtime code generation more
> > > > > > complicated and error prone.
Testing is also more expensive
> > > > > > because we now have twice as many paths to both validate and
> > > > > > maintain. Realistically, we should try to move to more
> > > > > > columnar algorithms, which would provide a bigger lift than
> > > > > > this declared-schema nullability optimization. This is
> > > > > > because many workloads have rare nulls, so we can actually
> > > > > > optimize better at the batch level. Creating three code paths
> > > > > > (nullable observed non-null, nullable observed null, and
> > > > > > non-null) makes this substantially more complicated. We want
> > > > > > to invest in this area, but the code complexity of nullable
> > > > > > versus required makes this task less likely to happen/harder.
> > > > > > So, in essence, I'm arguing that it is a small short-term
> > > > > > loss that leads to better code quality and ultimately faster
> > > > > > performance.
> > > > > >
> > > > > > Do others have real-world observations on the frequency of
> > > > > > required fields in Parquet files?
> > > > > >
> > > > > > thanks,
> > > > > > Jacques
> > > > > >
> > > > > > --
> > > > > > Jacques Nadeau
> > > > > > CTO and Co-Founder, Dremio
> > > > > >
> > > > > > On Tue, Mar 22, 2016 at 3:08 PM, Parth Chandra <parthc@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > I'm not entirely convinced that this would have no
> > > > > > > performance impact. Do we have any experiments?
> > > > > > >
> > > > > > > On Tue, Mar 22, 2016 at 1:36 PM, Jacques Nadeau <jacques@dremio.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > My suggestion is we use explicit observation at the batch
> > > > > > > > level. If there are no nulls, we can optimize this batch.
> > > > > > > > This would ultimately improve over our current situation,
> > > > > > > > where most parquet and all json data is nullable, so we
> > > > > > > > don't optimize. I'd estimate that the vast majority of
> > > > > > > > Drill's workloads are marked nullable whether they are or
> > > > > > > > not. So what we're really suggesting is deleting a bunch
> > > > > > > > of code which is rarely in the execution path.
> > > > > > > >
> > > > > > > > On Mar 22, 2016 1:22 PM, "Aman Sinha" wrote:
> > > > > > > >
> > > > > > > > > I was thinking about it more after sending the previous
> > > > > > > > > concerns. Agree, this is an execution-side change...but
> > > > > > > > > some details need to be worked out. If the planner
> > > > > > > > > indicates to the executor that a column is non-nullable
> > > > > > > > > (e.g. a primary key), the run-time generated code is
> > > > > > > > > more efficient since it does not have to check the null
> > > > > > > > > bit. Are you thinking we would use the existing
> > > > > > > > > nullable vector and add some additional metadata (at a
> > > > > > > > > record batch level rather than record level) to
> > > > > > > > > indicate non-nullability?
> > > > > > > > >
> > > > > > > > > On Tue, Mar 22, 2016 at 12:27 PM, Jacques Nadeau <jacques@dremio.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hey Aman, I believe both Steven and I were only
> > > > > > > > > > suggesting removal from execution, not planning. It
> > > > > > > > > > seems like your concerns are all related to planning.
> > > > > > > > > > It seems like the real tradeoffs in execution are
> > > > > > > > > > nominal.
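Aman's question here (keep the existing nullable vector, add record-batch-level metadata) can be sketched minimally as follows. The class and field names are hypothetical, not Drill's actual APIs; the point is only that the fast path is chosen by observation at batch-fill time rather than by a schema-level "required" type:

```java
// A sketch of batch-level observed-null metadata on an otherwise nullable
// column. IntBatch and nullCount are illustrative names, not Drill classes.
public class ObservedNullSketch {

  static final class IntBatch {
    final int[] values;
    final boolean[] isSet;   // per-record validity, as in a nullable vector
    final int nullCount;     // batch-level metadata, computed once at fill time

    IntBatch(int[] values, boolean[] isSet) {
      this.values = values;
      this.isSet = isSet;
      int nulls = 0;
      for (boolean set : isSet) {
        if (!set) nulls++;
      }
      this.nullCount = nulls;
    }
  }

  static long sum(IntBatch batch) {
    if (batch.nullCount == 0) {
      // Fast path taken because this batch was observed to have no nulls,
      // without any declared non-nullable type existing in the schema.
      long total = 0;
      for (int v : batch.values) total += v;
      return total;
    }
    // Slow path: consult the null bit per record.
    long total = 0;
    for (int i = 0; i < batch.values.length; i++) {
      if (batch.isSet[i]) total += batch.values[i];
    }
    return total;
  }
}
```

With this shape, a TPCH-style workload whose columns are declared nullable but contain no nulls would take the fast path on every batch, which is the optimization opportunity described earlier in the thread.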
> > > > > > > > > >
> > > > > > > > > > On Mar 22, 2016 9:03 AM, "Aman Sinha" <amansinha@apache.org>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > While it is true that there is code complexity due
> > > > > > > > > > > to the required type, what would we be trading off?
> > > > > > > > > > > Some important considerations:
> > > > > > > > > > > - We don't currently have null count statistics,
> > > > > > > > > > >   which would need to be implemented for various
> > > > > > > > > > >   data sources.
> > > > > > > > > > > - Primary keys in the RDBMS sources (or rowkeys in
> > > > > > > > > > >   hbase) are always non-null, and although today we
> > > > > > > > > > >   may not be doing optimizations to leverage that,
> > > > > > > > > > >   one could easily add a rule that converts WHERE
> > > > > > > > > > >   primary_key IS NULL to a FALSE filter.
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Mar 22, 2016 at 7:31 AM, Dave Oshinsky <doshinsky@commvault.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Jacques,
> > > > > > > > > > > > Marginally related to this, I made a small change
> > > > > > > > > > > > in PR-372 (DRILL-4184) to support variable widths
> > > > > > > > > > > > for decimal quantities in Parquet.
> > > > > > > > > > > > I found the (decimal) vectoring code to be very
> > > > > > > > > > > > difficult to understand (probably because it's
> > > > > > > > > > > > overly complex, but also because I'm new to Drill
> > > > > > > > > > > > code in general), so I made a small, surgical
> > > > > > > > > > > > change in my pull request to support keeping
> > > > > > > > > > > > track of variable widths (lengths) and null
> > > > > > > > > > > > booleans within the existing fixed-width decimal
> > > > > > > > > > > > vectoring scheme. Can my changes be
> > > > > > > > > > > > reviewed/accepted, and then we discuss how to fix
> > > > > > > > > > > > this properly long-term?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Dave Oshinsky
> > > > > > > > > > > >
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: Jacques Nadeau [mailto:jacques@dremio.com]
> > > > > > > > > > > > Sent: Monday, March 21, 2016 11:43 PM
> > > > > > > > > > > > To: dev
> > > > > > > > > > > > Subject: Re: [DISCUSS] Remove required type
> > > > > > > > > > > >
> > > > > > > > > > > > Definitely in support of this. The required type
> > > > > > > > > > > > is a huge maintenance and code complexity
> > > > > > > > > > > > nightmare that provides little to no benefit. As
> > > > > > > > > > > > you point out, we can do better performance
> > > > > > > > > > > > optimizations through null count observation,
> > > > > > > > > > > > since most sources are nullable anyway.
> > > > > > > > > > > > On Mar 21, 2016 7:41 PM, "Steven Phillips" <steven@dremio.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > I have been thinking about this for a while
> > > > > > > > > > > > > now, and I feel it would be a good idea to
> > > > > > > > > > > > > remove the Required vector types from Drill and
> > > > > > > > > > > > > only use the Nullable version of vectors. I
> > > > > > > > > > > > > think this will greatly simplify the code.
> > > > > > > > > > > > > It will also simplify the creation of UDFs. As
> > > > > > > > > > > > > is, if a function has custom null handling
> > > > > > > > > > > > > (i.e. INTERNAL), the function has to be
> > > > > > > > > > > > > separately implemented for each permutation of
> > > > > > > > > > > > > nullability of the inputs. But if Drill data
> > > > > > > > > > > > > types are always nullable, this wouldn't be a
> > > > > > > > > > > > > problem.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I don't think there would be much impact on
> > > > > > > > > > > > > performance. In practice, I think the required
> > > > > > > > > > > > > type is used very rarely. And there are other
> > > > > > > > > > > > > ways we can optimize for when a column is known
> > > > > > > > > > > > > to have no nulls.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thoughts?
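Steven's UDF point can be illustrated with a toy example. These are plain Java methods, not Drill's actual UDF templates or generated code; they only show why required plus nullable types force 2^n variants of a function with custom null handling, while nullable-only types need one:

```java
// Illustrative sketch of the nullability-permutation explosion. With both
// REQUIRED (primitive) and NULLABLE (boxed) input types, a 2-argument
// function with its own null handling needs all four variants below; with
// nullable-only types, the last variant alone covers every case.
public class UdfPermutations {

  static int concatLen(int aLen, int bLen) {                 // required, required
    return aLen + bLen;
  }

  static Integer concatLen(Integer aLen, int bLen) {         // nullable, required
    return aLen == null ? null : aLen + bLen;
  }

  static Integer concatLen(int aLen, Integer bLen) {         // required, nullable
    return bLen == null ? null : aLen + bLen;
  }

  static Integer concatLen(Integer aLen, Integer bLen) {     // nullable, nullable
    return (aLen == null || bLen == null) ? null : aLen + bLen;
  }

  // In general: n arguments -> 2^n variants, so 3 arguments -> 8,
  // matching the "8 separate functions" figure earlier in the thread.
  static int variantCount(int argCount) {
    return 1 << argCount;
  }
}
```

Drill's real situation is worse than this sketch, since the variants are produced through compile-time templating and runtime code generation rather than hand-written overloads, which is the maintenance burden Jacques describes above.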