Mailing-List: contact dev-help@drill.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@drill.apache.org
Delivered-To: mailing list dev@drill.apache.org
Date: Tue, 21 Jun 2016 16:15:00 +0000 (UTC)
From: yuliya Feldman
Reply-To: yuliya Feldman
To: "dev@drill.apache.org"
Message-ID: <826331634.1895282.1466525700374.JavaMail.yahoo@mail.yahoo.com>
In-Reply-To: <9F93A3B3-164F-461C-B1C2-C35D2CFCCD13@maprtech.com>
References: <9346FF31-737E-4D20-A9A2-52910FBC9107@gmail.com> <589CC834-6617-4F2A-8A6D-8C4B5BA26AED@maprtech.com> <7DC68202-6186-4BC4-82DF-A01F5FB5D5A7@maprtech.com> <9F93A3B3-164F-461C-B1C2-C35D2CFCCD13@maprtech.com>
Subject: Re: Dynamic UDFs support
MIME-Version: 1.0
Content-Type: multipart/alternative;
 boundary="----=_Part_1895281_1885777437.1466525700362"
archived-at: Tue, 21 Jun 2016 16:15:17 -0000

------=_Part_1895281_1885777437.1466525700362
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Just thoughts:

You can try to reuse the distributed cache. Let the Drill AM do the
needful in terms of orchestrating UDF jar distribution.

But I would be inclined to have a common path that is independent of
whether Drill runs on YARN or not, as maintaining two separate ways of
dealing with loading/unloading UDFs will be painful and error-prone.

One more note (I left a comment in the doc): I am not sure about the
authorization model here. We need to have one.

Just my 2c.

Thanks

      From: Paul Rogers
 To: "dev@drill.apache.org"
 Sent: Monday, June 20, 2016 7:32 PM
 Subject: Re: Dynamic UDFs support

Hi Neeraja,

The proposal calls for the user to copy the jar file to each Drillbit
node. The jar would go into a new $DRILL_HOME/jars/3rdparty/udf
directory.

In Drill-on-YARN (DoY), YARN is responsible for copying Drill code to
each node (which is good). YARN puts that code in a location known only
to YARN. Since the location is private to YARN, the user can't easily
hunt down the location in order to add the udf jar. Even if the user
did find the location, the next Drillbit to start would create a new
copy of the Drill software, without the udf jar.

Second, in DoY we have separated user files from Drill software. This
makes it much easier to distribute the software to each node: we give
the Drill distribution tar archive to YARN, and YARN copies it to each
node and untars the Drill files. We make a separate copy of the (far
smaller) set of user config files.

If the udf jar goes into a Drill folder
($DRILL_HOME/jars/3rdparty/udf), then the user would have to rebuild
the Drill tar file each time they add a udf jar. When I tried this
myself when building DoY, I found it to be slow and error-prone.

So, the solution is to place the udf code in the new "site" directory:
$DRILL_SITE/jars. That's what that is for. Then, let DoY automatically
distribute the code to every node. Perfect! Except that it does not
work to dynamically distribute code after Drill starts.

For DoY, the solution requirements are:

1. Distribute code using Drill itself, rather than manually copying
jars to (unknown) Drill directories.
2. Ensure the solution works even if another Drillbit is spun up later,
and uses the original Drill tar file.

I'm thinking we want to leverage DFS: place udf files into a well-known
DFS directory. Register the udf into, say, ZK. When a new Drillbit
starts, it looks for new udf jars in ZK, copies the file to a temporary
location, and launches. An existing Drillbit is notified of the change
and does the same download process. Clean-up is needed at some point to
remove ZK entries if the udf jar becomes statically available on the
next launch. That needs more thought.

We'd still need the phases mentioned earlier to ensure consistency.

Suggestions anyone as to how to do this super simply & still get it to
work with DoY?

Thanks,

- Paul

> On Jun 20, 2016, at 7:18 PM, Neeraja Rentachintala wrote:
>
> This will need to work with YARN (once Drill is YARN enabled, I would
> expect a lot of users using it in conjunction with YARN).
> Paul, I am not clear why this wouldn't work with YARN. Can you
> elaborate?
>
> -Neeraja
>
> On Mon, Jun 20, 2016 at 7:01 PM, Paul Rogers wrote:
>
>> Good enough, as long as we document the limitation that this feature
>> can't work with YARN deployment as users generally do not have access
>> to the temporary "localization" directories where the Drill code is
>> placed by YARN.
>>
>> Note that the jar distribution race condition issue occurs with the
>> proposed design: I believe I sketched out a scenario in one of the
>> earlier comments.
>> Drillbit A receives the CREATE FUNCTION command. It tells Drillbit B.
>> While A is still informing the other Drillbits, Drillbit B plans and
>> launches a query that uses the function. Drillbit Z starts execution
>> of the query before it learns from A about the new function. This
>> will be rare, just rare enough to create very hard-to-reproduce bugs.
>>
>> The only reliable solution is to do the work in multiple passes:
>>
>> Pass 1: Ask each node to load the function, but not make it available
>> to the planner. (It would be available to the execution engine.)
>> Pass 2: Await confirmation from each node that this is done.
>> Pass 3: Alert every node that it is now free to plan queries with the
>> function.
>>
>> Finally, I wonder if we should design the SQL syntax based on a
>> long-term design, even if the feature itself is a short-term
>> work-around. Changing the syntax later might break scripts that users
>> might write.
>>
>> So, the question for the group is this: is the value of a
>> semi-complete feature sufficient to justify the potential problems?
>>
>> - Paul
>>
>>> On Jun 20, 2016, at 6:15 PM, Parth Chandra wrote:
>>>
>>> Moving discussion to dev.
>>>
>>> I believe the aim is to do a simple implementation without the
>>> complexity of distributing the UDF. I think the document should make
>>> this limitation clear.
>>>
>>> Per Paul's point on there being a simpler solution of just having
>>> each drillbit detect if a UDF is present, I think the problem is if
>>> a UDF gets deployed to some but not all drillbits. A query can then
>>> start executing but not run successfully. The intent of the create
>>> commands would be to ensure that all drillbits have the UDF or none
>>> would.
>>>
>>> I think Jacques' point about ownership conflicts is not addressed
>>> clearly. Also, the unloading is not clear. The delete command should
>>> probably remove the UDF and unload it.
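
The multi-pass registration described in the quoted message above can be sketched as a small simulation. This is a minimal Python sketch under stated assumptions, not Drill code: the Node and FailingNode classes, their load/unload/enable_planning methods, and register_function are hypothetical names invented for illustration, and the rollback on a failed load is an assumption based on the stated intent that either all drillbits have the UDF or none do.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a Drillbit; not part of Drill's API."""
    name: str
    loaded: set = field(default_factory=set)     # usable by the execution engine
    plannable: set = field(default_factory=set)  # visible to the planner

    def load(self, func: str) -> bool:
        # Pass 1: make the function executable locally, but keep it hidden
        # from the planner. The return value is the Pass 2 confirmation.
        self.loaded.add(func)
        return True

    def unload(self, func: str) -> None:
        self.loaded.discard(func)
        self.plannable.discard(func)

    def enable_planning(self, func: str) -> None:
        # Pass 3: only valid once the function has been loaded on this node.
        if func not in self.loaded:
            raise RuntimeError(f"{func} is not loaded on {self.name}")
        self.plannable.add(func)


class FailingNode(Node):
    """Hypothetical node whose load always fails, to exercise the rollback."""

    def load(self, func: str) -> bool:
        return False


def register_function(nodes: list, func: str) -> bool:
    """Register func on all nodes, or on none (assumed rollback policy)."""
    confirmed = []
    for node in nodes:                 # Pass 1: ask every node to load
        if node.load(func):            # Pass 2: collect confirmations
            confirmed.append(node)
        else:
            for done in confirmed:     # undo the partial work
                done.unload(func)
            return False
    for node in nodes:                 # Pass 3: release to the planner
        node.enable_planning(func)
    return True


if __name__ == "__main__":
    cluster = [Node("A"), Node("B"), Node("Z")]
    print(register_function(cluster, "my_udf"))
    print(all("my_udf" in n.plannable for n in cluster))
```

Because planning is enabled only after every node has confirmed the load, no node can plan a query with the function while another node still lacks it, which is the race in the Drillbit A/B/Z scenario.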
>>>
>>>
>>> On Fri, Jun 17, 2016 at 11:19 AM, Paul Rogers wrote:
>>>
>>>> Reviewed the spec; many comments posted. Three primary comments for
>>>> the community to consider.
>>>>
>>>> 1. The design conflicts with the Drill-on-YARN project. Is this a
>>>> specific fix for one unique problem, or is it worth expanding the
>>>> solution to work with Drill-on-YARN deployments? Might be hard to
>>>> make the two work together later. See comments in docs for details.
>>>>
>>>> 2. Have we, by chance, looked at how other projects handle code
>>>> distribution? Spark, Storm and others automatically deploy code
>>>> across the cluster; no manual distribution to each node. The key
>>>> difference between Drill and others is that, for Storm, say, code
>>>> is associated with a job ("topology" in Storm terms). But, in
>>>> Drill, functions are global and have no obvious life cycle that
>>>> suggests when the code can be unloaded.
>>>>
>>>> 3. Have we considered the class loader, dependency and name space
>>>> isolation issues addressed by such products as Tomcat (web apps) or
>>>> Eclipse (plugins)? Putting user code in the same namespace as Drill
>>>> code is quick & dirty. It turns out, however, that doing so leads
>>>> to problems that require long, frustrating debugging sessions to
>>>> resolve.
>>>>
>>>> Addressing item 1 might expand scope a bit. Addressing items 2 and
>>>> 3 is a big increase in scope, so I won't be surprised if we leave
>>>> those issues for later. (Though, addressing item 2 might be the
>>>> best way to address item 1.)
>>>>
>>>> If we want a very simple solution that requires minimal change,
>>>> perhaps we can use an even simpler solution. In the proposed
>>>> design, the user still must distribute code to all the nodes. The
>>>> primary change is to tell Drill to load (or unload) that code.
>>>> Can we accomplish the same result more easily, simply by having
>>>> Drill periodically scan certain directories looking for new (or
>>>> removed) jars? Still won't work with YARN, or solve the name space
>>>> issues, but will work for existing non-YARN Drill users without new
>>>> SQL syntax.
>>>>
>>>> Thanks,
>>>>
>>>> - Paul
>>>>
>>>>> On Jun 16, 2016, at 2:07 PM, Jacques Nadeau wrote:
>>>>>
>>>>> Two quick thoughts:
>>>>>
>>>>> - (user) In the design document I didn't see any discussion of
>>>>> ownership/conflicts or unloading. Would be helpful to see the
>>>>> thinking there.
>>>>> - (dev) There is a row-oriented facade via the
>>>>> FieldReader/FieldWriter/ComplexWriter classes. That would be a
>>>>> good place to start when trying to implement an alternative
>>>>> interface.
>>>>>
>>>>>
>>>>> --
>>>>> Jacques Nadeau
>>>>> CTO and Co-Founder, Dremio
>>>>>
>>>>> On Thu, Jun 16, 2016 at 11:32 AM, John Omernik wrote:
>>>>>
>>>>>> Honestly, I don't see it as a priority issue. I think some of the
>>>>>> ideas around community Java UDFs could be a better approach. I'd
>>>>>> hate to take away from other work to hack in something like this.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 16, 2016 at 1:19 PM, Paul Rogers wrote:
>>>>>>
>>>>>>> Ted refers to source code transformation. Drill gains its speed
>>>>>>> from value vectors (VVs). However, VVs are a far cry from the
>>>>>>> row-based interface that most mere mortals are accustomed to
>>>>>>> using. Since VVs are very type specific, code is typically
>>>>>>> generated to handle the specifics of each type. Accessing VVs in
>>>>>>> Jython may be a bit of a challenge because of the "impedance
>>>>>>> mismatch" between how VVs work and the row-and-column view
>>>>>>> expected by most (non-Drill) developers.
>>>>>>>
>>>>>>> I wonder if we've considered providing a row-oriented "facade"
>>>>>>> that can be used by roll-your-own data sources and user-defined
>>>>>>> row transforms? Might be a hiccup in the fast VV pipeline, but
>>>>>>> might be handy for users willing to trade a bit of speed for
>>>>>>> convenience. With such a facade, the Jython row transforms that
>>>>>>> John mentions could be quite simple.
>>>>>>>
>>>>>>> On Thu, Jun 16, 2016 at 10:36 AM, Ted Dunning wrote:
>>>>>>>
>>>>>>>> Since UDFs use source code transformation, using Jython would
>>>>>>>> be difficult.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jun 16, 2016 at 9:42 AM, Arina Yelchiyeva <
>>>>>>>> arina.yelchiyeva@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Charles,
>>>>>>>>>
>>>>>>>>> Not that I am aware of. The proposed solution doesn't invent
>>>>>>>>> anything new; it just adds the possibility to add UDFs without
>>>>>>>>> a drillbit restart. But contributions are welcome.
>>>>>>>>>
>>>>>>>>> On Thu, Jun 16, 2016 at 4:52 PM Charles Givre wrote:
>>>>>>>>>
>>>>>>>>>> Arina,
>>>>>>>>>> Has there been any discussion about making it possible via
>>>>>>>>>> Jython or something for users to write simple UDFs in Python?
>>>>>>>>>> My ideal would be to have this capability integrated in the
>>>>>>>>>> web GUI such that a user could write their UDF (in Python)
>>>>>>>>>> right there, submit it, and it would be deployed to Drill if
>>>>>>>>>> it passes validation tests.
>>>>>>>>>> -C
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Jun 16, 2016, at 09:34, Arina Yelchiyeva <
>>>>>>>>>>> arina.yelchiyeva@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi all!
>>>>>>>>>>>
>>>>>>>>>>> I have created a Jira to allow dynamic UDF support in Drill (
>>>>>>>>>>> https://issues.apache.org/jira/browse/DRILL-4726).
>>>>>>>>>>> There is a link to the design document in the Jira
>>>>>>>>>>> description.
>>>>>>>>>>> Comments or suggestions are welcome.
>>>>>>>>>>>
>>>>>>>>>>> Kind regards
>>>>>>>>>>> Arina
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>>
>>
>>

------=_Part_1895281_1885777437.1466525700362--