From: Ahmed Eldawy
Date: Mon, 6 May 2013 12:12:51 -0500
Subject: Re: A major addition to Pig. Working with spatial data
To: dev@pig.apache.org
Reply-To: dev@pig.apache.org
Mailing-List: contact dev-help@pig.apache.org; run by ezmlm
References: <1644602176555856708@unknownmsgid>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=bcaec520f501c8dc5b04dc0fd087

--bcaec520f501c8dc5b04dc0fd087
Content-Type: text/plain; charset=UTF-8

I contacted the Solr developers to see how JTS can be included in an Apache
project. See
http://mail-archives.apache.org/mod_mbox/lucene-dev/201305.mbox/raw/%3C1367815102914-4060969.post%40n3.nabble.com%3E/

As far as I understand, they did not include it in the main Solr project.
Instead, they created a separate project (Spatial4j), which is still licensed
under the Apache license and refers to JTS. Users have to download the JTS
libraries separately to make it run. That's pretty much the same plan that
Jonathan mentioned. We will still have the overhead of
serializing/deserializing the shapes each time a function is called. Also, we
will have to use the ugly bytearray data type for spatial data instead of
creating a dedicated data type (e.g., Geometry).

I think using Spatial4j instead of JTS will not be sufficient for our case, as
we need to provide access to all the spatial functions of JTS, such as Union,
Intersection, Difference, etc. This way we can claim conformity with the OGC
standards, which gives us visibility and appreciation in the spatial community.

I also think this means I will not add any issues to JIRA, as it is now a
separate project. I'm planning to host it on GitHub and track all the issues
there. Let me know if you have any suggestions or comments.
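
To make the bytearray approach concrete, here is a minimal sketch of what "serialize/deserialize on every call" means. It is not code from this thread or from JTS: it is plain Python standing in for a Java UDF, it handles only 2D points, and the names point_to_wkb, wkb_to_point and distance_udf are hypothetical. The byte layout follows the OGC Well Known Binary (WKB) encoding for a point; in a real UDF the JTS binary reader and writer would presumably do this work.

```python
import math
import struct

WKB_POINT = 1  # geometry type code for Point in the WKB encoding


def point_to_wkb(x, y):
    """Serialize a 2D point to little-endian WKB bytes (hypothetical helper)."""
    # '<' = little-endian, no padding; 'B' = byte-order flag (1 = little-endian),
    # 'I' = geometry type, 'dd' = x and y as 64-bit floats.
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)


def wkb_to_point(data):
    """Deserialize WKB bytes back into an (x, y) tuple (hypothetical helper)."""
    byte_order = data[0]
    fmt = "<Idd" if byte_order == 1 else ">Idd"
    geom_type, x, y = struct.unpack_from(fmt, data, 1)
    if geom_type != WKB_POINT:
        raise ValueError("only points are handled in this sketch")
    return (x, y)


def distance_udf(a_bytes, b_bytes):
    """Hypothetical UDF body: both arguments arrive as opaque bytearrays."""
    ax, ay = wkb_to_point(a_bytes)  # deserialization cost is paid on every call
    bx, by = wkb_to_point(b_bytes)
    return math.hypot(bx - ax, by - ay)


if __name__ == "__main__":
    a = point_to_wkb(0.0, 0.0)
    b = point_to_wkb(3.0, 4.0)
    print(distance_udf(a, b))  # 5.0
```

The point of the sketch is the shape of the cost: every function that takes a shape has to decode it first, and every function that returns a shape has to encode it again, which a native Geometry type would avoid.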
Thanks
Ahmed

Best regards,
Ahmed Eldawy


On Mon, May 6, 2013 at 9:53 AM, Jonathan Coveney wrote:

> You can give them all the same label or tag and filter on that later on.
>
>
> 2013/5/6 Ahmed Eldawy
>
> > Thanks all for taking the time to respond. Daniel, I didn't know that Solr
> > uses JTS. This is a good finding and we can definitely ask them to see if
> > there is a workaround we can do. Jonathan, I thought of the same idea of
> > serializing/deserializing a bytearray each time a UDF is called. The
> > deserialization part is good for letting Pig auto detect spatial types if
> > not set explicitly in the schema. What is the best way to start this? I
> > want to add an initial set of JIRA issues and start working on them but I
> > also need to keep the work grouped in some sense just for organization.
> >
> > Thanks
> > Ahmed
> >
> > Best regards,
> > Ahmed Eldawy
> >
> >
> > On Sat, May 4, 2013 at 4:47 PM, Jonathan Coveney wrote:
> >
> > > I agree that this is cool, and if other projects are using JTS it is worth
> > > talking to them to see how. I also agree that licensing is very frustrating.
> > >
> > > In the short term, however, while it is annoying to have to manage the
> > > serialization and deserialization yourself, you can have the geometry type
> > > be passed around as a bytearray type. Your UDFs will have to know this and
> > > treat it accordingly, but if you did this then all of the tools could be in
> > > an external project on github instead of a branch in Pig. Then, if we can
> > > get the licensing done, we could add the Geometry type to Pig. Adding
> > > types, honestly, is kind of tedious but not super difficult, so once the
> > > rest is done, that shouldn't be too difficult.
> > >
> > >
> > > 2013/5/4 Russell Jurney
> > >
> > > > If a way could be found, this would be an awesome addition to Pig.
> > > >
> > > > Russell Jurney http://datasyndrome.com
> > > >
> > > > On May 3, 2013, at 4:09 PM, Daniel Dai wrote:
> > > >
> > > > > I am not sure how other Apache projects are dealing with it? Seems Solr
> > > > > also has some connector to JTS?
> > > > >
> > > > > Thanks,
> > > > > Daniel
> > > > >
> > > > >
> > > > > On Thu, May 2, 2013 at 11:59 AM, Ahmed Eldawy wrote:
> > > > >
> > > > >> Thanks Alan for your interest. It's too bad that an open source licensing
> > > > >> issue is holding me back from doing some open source work. I understand the
> > > > >> issue and your workarounds make sense. However, as I mentioned in the
> > > > >> beginning, I don't want to have my own branch of Pig because it makes my
> > > > >> extension less portable. I'll think of another way to do it. I'll ask Vivid
> > > > >> Solutions if they can double license their code although I think the answer
> > > > >> will be no. I'll also think of a way to ship my extension as a set of jar
> > > > >> files without the need to change the core of Pig. This way, it can be
> > > > >> easily ported to newer versions of Pig.
> > > > >>
> > > > >> Thanks
> > > > >> Ahmed
> > > > >>
> > > > >> Best regards,
> > > > >> Ahmed Eldawy
> > > > >>
> > > > >>
> > > > >> On Thu, May 2, 2013 at 12:33 PM, Alan Gates <gates@hortonworks.com> wrote:
> > > > >>
> > > > >>> I know this is frustrating, but the different licenses do have different
> > > > >>> requirements that make it so that Apache can't ship GPL code.
> > > > >>> A legal explanation is at
> > > > >>> http://www.apache.org/licenses/GPL-compatibility.html
> > > > >>> For additional info on the LGPL specific questions see
> > > > >>> http://www.apache.org/legal/3party.html
> > > > >>>
> > > > >>> As far as pulling it in via ivy, the issue isn't so much where the code
> > > > >>> lives as much as what code we are requiring to make Pig work. If something
> > > > >>> that is [L]GPL is required for Pig it violates Apache rules as outlined
> > > > >>> above. It also would be a show stopper for a lot of companies that
> > > > >>> redistribute Pig and that are allergic to GPL software.
> > > > >>>
> > > > >>> So, as I said before, if you wanted to continue with that library and they
> > > > >>> are not willing to relicense it then it would have to be bolted on after
> > > > >>> Apache Pig is built. Nothing stops you from doing this by downloading
> > > > >>> Apache Pig, adding this library and your code, and redistributing, though
> > > > >>> it wouldn't then be open to all Pig users.
> > > > >>>
> > > > >>> Alan.
> > > > >>>
> > > > >>> On May 1, 2013, at 6:08 PM, Ahmed Eldawy wrote:
> > > > >>>
> > > > >>>> Thanks for your response. I was never good at differentiating all those
> > > > >>>> open source licenses. I mean what is the point making open source licenses
> > > > >>>> if it blocks me from using a library in an open source project. Any way,
> > > > >>>> I'm not going into debate here. Just one question, if we use JTS as a
> > > > >>>> library (jar file) without adding the code in Pig, is it still a violation?
> > > > >>>> We'll use ivy, for example, to download the jar file when compiling.
> > > > >>>> On May 1, 2013 7:50 PM, "Alan Gates" wrote:
> > > > >>>>
> > > > >>>>> Passing on the technical details for a moment, I see a licensing issue.
> > > > >>>>> JTS is licensed under LGPL. Apache projects cannot contain or ship
> > > > >>>>> [L]GPL. Apache does not meet the requirements of GPL and thus we cannot
> > > > >>>>> repackage their code. If you wanted to go forward using that class this
> > > > >>>>> would have to be packaged as an add on that was downloaded separately and
> > > > >>>>> not from Apache. Another option is to work with the JTS community and see
> > > > >>>>> if they are willing to dual license their code under BSD or Apache license
> > > > >>>>> so that Pig could include it. If neither of those are an option you would
> > > > >>>>> need to come up with a new class to contain your spatial data.
> > > > >>>>>
> > > > >>>>> Alan.
> > > > >>>>>
> > > > >>>>> On May 1, 2013, at 5:40 PM, Ahmed Eldawy wrote:
> > > > >>>>>
> > > > >>>>>> Hi all,
> > > > >>>>>> First, sorry for the long email. I wanted to put all my thoughts here and
> > > > >>>>>> get your feedback.
> > > > >>>>>> I'm proposing a major addition to Pig that will greatly increase its
> > > > >>>>>> functionality and user base. It is simply to add spatial support to the
> > > > >>>>>> language and the framework. I've already started working on that but I
> > > > >>>>>> don't want it to be just another branch. I want it, eventually, to be
> > > > >>>>>> merged with the trunk of Apache Pig. So, I'm sending this email mainly to
> > > > >>>>>> reach out to the main contributors of Pig to see the feasibility of this.
> > > > >>>>>> This addition is a part of a big project we have been working on at the
> > > > >>>>>> University of Minnesota; the project is called Spatial Hadoop.
> > > > >>>>>> http://spatialhadoop.cs.umn.edu.
> > > > >>>>>> It's about building a MapReduce framework
> > > > >>>>>> (Hadoop) that is capable of maintaining and analyzing spatial data
> > > > >>>>>> efficiently. I'm the main guy behind that project and since we released its
> > > > >>>>>> first version, we received very encouraging responses from different groups
> > > > >>>>>> in the research and industrial community. I'm sure the addition we want to
> > > > >>>>>> make to Pig Latin will be widely accepted by the people in the spatial
> > > > >>>>>> community.
> > > > >>>>>> I'm proposing a plan here while we're still in the early phases of this
> > > > >>>>>> task to be able to discuss it with the main contributors and see its
> > > > >>>>>> feasibility. First of all, I think that we need to change the core of Pig
> > > > >>>>>> to be able to support spatial data. Providing a set of UDFs only is not
> > > > >>>>>> enough. The main reason is that Pig Latin does not provide a way to create
> > > > >>>>>> a new data type which is needed for spatial data. Once we have the spatial
> > > > >>>>>> data types we need, the functionality can be expanded using more UDFs.
> > > > >>>>>>
> > > > >>>>>> Here's the plan as I see it.
> > > > >>>>>> 1- Introduce a new primitive data type Geometry which represents all
> > > > >>>>>> spatial data types. In the underlying system, this will map to
> > > > >>>>>> com.vividsolutions.jts.geom.Geometry. This is a class from Java Topology
> > > > >>>>>> Suite (JTS) [http://www.vividsolutions.com/jts/JTSHome.htm], a stable and
> > > > >>>>>> efficient open source Java library for spatial data types and algorithms.
> > > > >>>>>> It is very popular in the spatial community and a C++ port of it is used in
> > > > >>>>>> PostGIS [http://postgis.net/] (a spatial library for Postgres). JTS also
> > > > >>>>>> conforms with Open Geospatial Consortium (OGC)
> > > > >>>>>> [http://www.opengeospatial.org/] which is an open standard for the spatial
> > > > >>>>>> data types. The Geometry data type is read from and written to text files
> > > > >>>>>> using the Well Known Text (WKT) format. There is also a way to convert it
> > > > >>>>>> to/from binary so that it can work with binary files and streams.
> > > > >>>>>> 2- Add functions that manipulate spatial data types. These will be added as
> > > > >>>>>> UDFs and we will not need to mess with the internals of Pig. Most probably,
> > > > >>>>>> there will be one new class for each operation (e.g., union or
> > > > >>>>>> intersection). I think it will be good to put these new operations inside
> > > > >>>>>> the core of Pig so that users can use it without having to write the fully
> > > > >>>>>> qualified class name. Also, since there is no way to implicitly cast a
> > > > >>>>>> spatial data type to a non-spatial data types, there will not be any
> > > > >>>>>> conflicts in existing operations or new operations. All new operations, and
> > > > >>>>>> only the new operations, will be working on spatial data types. Here is an
> > > > >>>>>> initial list of operations that can be added. All those operations are
> > > > >>>>>> already implemented in JTS and the UDFs added to Pig will be just wrappers
> > > > >>>>>> around them.
> > > > >>>>>> **Predicates (used for spatial filtering)
> > > > >>>>>> Equals
> > > > >>>>>> Disjoint
> > > > >>>>>> Intersects
> > > > >>>>>> Touches
> > > > >>>>>> Crosses
> > > > >>>>>> Within
> > > > >>>>>> Contains
> > > > >>>>>> Overlaps
> > > > >>>>>>
> > > > >>>>>> **Operations
> > > > >>>>>> Envelope
> > > > >>>>>> Area
> > > > >>>>>> Length
> > > > >>>>>> Buffer
> > > > >>>>>> ConvexHull
> > > > >>>>>> Intersection
> > > > >>>>>> Union
> > > > >>>>>> Difference
> > > > >>>>>> SymDifference
> > > > >>>>>>
> > > > >>>>>> **Aggregate functions
> > > > >>>>>> Accum
> > > > >>>>>> ConvexHull
> > > > >>>>>> Union
> > > > >>>>>>
> > > > >>>>>> 3- The third step is to implement spatial indexes (e.g., Grid or R-tree). A
> > > > >>>>>> Pig loader and Pig output classes will be created for those indexes. Note
> > > > >>>>>> that currently we have SpatialOutputFormat and SpatialInputFormat for those
> > > > >>>>>> indexes inside the Spatial Hadoop project, but we need to tweak them to
> > > > >>>>>> work with Pig.
> > > > >>>>>>
> > > > >>>>>> 4- (Advanced) Implement more sophisticated algorithms for spatial
> > > > >>>>>> operations that utilize the indexes. For example, we can have a specific
> > > > >>>>>> algorithm for spatial range query or spatial join. Again, we already have
> > > > >>>>>> algorithms built for different operations implemented in Spatial Hadoop as
> > > > >>>>>> MapReduce programs, but they will need to be modified to work in Pig
> > > > >>>>>> environment and get to work with other operations.
> > > > >>>>>>
> > > > >>>>>> This is my whole plan for the spatial extension to Pig. I've already
> > > > >>>>>> started with the first step but as I mentioned earlier, I don't want to do
> > > > >>>>>> the work for our project and then the work gets forgotten.
> > > > >>>>>> I want to
> > > > >>>>>> contribute to Pig and do my research at the same time. If you think the
> > > > >>>>>> plan is plausible, I'll open JIRA issues for the above tasks and start
> > > > >>>>>> shipping patches to do the stuff. I'll conform with the standards of the
> > > > >>>>>> project such as adding tests and well commenting the code.
> > > > >>>>>> Sorry for the long email and hope to hear back from you.
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> Best regards,
> > > > >>>>>> Ahmed Eldawy
> > > > >>>>>
> > > > >>>>>
> > > > >>>
> > > > >>>
> > > > >>
> > > > >
> > > >

--bcaec520f501c8dc5b04dc0fd087--