pig-dev mailing list archives

From Ahmed Eldawy <aseld...@gmail.com>
Subject Re: A major addition to Pig. Working with spatial data
Date Mon, 06 May 2013 17:12:51 GMT
I contacted the Solr developers to see how JTS can be included in an Apache
project. See
http://mail-archives.apache.org/mod_mbox/lucene-dev/201305.mbox/raw/%3C1367815102914-4060969.post%40n3.nabble.com%3E/
As far as I understand, they did not include it in the main Solr project.
Instead, they created a separate project (Spatial4j), which is still
licensed under the Apache license and refers to JTS. Users have to
download the JTS libraries separately to make it run. That's pretty much the
same plan that Jonathan mentioned. We will still have the overhead of
serializing/deserializing the shapes each time a function is called. Also,
we will have to use the ugly bytearray data type for spatial data instead
of giving it its own data type (e.g., Geometry).
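To make that per-call overhead concrete, here is a minimal sketch of the
round trip. It is plain Java with no Pig or JTS dependency: the class and
method names (WkbPointRoundTrip, toWkb, fromWkb, distance) are made up for
illustration, and it only handles a 2D point in the OGC Well-Known Binary
layout. In the real extension I would expect JTS's WKBReader/WKBWriter to do
the decoding/encoding and Pig to hand the UDF a bytearray value, but that
wiring is an assumption, not something that exists yet.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only: a hand-rolled WKB codec for a 2D point,
// standing in for what a real geometry library would do.
public class WkbPointRoundTrip {
    // Geometry type code for Point in Well-Known Binary.
    static final int WKB_POINT = 1;

    // Serialize a point: 1 byte order flag, 4 byte type, two 8 byte doubles.
    public static byte[] toWkb(double x, double y) {
        ByteBuffer buf = ByteBuffer.allocate(21).order(ByteOrder.LITTLE_ENDIAN);
        buf.put((byte) 1);       // 1 = little-endian, 0 = big-endian
        buf.putInt(WKB_POINT);
        buf.putDouble(x);
        buf.putDouble(y);
        return buf.array();
    }

    // Deserialize a point, honoring the byte order flag in the first byte.
    public static double[] fromWkb(byte[] wkb) {
        ByteBuffer buf = ByteBuffer.wrap(wkb);
        buf.order(buf.get() == 1 ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN);
        int type = buf.getInt();
        if (type != WKB_POINT) {
            throw new IllegalArgumentException("not a WKB point, type=" + type);
        }
        return new double[] { buf.getDouble(), buf.getDouble() };
    }

    // Stand-in for a UDF body: both arguments arrive as raw bytes and must be
    // decoded on every single call before any geometry work can happen.
    public static double distance(byte[] a, byte[] b) {
        double[] p = fromWkb(a);
        double[] q = fromWkb(b);
        return Math.hypot(p[0] - q[0], p[1] - q[1]);
    }

    public static void main(String[] args) {
        byte[] a = toWkb(0.0, 0.0);
        byte[] b = toWkb(3.0, 4.0);
        System.out.println("distance = " + distance(a, b));
    }
}
```

With a native Geometry type the decode step inside distance() would go away,
which is the cost I am referring to above.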
I think using Spatial4j instead of JTS will not be sufficient for our case,
as we need to provide access to all the spatial functions of JTS, such as
Union, Intersection, Difference, etc. This way we can claim conformance
with the OGC standards, which brings visibility and appreciation from the
spatial community.
I also think this means I will not add any issues to JIRA, since it is now
a separate project. I'm planning to host it on GitHub and track all the
issues there.
Let me know if you have any suggestions or comments.

Thanks
Ahmed


Best regards,
Ahmed Eldawy


On Mon, May 6, 2013 at 9:53 AM, Jonathan Coveney <jcoveney@gmail.com> wrote:

> You can give them all the same label or tag and filter on that later on.
>
>
> 2013/5/6 Ahmed Eldawy <aseldawy@gmail.com>
>
> > Thanks all for taking the time to respond. Daniel, I didn't know that Solr
> > uses JTS. This is a good finding and we can definitely ask them to see if
> > there is a workaround we can do. Jonathan, I thought of the same idea of
> > serializing/deserializing a bytearray each time a UDF is called. The
> > deserialization part is good for letting Pig auto-detect spatial types if
> > not set explicitly in the schema. What is the best way to start this? I
> > want to add an initial set of JIRA issues and start working on them but I
> > also need to keep the work grouped in some sense just for organization.
> >
> > Thanks
> > Ahmed
> >
> > Best regards,
> > Ahmed Eldawy
> >
> >
> > On Sat, May 4, 2013 at 4:47 PM, Jonathan Coveney <jcoveney@gmail.com>
> > wrote:
> >
> > > I agree that this is cool, and if other projects are using JTS it is
> > > worth talking to them to see how. I also agree that licensing is very
> > > frustrating.
> > >
> > > In the short term, however, while it is annoying to have to manage the
> > > serialization and deserialization yourself, you can have the geometry
> > > type be passed around as a bytearray type. Your UDFs will have to know
> > > this and treat it accordingly, but if you did this then all of the tools
> > > could be in an external project on GitHub instead of a branch in Pig.
> > > Then, if we can get the licensing done, we could add the Geometry type
> > > to Pig. Adding types, honestly, is kind of tedious but not super
> > > difficult, so once the rest is done, that shouldn't be too difficult.
> > >
> > >
> > > 2013/5/4 Russell Jurney <russell.jurney@gmail.com>
> > >
> > > > If a way could be found, this would be an awesome addition to Pig.
> > > >
> > > > Russell Jurney http://datasyndrome.com
> > > >
> > > > On May 3, 2013, at 4:09 PM, Daniel Dai <daijy@hortonworks.com> wrote:
> > > >
> > > > > I am not sure how other Apache projects deal with it. It seems Solr
> > > > > also has some connector to JTS?
> > > > >
> > > > > Thanks,
> > > > > Daniel
> > > > >
> > > > >
> > > > > On Thu, May 2, 2013 at 11:59 AM, Ahmed Eldawy <aseldawy@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Thanks Alan for your interest. It's too bad that an open source
> > > > >> licensing issue is holding me back from doing some open source work.
> > > > >> I understand the issue and your workarounds make sense. However, as
> > > > >> I mentioned in the beginning, I don't want to have my own branch of
> > > > >> Pig because it makes my extension less portable. I'll think of
> > > > >> another way to do it. I'll ask Vivid Solutions if they can dual
> > > > >> license their code, although I think the answer will be no. I'll
> > > > >> also think of a way to ship my extension as a set of jar files
> > > > >> without the need to change the core of Pig. This way, it can be
> > > > >> easily ported to newer versions of Pig.
> > > > >>
> > > > >> Thanks
> > > > >> Ahmed
> > > > >>
> > > > >> Best regards,
> > > > >> Ahmed Eldawy
> > > > >>
> > > > >>
> > > > >> On Thu, May 2, 2013 at 12:33 PM, Alan Gates <gates@hortonworks.com>
> > > > >> wrote:
> > > > >>
> > > > >>> I know this is frustrating, but the different licenses do have
> > > > >>> different requirements that make it so that Apache can't ship GPL
> > > > >>> code. A legal explanation is at
> > > > >>> http://www.apache.org/licenses/GPL-compatibility.html
> > > > >>> For additional info on the LGPL specific questions see
> > > > >>> http://www.apache.org/legal/3party.html
> > > > >>>
> > > > >>> As far as pulling it in via ivy, the issue isn't so much where the
> > > > >>> code lives as much as what code we are requiring to make Pig work.
> > > > >>> If something that is [L]GPL is required for Pig it violates Apache
> > > > >>> rules as outlined above. It also would be a show stopper for a lot
> > > > >>> of companies that redistribute Pig and that are allergic to GPL
> > > > >>> software.
> > > > >>>
> > > > >>> So, as I said before, if you wanted to continue with that library
> > > > >>> and they are not willing to relicense it then it would have to be
> > > > >>> bolted on after Apache Pig is built. Nothing stops you from doing
> > > > >>> this by downloading Apache Pig, adding this library and your code,
> > > > >>> and redistributing, though it wouldn't then be open to all Pig
> > > > >>> users.
> > > > >>>
> > > > >>> Alan.
> > > > >>>
> > > > >>> On May 1, 2013, at 6:08 PM, Ahmed Eldawy wrote:
> > > > >>>
> > > > >>>> Thanks for your response. I was never good at differentiating
> > > > >>>> all those open source licenses. I mean what is the point making
> > > > >>>> open source licenses if it blocks me from using a library in an
> > > > >>>> open source project. Any way, I'm not going into debate here.
> > > > >>>> Just one question, if we use JTS as a library (jar file) without
> > > > >>>> adding the code in Pig, is it still a violation? We'll use ivy,
> > > > >>>> for example, to download the jar file when compiling.
> > > > >>>> On May 1, 2013 7:50 PM, "Alan Gates" <gates@hortonworks.com>
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>>> Passing on the technical details for a moment, I see a licensing
> > > > >>>>> issue. JTS is licensed under LGPL. Apache projects cannot
> > > > >>>>> contain or ship [L]GPL. Apache does not meet the requirements of
> > > > >>>>> GPL and thus we cannot repackage their code. If you wanted to go
> > > > >>>>> forward using that class this would have to be packaged as an
> > > > >>>>> add on that was downloaded separately and not from Apache.
> > > > >>>>> Another option is to work with the JTS community and see if they
> > > > >>>>> are willing to dual license their code under BSD or Apache
> > > > >>>>> license so that Pig could include it. If neither of those are an
> > > > >>>>> option you would need to come up with a new class to contain
> > > > >>>>> your spatial data.
> > > > >>>>>
> > > > >>>>> Alan.
> > > > >>>>>
> > > > >>>>> On May 1, 2013, at 5:40 PM, Ahmed Eldawy wrote:
> > > > >>>>>
> > > > >>>>>> Hi all,
> > > > >>>>>> First, sorry for the long email. I wanted to put all my
> > > > >>>>>> thoughts here and get your feedback.
> > > > >>>>>> I'm proposing a major addition to Pig that will greatly
> > > > >>>>>> increase its functionality and user base. It is simply to add
> > > > >>>>>> spatial support to the language and the framework. I've already
> > > > >>>>>> started working on that but I don't want it to be just another
> > > > >>>>>> branch. I want it, eventually, to be merged with the trunk of
> > > > >>>>>> Apache Pig. So, I'm sending this email mainly to reach out the
> > > > >>>>>> main contributors of Pig to see the feasibility of this.
> > > > >>>>>> This addition is a part of a big project we have been working
> > > > >>>>>> on in University of Minnesota; the project is called Spatial
> > > > >>>>>> Hadoop. http://spatialhadoop.cs.umn.edu. It's about building a
> > > > >>>>>> MapReduce framework (Hadoop) that is capable of maintaining and
> > > > >>>>>> analyzing spatial data efficiently. I'm the main guy behind
> > > > >>>>>> that project and since we released its first version, we
> > > > >>>>>> received very encouraging responses from different groups in
> > > > >>>>>> the research and industrial community. I'm sure the addition we
> > > > >>>>>> want to make to Pig Latin will be widely accepted by the people
> > > > >>>>>> in the spatial community.
> > > > >>>>>> I'm proposing a plan here while we're still in the early phases
> > > > >>>>>> of this task to be able to discuss it with the main
> > > > >>>>>> contributors and see its feasibility. First of all, I think
> > > > >>>>>> that we need to change the core of Pig to be able to support
> > > > >>>>>> spatial data. Providing a set of UDFs only is not enough. The
> > > > >>>>>> main reason is that Pig Latin does not provide a way to create
> > > > >>>>>> a new data type which is needed for spatial data. Once we have
> > > > >>>>>> the spatial data types we need, the functionality can be
> > > > >>>>>> expanded using more UDFs.
> > > > >>>>>>
> > > > >>>>>> Here's the plan as I see it.
> > > > >>>>>> 1- Introduce a new primitive data type Geometry which
> > > > >>>>>> represents all spatial data types. In the underlying system,
> > > > >>>>>> this will map to com.vividsolutions.jts.geom.Geometry. This is
> > > > >>>>>> a class from Java Topology Suite (JTS)
> > > > >>>>>> [http://www.vividsolutions.com/jts/JTSHome.htm], a stable and
> > > > >>>>>> efficient open source Java library for spatial data types and
> > > > >>>>>> algorithms. It is very popular in the spatial community and a
> > > > >>>>>> C++ port of it is used in PostGIS [http://postgis.net/] (a
> > > > >>>>>> spatial library for Postgres). JTS also conforms with Open
> > > > >>>>>> Geospatial Consortium (OGC) [http://www.opengeospatial.org/]
> > > > >>>>>> which is an open standard for the spatial data types. The
> > > > >>>>>> Geometry data type is read from and written to text files using
> > > > >>>>>> the Well Known Text (WKT) format. There is also a way to
> > > > >>>>>> convert it to/from binary so that it can work with binary files
> > > > >>>>>> and streams.
> > > > >>>>>> 2- Add functions that manipulate spatial data types. These will
> > > > >>>>>> be added as UDFs and we will not need to mess with the
> > > > >>>>>> internals of Pig. Most probably, there will be one new class
> > > > >>>>>> for each operation (e.g., union or intersection). I think it
> > > > >>>>>> will be good to put these new operations inside the core of Pig
> > > > >>>>>> so that users can use it without having to write the fully
> > > > >>>>>> qualified class name. Also, since there is no way to implicitly
> > > > >>>>>> cast a spatial data type to a non-spatial data types, there
> > > > >>>>>> will not be any conflicts in existing operations or new
> > > > >>>>>> operations. All new operations, and only the new operations,
> > > > >>>>>> will be working on spatial data types. Here is an initial list
> > > > >>>>>> of operations that can be added. All those operations are
> > > > >>>>>> already implemented in JTS and the UDFs added to Pig will be
> > > > >>>>>> just wrappers around them.
> > > > >>>>>> **Predicates (used for spatial filtering)
> > > > >>>>>> Equals
> > > > >>>>>> Disjoint
> > > > >>>>>> Intersects
> > > > >>>>>> Touches
> > > > >>>>>> Crosses
> > > > >>>>>> Within
> > > > >>>>>> Contains
> > > > >>>>>> Overlaps
> > > > >>>>>>
> > > > >>>>>> **Operations
> > > > >>>>>> Envelope
> > > > >>>>>> Area
> > > > >>>>>> Length
> > > > >>>>>> Buffer
> > > > >>>>>> ConvexHull
> > > > >>>>>> Intersection
> > > > >>>>>> Union
> > > > >>>>>> Difference
> > > > >>>>>> SymDifference
> > > > >>>>>>
> > > > >>>>>> **Aggregate functions
> > > > >>>>>> Accum
> > > > >>>>>> ConvexHull
> > > > >>>>>> Union
> > > > >>>>>>
> > > > >>>>>> 3- The third step is to implement spatial indexes (e.g., Grid
> > > > >>>>>> or R-tree). A Pig loader and Pig output classes will be created
> > > > >>>>>> for those indexes. Note that currently we have
> > > > >>>>>> SpatialOutputFormat and SpatialInputFormat for those indexes
> > > > >>>>>> inside the Spatial Hadoop project, but we need to tweak them to
> > > > >>>>>> work with Pig.
> > > > >>>>>>
> > > > >>>>>> 4- (Advanced) Implement more sophisticated algorithms for
> > > > >>>>>> spatial operations that utilize the indexes. For example, we
> > > > >>>>>> can have a specific algorithm for spatial range query or
> > > > >>>>>> spatial join. Again, we already have algorithms built for
> > > > >>>>>> different operations implemented in Spatial Hadoop as MapReduce
> > > > >>>>>> programs, but they will need to be modified to work in Pig
> > > > >>>>>> environment and get to work with other operations.
> > > > >>>>>>
> > > > >>>>>> This is my whole plan for the spatial extension to Pig. I've
> > > > >>>>>> already started with the first step but as I mentioned earlier,
> > > > >>>>>> I don't want to do the work for our project and then the work
> > > > >>>>>> gets forgotten. I want to contribute to Pig and do my research
> > > > >>>>>> at the same time. If you think the plan is plausible, I'll open
> > > > >>>>>> JIRA issues for the above tasks and start shipping patches to
> > > > >>>>>> do the stuff. I'll conform with the standards of the project
> > > > >>>>>> such as adding tests and well commenting the code.
> > > > >>>>>> Sorry for the long email and hope to hear back from you.
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> Best regards,
> > > > >>>>>> Ahmed Eldawy
> > > > >>>>>
> > > > >>>>>
> > > > >>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
>
