pig-dev mailing list archives

From Jonathan Coveney <jcove...@gmail.com>
Subject Re: A major addition to Pig. Working with spatial data
Date Mon, 06 May 2013 20:09:22 GMT
Nick: the only issue is that the way types are implemented in Pig doesn't
allow us to easily plug in types externally. Adding support for that
would be cool, but it would be a fair bit of work.
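
To make the bytearray workaround from earlier in the thread concrete, here
is a rough sketch of the serialize/deserialize round trip a UDF would pay
on every call. It is an illustration under stated assumptions, not the
project's code: WkbPointCodec and its methods are hypothetical names, it
handles only 2D points in OGC Well-Known Binary, and it uses nothing beyond
the JDK. A real UDF would more likely extend Pig's EvalFunc, take a
DataByteArray argument, and hand the bytes to JTS's WKBReader/WKBWriter.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration only; it is not part of Pig or JTS.
// It handles just 2D points in OGC Well-Known Binary (WKB).
final class WkbPointCodec {
    private static final byte NDR = 1;        // WKB byte-order flag: little-endian
    private static final int WKB_POINT = 1;   // WKB geometry type code for Point
    private static final int POINT_SIZE = 21; // 1 flag + 4 type + 8 x + 8 y

    // Serialize (x, y) into the bytes a bytearray field would carry.
    static byte[] encode(double x, double y) {
        ByteBuffer buf = ByteBuffer.allocate(POINT_SIZE).order(ByteOrder.LITTLE_ENDIAN);
        buf.put(NDR).putInt(WKB_POINT).putDouble(x).putDouble(y);
        return buf.array();
    }

    // Deserialize a WKB point into {x, y}, honoring the byte-order flag.
    static double[] decode(byte[] wkb) {
        if (wkb == null || wkb.length != POINT_SIZE) {
            throw new IllegalArgumentException("expected a 21-byte WKB point");
        }
        ByteOrder order = (wkb[0] == NDR) ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN;
        ByteBuffer buf = ByteBuffer.wrap(wkb).order(order);
        buf.get(); // skip the byte-order flag already inspected above
        if (buf.getInt() != WKB_POINT) {
            throw new IllegalArgumentException("not a WKB point");
        }
        double x = buf.getDouble();
        double y = buf.getDouble();
        return new double[] {x, y};
    }

    // What the body of a distance-style UDF would do on every call:
    // decode both arguments, then compute on the decoded values.
    static double distance(byte[] a, byte[] b) {
        double[] p = decode(a);
        double[] q = decode(b);
        return Math.hypot(p[0] - q[0], p[1] - q[1]);
    }
}
```

The decode step inside distance() is the per-call overhead mentioned in the
thread; a built-in Geometry type would let Pig pass the decoded object
around instead.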


2013/5/6 Nick Dimiduk <ndimiduk@gmail.com>

> I'm not a lawyer, but I see no reason why this cannot be an external
> extension to Pig. It would behave the same way PostGIS is an external
> extension to Postgres. Any Apache issues would be toward general
> purpose enhancements, not specific to your project.
>
> Good on you!
> -n
>
> On Mon, May 6, 2013 at 10:12 AM, Ahmed Eldawy <aseldawy@gmail.com> wrote:
>
> > I contacted the Solr developers to see how JTS can be included in an
> > Apache project. See
> > http://mail-archives.apache.org/mod_mbox/lucene-dev/201305.mbox/raw/%3C1367815102914-4060969.post%40n3.nabble.com%3E/
> > As far as I understand, they did not include it in the main Solr project;
> > rather, they created a separate project (Spatial4j) which is still
> > licensed under the Apache license and refers to JTS. Users will have to
> > download the JTS libraries separately to make it run. That's pretty much
> > the same plan that Jonathan mentioned. We will still have the overhead of
> > serializing/deserializing the shapes each time a function is called.
> > Also, we will have to use the ugly bytearray data type for spatial data
> > instead of creating its own data type (e.g., Geometry).
> > I think using Spatial4j instead of JTS will not be sufficient for our
> > case, as we need to provide access to all spatial functions of JTS such
> > as Union, Intersection, Difference, etc. This way we can claim conformity
> > with OGC standards, which gives us visibility and appreciation in the
> > spatial community.
> > I think this also means I will not add any issues to JIRA, as it is now
> > a separate project. I'm planning to host it on GitHub and keep all the
> > issues there.
> > Let me know if you have any suggestions or comments.
> >
> > Thanks
> > Ahmed
> >
> >
> > Best regards,
> > Ahmed Eldawy
> >
> >
> > On Mon, May 6, 2013 at 9:53 AM, Jonathan Coveney <jcoveney@gmail.com>
> > wrote:
> >
> > > You can give them all the same label or tag and filter on that later
> > > on.
> > >
> > >
> > > 2013/5/6 Ahmed Eldawy <aseldawy@gmail.com>
> > >
> > > > Thanks all for taking the time to respond. Daniel, I didn't know that
> > > > Solr uses JTS. This is a good finding and we can definitely ask them
> > > > to see if there is a workaround we can use. Jonathan, I thought of
> > > > the same idea of serializing/deserializing a bytearray each time a
> > > > UDF is called. The deserialization part is good for letting Pig
> > > > auto-detect spatial types if not set explicitly in the schema. What
> > > > is the best way to start this? I want to add an initial set of JIRA
> > > > issues and start working on them, but I also need to keep the work
> > > > grouped in some sense just for organization.
> > > >
> > > > Thanks
> > > > Ahmed
> > > >
> > > > Best regards,
> > > > Ahmed Eldawy
> > > >
> > > >
> > > > On Sat, May 4, 2013 at 4:47 PM, Jonathan Coveney <jcoveney@gmail.com>
> > > > wrote:
> > > >
> > > > > I agree that this is cool, and if other projects are using JTS it
> > > > > is worth talking to them to see how. I also agree that licensing is
> > > > > very frustrating.
> > > > >
> > > > > In the short term, however, while it is annoying to have to manage
> > > > > the serialization and deserialization yourself, you can have the
> > > > > geometry type be passed around as a bytearray type. Your UDFs will
> > > > > have to know this and treat it accordingly, but if you did this
> > > > > then all of the tools could be in an external project on GitHub
> > > > > instead of a branch in Pig. Then, if we can get the licensing done,
> > > > > we could add the Geometry type to Pig. Adding types, honestly, is
> > > > > kind of tedious but not super difficult, so once the rest is done,
> > > > > that shouldn't be too difficult.
> > > > >
> > > > >
> > > > > 2013/5/4 Russell Jurney <russell.jurney@gmail.com>
> > > > >
> > > > > > If a way could be found, this would be an awesome addition to
> > > > > > Pig.
> > > > > >
> > > > > > Russell Jurney http://datasyndrome.com
> > > > > >
> > > > > > On May 3, 2013, at 4:09 PM, Daniel Dai <daijy@hortonworks.com>
> > > > > > wrote:
> > > > > >
> > > > > > > I am not sure how other Apache projects are dealing with it. It
> > > > > > > seems Solr also has some connector to JTS?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Daniel
> > > > > > >
> > > > > > >
> > > > > > > On Thu, May 2, 2013 at 11:59 AM, Ahmed Eldawy
> > > > > > > <aseldawy@gmail.com> wrote:
> > > > > > >
> > > > > > >> Thanks Alan for your interest. It's too bad that an open source
> > > > > > >> licensing issue is holding me back from doing some open source
> > > > > > >> work. I understand the issue and your workarounds make sense.
> > > > > > >> However, as I mentioned in the beginning, I don't want to have
> > > > > > >> my own branch of Pig because it makes my extension less
> > > > > > >> portable. I'll think of another way to do it. I'll ask Vivid
> > > > > > >> Solutions if they can dual-license their code, although I think
> > > > > > >> the answer will be no. I'll also think of a way to ship my
> > > > > > >> extension as a set of jar files without the need to change the
> > > > > > >> core of Pig. This way, it can be easily ported to newer
> > > > > > >> versions of Pig.
> > > > > > >>
> > > > > > >> Thanks
> > > > > > >> Ahmed
> > > > > > >>
> > > > > > >> Best regards,
> > > > > > >> Ahmed Eldawy
> > > > > > >>
> > > > > > >>
> > > > > > >> On Thu, May 2, 2013 at 12:33 PM, Alan Gates
> > > > > > >> <gates@hortonworks.com> wrote:
> > > > > > >>
> > > > > > >>> I know this is frustrating, but the different licenses do have
> > > > > > >>> different requirements that make it so that Apache can't ship
> > > > > > >>> GPL code. A legal explanation is at
> > > > > > >>> http://www.apache.org/licenses/GPL-compatibility.html
> > > > > > >>> For additional info on the LGPL specific questions see
> > > > > > >>> http://www.apache.org/legal/3party.html
> > > > > > >>>
> > > > > > >>> As far as pulling it in via ivy, the issue isn't so much where
> > > > > > >>> the code lives as much as what code we are requiring to make
> > > > > > >>> Pig work. If something that is [L]GPL is required for Pig it
> > > > > > >>> violates Apache rules as outlined above. It also would be a
> > > > > > >>> show stopper for a lot of companies that redistribute Pig and
> > > > > > >>> that are allergic to GPL software.
> > > > > > >>>
> > > > > > >>> So, as I said before, if you wanted to continue with that
> > > > > > >>> library and they are not willing to relicense it then it would
> > > > > > >>> have to be bolted on after Apache Pig is built. Nothing stops
> > > > > > >>> you from doing this by downloading Apache Pig, adding this
> > > > > > >>> library and your code, and redistributing, though it wouldn't
> > > > > > >>> then be open to all Pig users.
> > > > > > >>>
> > > > > > >>> Alan.
> > > > > > >>>
> > > > > > >>> On May 1, 2013, at 6:08 PM, Ahmed Eldawy wrote:
> > > > > > >>>
> > > > > > >>>> Thanks for your response. I was never good at differentiating
> > > > > > >>>> all those open source licenses. I mean, what is the point of
> > > > > > >>>> making open source licenses if they block me from using a
> > > > > > >>>> library in an open source project? Anyway, I'm not going to
> > > > > > >>>> debate this here. Just one question: if we use JTS as a
> > > > > > >>>> library (jar file) without adding the code in Pig, is it
> > > > > > >>>> still a violation? We'll use ivy, for example, to download
> > > > > > >>>> the jar file when compiling.
> > > > > > >>>> On May 1, 2013 7:50 PM, "Alan Gates" <gates@hortonworks.com>
> > > > > > >>>> wrote:
> > > > > > >>>>
> > > > > > >>>>> Passing on the technical details for a moment, I see a
> > > > > > >>>>> licensing issue. JTS is licensed under LGPL. Apache projects
> > > > > > >>>>> cannot contain or ship [L]GPL. Apache does not meet the
> > > > > > >>>>> requirements of GPL and thus we cannot repackage their code.
> > > > > > >>>>> If you wanted to go forward using that class this would have
> > > > > > >>>>> to be packaged as an add-on that was downloaded separately
> > > > > > >>>>> and not from Apache. Another option is to work with the JTS
> > > > > > >>>>> community and see if they are willing to dual license their
> > > > > > >>>>> code under BSD or Apache license so that Pig could include
> > > > > > >>>>> it. If neither of those is an option you would need to come
> > > > > > >>>>> up with a new class to contain your spatial data.
> > > > > > >>>>>
> > > > > > >>>>> Alan.
> > > > > > >>>>>
> > > > > > >>>>> On May 1, 2013, at 5:40 PM, Ahmed Eldawy wrote:
> > > > > > >>>>>
> > > > > > >>>>>> Hi all,
> > > > > > >>>>>> First, sorry for the long email. I wanted to put all my
> > > > > > >>>>>> thoughts here and get your feedback.
> > > > > > >>>>>> I'm proposing a major addition to Pig that will greatly
> > > > > > >>>>>> increase its functionality and user base. It is simply to
> > > > > > >>>>>> add spatial support to the language and the framework. I've
> > > > > > >>>>>> already started working on that, but I don't want it to be
> > > > > > >>>>>> just another branch. I want it, eventually, to be merged
> > > > > > >>>>>> with the trunk of Apache Pig. So, I'm sending this email
> > > > > > >>>>>> mainly to reach out to the main contributors of Pig to see
> > > > > > >>>>>> the feasibility of this.
> > > > > > >>>>>> This addition is a part of a big project we have been
> > > > > > >>>>>> working on at the University of Minnesota; the project is
> > > > > > >>>>>> called Spatial Hadoop. http://spatialhadoop.cs.umn.edu. It's
> > > > > > >>>>>> about building a MapReduce framework (Hadoop) that is
> > > > > > >>>>>> capable of maintaining and analyzing spatial data
> > > > > > >>>>>> efficiently. I'm the main guy behind that project and since
> > > > > > >>>>>> we released its first version, we have received very
> > > > > > >>>>>> encouraging responses from different groups in the research
> > > > > > >>>>>> and industrial community. I'm sure the addition we want to
> > > > > > >>>>>> make to Pig Latin will be widely accepted by the people in
> > > > > > >>>>>> the spatial community.
> > > > > > >>>>>> I'm proposing a plan here while we're still in the early
> > > > > > >>>>>> phases of this task to be able to discuss it with the main
> > > > > > >>>>>> contributors and see its feasibility. First of all, I think
> > > > > > >>>>>> that we need to change the core of Pig to be able to support
> > > > > > >>>>>> spatial data. Providing a set of UDFs only is not enough.
> > > > > > >>>>>> The main reason is that Pig Latin does not provide a way to
> > > > > > >>>>>> create a new data type, which is needed for spatial data.
> > > > > > >>>>>> Once we have the spatial data types we need, the
> > > > > > >>>>>> functionality can be expanded using more UDFs.
> > > > > > >>>>>>
> > > > > > >>>>>> Here's the plan as I see it.
> > > > > > >>>>>> 1- Introduce a new primitive data type Geometry which
> > > > > > >>>>>> represents all spatial data types. In the underlying system,
> > > > > > >>>>>> this will map to com.vividsolutions.jts.geom.Geometry. This
> > > > > > >>>>>> is a class from Java Topology Suite (JTS)
> > > > > > >>>>>> [http://www.vividsolutions.com/jts/JTSHome.htm], a stable
> > > > > > >>>>>> and efficient open source Java library for spatial data
> > > > > > >>>>>> types and algorithms. It is very popular in the spatial
> > > > > > >>>>>> community and a C++ port of it is used in PostGIS
> > > > > > >>>>>> [http://postgis.net/] (a spatial library for Postgres). JTS
> > > > > > >>>>>> also conforms to the open standards of the Open Geospatial
> > > > > > >>>>>> Consortium (OGC) [http://www.opengeospatial.org/] for
> > > > > > >>>>>> spatial data types. The Geometry data type is read from and
> > > > > > >>>>>> written to text files using the Well Known Text (WKT)
> > > > > > >>>>>> format. There is also a way to convert it to/from binary so
> > > > > > >>>>>> that it can work with binary files and streams.
> > > > > > >>>>>> 2- Add functions that manipulate spatial data types. These
> > > > > > >>>>>> will be added as UDFs and we will not need to mess with the
> > > > > > >>>>>> internals of Pig. Most probably, there will be one new class
> > > > > > >>>>>> for each operation (e.g., union or intersection). I think it
> > > > > > >>>>>> will be good to put these new operations inside the core of
> > > > > > >>>>>> Pig so that users can use them without having to write the
> > > > > > >>>>>> fully qualified class name. Also, since there is no way to
> > > > > > >>>>>> implicitly cast a spatial data type to a non-spatial data
> > > > > > >>>>>> type, there will not be any conflicts in existing operations
> > > > > > >>>>>> or new operations. All new operations, and only the new
> > > > > > >>>>>> operations, will be working on spatial data types. Here is
> > > > > > >>>>>> an initial list of operations that can be added. All those
> > > > > > >>>>>> operations are already implemented in JTS and the UDFs added
> > > > > > >>>>>> to Pig will be just wrappers around them.
> > > > > > >>>>>> **Predicates (used for spatial filtering)
> > > > > > >>>>>> Equals
> > > > > > >>>>>> Disjoint
> > > > > > >>>>>> Intersects
> > > > > > >>>>>> Touches
> > > > > > >>>>>> Crosses
> > > > > > >>>>>> Within
> > > > > > >>>>>> Contains
> > > > > > >>>>>> Overlaps
> > > > > > >>>>>>
> > > > > > >>>>>> **Operations
> > > > > > >>>>>> Envelope
> > > > > > >>>>>> Area
> > > > > > >>>>>> Length
> > > > > > >>>>>> Buffer
> > > > > > >>>>>> ConvexHull
> > > > > > >>>>>> Intersection
> > > > > > >>>>>> Union
> > > > > > >>>>>> Difference
> > > > > > >>>>>> SymDifference
> > > > > > >>>>>>
> > > > > > >>>>>> **Aggregate functions
> > > > > > >>>>>> Accum
> > > > > > >>>>>> ConvexHull
> > > > > > >>>>>> Union
> > > > > > >>>>>>
> > > > > > >>>>>> 3- The third step is to implement spatial indexes (e.g.,
> > > > > > >>>>>> Grid or R-tree). A Pig loader and Pig output classes will be
> > > > > > >>>>>> created for those indexes. Note that currently we have
> > > > > > >>>>>> SpatialOutputFormat and SpatialInputFormat for those indexes
> > > > > > >>>>>> inside the Spatial Hadoop project, but we need to tweak them
> > > > > > >>>>>> to work with Pig.
> > > > > > >>>>>>
> > > > > > >>>>>> 4- (Advanced) Implement more sophisticated algorithms for
> > > > > > >>>>>> spatial operations that utilize the indexes. For example, we
> > > > > > >>>>>> can have a specific algorithm for spatial range query or
> > > > > > >>>>>> spatial join. Again, we already have algorithms built for
> > > > > > >>>>>> different operations implemented in Spatial Hadoop as
> > > > > > >>>>>> MapReduce programs, but they will need to be modified to
> > > > > > >>>>>> work in the Pig environment and to work with other
> > > > > > >>>>>> operations.
> > > > > > >>>>>>
> > > > > > >>>>>> This is my whole plan for the spatial extension to Pig. I've
> > > > > > >>>>>> already started with the first step but, as I mentioned
> > > > > > >>>>>> earlier, I don't want to do the work for our project and
> > > > > > >>>>>> then have the work forgotten. I want to contribute to Pig
> > > > > > >>>>>> and do my research at the same time. If you think the plan
> > > > > > >>>>>> is plausible, I'll open JIRA issues for the above tasks and
> > > > > > >>>>>> start shipping patches. I'll conform with the standards of
> > > > > > >>>>>> the project, such as adding tests and commenting the code
> > > > > > >>>>>> well.
> > > > > > >>>>>> Sorry for the long email and hope to hear back from you.
> > > > > > >>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>> Best regards,
> > > > > > >>>>>> Ahmed Eldawy
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>
