spark-dev mailing list archives

From Kyle Ellrott <kellr...@soe.ucsc.edu>
Subject Re: Implementing TinkerPop on top of GraphX
Date Tue, 18 Nov 2014 22:23:07 GMT
The new Tinkerpop3 API was different enough from V2 that it was worth
starting a new implementation rather than trying to completely refactor my
old code.
I've started a new project: https://github.com/kellrott/spark-gremlin which
compiles and runs the first set of unit tests (which it completely fails).
Most of the classes are structured in the same way they are in the Giraph
implementation. There isn't much actual GraphX code in the project yet,
just a framework to start working in.
Hopefully this will keep the conversation going.

Kyle

On Fri, Nov 7, 2014 at 11:17 AM, Kushal Datta <kushal.datta@gmail.com>
wrote:

> I think if we are going to use GraphX as the query engine in Tinkerpop3,
> then the Tinkerpop3 community is the right platform to further the
> discussion.
>
> The reason I asked the question about improving the APIs in GraphX is that
> this need not be limited to Gremlin: any graph DSL can exploit the GraphX
> APIs. Cypher has some good subgraph matching query interfaces which I
> believe can be distributed using the GraphX APIs.
>
> An edge ID is an internal attribute of the edge generated automatically,
> mostly hidden from the user. That's why adding it as an edge property might
> not be a good idea. There are several small differences like this. E.g. in
> the Tinkerpop3 Gremlin implementation for Giraph, only vertex programs are
> executed in Giraph directly. The side-effect operators are mapped to
> Map-Reduce functions. In the implementation we are talking about, all of
> these operations can be done within GraphX. I would be interested in
> co-developing the query engine.
>
> @Reynold, I agree. And as I said earlier, the APIs should be designed in
> such a way that they can be used by any graph DSL.
>
> On Fri, Nov 7, 2014 at 10:59 AM, Kyle Ellrott <kellrott@soe.ucsc.edu>
> wrote:
>
>> Who here would be interested in helping to work on an implementation of
>> the Tinkerpop3 Gremlin API for Spark? Is this something that should continue
>> in the Spark discussion group, or should it migrate to the Gremlin message
>> group?
>>
>> Reynold is right that there will be inherent mismatches in the APIs, and
>> there will need to be some discussions with the GraphX group about the best
>> way to go. One example would be edge ids. GraphX has vertex ids, but no
>> explicit edge ids, while Gremlin has both. Edge ids could be put into the
>> attr field, but then that means the user would have to explicitly subclass
>> their edge attribute to the edge attribute interface. Is that worth doing,
>> versus adding an id to everyone's edges?
>>
>> Kyle
>>
>>
>> On Thu, Nov 6, 2014 at 7:24 PM, Reynold Xin <rxin@databricks.com> wrote:
>>
>>> Some form of graph querying support would be great to have. This can be
>>> a great community project hosted outside of Spark initially, both due to
>>> the maturity of the component itself as well as the maturity of query
>>> language standards (there isn't really a dominant standard for graph ql).
>>>
>>> One thing is that the GraphX API will need to evolve and will probably need
>>> to provide more primitives in order to support the new ql implementation.
>>> There might also be inherent mismatches in the way the external API is
>>> defined vs what GraphX can support. We should discuss those on a
>>> case-by-case basis.
>>>
>>>
>>> On Thu, Nov 6, 2014 at 5:42 PM, Kyle Ellrott <kellrott@soe.ucsc.edu>
>>> wrote:
>>>
>>>> I think it's best to look to an existing standard rather than try to make
>>>> your own. Of course small additions would need to be made to make it
>>>> valuable for the Spark community, like a method similar to Gremlin's
>>>> 'table' function, that produces an RDD instead.
>>>> But there may be a lot of extra code and data structures that would
>>>> need to be added to make it work, and those may not be directly applicable
>>>> to all GraphX users. I think it would be best run as a separate
>>>> module/project that builds directly on top of GraphX.
>>>>
>>>> Kyle
>>>>
>>>>
>>>>
>>>> On Thu, Nov 6, 2014 at 4:39 PM, York, Brennon <
>>>> Brennon.York@capitalone.com> wrote:
>>>>
>>>>> My personal 2c is that, since GraphX is just beginning to provide a
>>>>> full featured graph API, I think it would be better to align with the
>>>>> TinkerPop group rather than roll our own. In my mind the benefits
>>>>> outweigh the detriments as follows:
>>>>>
>>>>> Benefits:
>>>>> * GraphX gains the ability to become another core tenant within the
>>>>> TinkerPop community allowing a more diverse group of users into the Spark
>>>>> ecosystem.
>>>>> * TinkerPop can continue to maintain and own a solid / feature-rich
>>>>> graph API that has already been accepted by a wide audience, relieving
>>>>> the pressure of “one off” API additions from the GraphX team.
>>>>> * GraphX can demonstrate its ability to be a key player in the GraphDB
>>>>> space sitting in line with other major distributions (Neo4j, Titan, etc.).
>>>>> * Allows for the abstract graph traversal logic (query API) to be
>>>>> owned and maintained by a group already proven on the topic.
>>>>>
>>>>> Drawbacks:
>>>>> * GraphX doesn’t own the API for its graph query capability. This
>>>>> could be seen as good or bad, but it might make GraphX-specific
>>>>> implementation additions more tricky (possibly). Also, GraphX will need
>>>>> to maintain the features described within the TinkerPop API as that might
>>>>> change in the future.
>>>>>
>>>>> From: Kushal Datta <kushal.datta@gmail.com>
>>>>> Date: Thursday, November 6, 2014 at 4:00 PM
>>>>> To: "York, Brennon" <brennon.york@capitalone.com>
>>>>> Cc: Kyle Ellrott <kellrott@soe.ucsc.edu>, Reynold Xin <
>>>>> rxin@databricks.com>, "dev@spark.apache.org" <dev@spark.apache.org>,
>>>>> Matthias Broecheler <matthias@thinkaurelius.com>
>>>>>
>>>>> Subject: Re: Implementing TinkerPop on top of GraphX
>>>>>
>>>>> Before we dive into the implementation details, what are the high
>>>>> level thoughts on Gremlin/GraphX? Scala already provides the procedural
>>>>> way to query graphs in GraphX today. So, today I can run
>>>>> g.vertices().filter().join() queries as OLAP in GraphX just like Tinkerpop3
>>>>> Gremlin, of course sans the useful operators that Gremlin offers such
>>>>> as outE, inE, loop, as, dedup, etc. In that case, is mapping Gremlin
>>>>> operators to GraphX APIs a better approach, or should we extend the
>>>>> existing set of transformations/actions that GraphX already offers with
>>>>> the useful operators from Gremlin? For example, we could add as(), loop()
>>>>> and dedup() methods in VertexRDD and EdgeRDD.
>>>>>
>>>>> Either way we get a desperately needed graph query interface in GraphX.
>>>>>
>>>>> On Thu, Nov 6, 2014 at 3:25 PM, York, Brennon <
>>>>> Brennon.York@capitalone.com> wrote:
>>>>>
>>>>>> This was my thought exactly with the TinkerPop3 release. Looks like,
>>>>>> to move this forward, we’d need to implement gremlin-core per <
>>>>>> http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core>.
>>>>>> The real question lies in whether GraphX can only support the OLTP
>>>>>> functionality, or if we can bake into it the OLAP requirements as
>>>>>> well. At a first glance I believe we could create an entire OLAP system.
>>>>>> If so, I believe we could do this in a set of parallel subtasks, those
>>>>>> being the implementation of each of the individual APIs (Structure,
>>>>>> Process, and, if OLAP, GraphComputer) necessary for gremlin-core.
>>>>>> Thoughts?
>>>>>>
>>>>>>
>>>>>> From: Kyle Ellrott <kellrott@soe.ucsc.edu>
>>>>>> Date: Thursday, November 6, 2014 at 12:10 PM
>>>>>> To: Kushal Datta <kushal.datta@gmail.com>
>>>>>> Cc: Reynold Xin <rxin@databricks.com>, "York, Brennon" <
>>>>>> brennon.york@capitalone.com>, "dev@spark.apache.org" <
>>>>>> dev@spark.apache.org>, Matthias Broecheler <
>>>>>> matthias@thinkaurelius.com>
>>>>>> Subject: Re: Implementing TinkerPop on top of GraphX
>>>>>>
>>>>>> I still have to dig into the Tinkerpop3 internals (I started my work
>>>>>> long before it had been released), but I can say that getting the
>>>>>> Tinkerpop2 Gremlin pipeline to work in GraphX was a bit of a hack. The
>>>>>> whole Tinkerpop2 Gremlin design was based around streaming pipes of
>>>>>> data, rather than large distributed map-reduce operations. I had to hack
>>>>>> the pipes to aggregate all of the data and pass a single object wrapping
>>>>>> the GraphX RDDs down the pipes in a single go, rather than streaming it
>>>>>> element by element.
>>>>>> Just based on their description, Tinkerpop3 may be more amenable to
>>>>>> the Spark platform.
>>>>>>
>>>>>> Kyle
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 6, 2014 at 11:55 AM, Kushal Datta <kushal.datta@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> What do you guys think about the Tinkerpop3 Gremlin interface?
>>>>>>> It has MapReduce to run Gremlin operators in a distributed manner
>>>>>>> and Giraph to execute vertex programs.
>>>>>>>
>>>>>>> Tinkerpop3 is better suited for GraphX.
>>>>>>>
>>>>>>> On Thu, Nov 6, 2014 at 11:48 AM, Kyle Ellrott <kellrott@soe.ucsc.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I've taken a crack at implementing the TinkerPop Blueprints API in
>>>>>>>> GraphX ( https://github.com/kellrott/sparkgraph ). I've also
>>>>>>>> implemented portions of the Gremlin Search Language and a Parquet
>>>>>>>> based graph store.
>>>>>>>> I've been working to finalize some code details and put together
>>>>>>>> better code examples and documentation before I start telling people
>>>>>>>> about it.
>>>>>>>> But if you want to start looking at the code, I can answer any
>>>>>>>> questions you have. And if you would like to contribute, I would
>>>>>>>> really appreciate the help.
>>>>>>>>
>>>>>>>> Kyle
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 6, 2014 at 11:42 AM, Reynold Xin <rxin@databricks.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> > cc Matthias
>>>>>>>> >
>>>>>>>> > In the past we talked with Matthias and there were some
>>>>>>>> > discussions about this.
>>>>>>>> >
>>>>>>>> > On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon <
>>>>>>>> > Brennon.York@capitalone.com>
>>>>>>>> > wrote:
>>>>>>>> >
>>>>>>>> > > All, I was wondering if there had been any discussion around
>>>>>>>> > > this topic yet? TinkerPop <https://github.com/tinkerpop> is a
>>>>>>>> > > great abstraction for graph databases; it has been implemented
>>>>>>>> > > across various graph database backends and is gaining traction.
>>>>>>>> > > Has anyone thought about integrating the TinkerPop framework
>>>>>>>> > > with GraphX to enable GraphX as another backend? Not sure if
>>>>>>>> > > this has been brought up or not, but I would certainly volunteer
>>>>>>>> > > to spearhead this effort if the community thinks it to be a good
>>>>>>>> > > idea!
>>>>>>>> > >
>>>>>>>> > > As an aside, I wasn't sure if this discussion should happen on
>>>>>>>> > > the board here or on JIRA, but I made a ticket as well for
>>>>>>>> > > reference:
>>>>>>>> > > https://issues.apache.org/jira/browse/SPARK-4279
>>>>>>>> > >
>>>>>>>> > > ________________________________________________________
>>>>>>>> > >
>>>>>>>> > > The information contained in this e-mail is confidential and/or
>>>>>>>> > > proprietary to Capital One and/or its affiliates. The
>>>>>>>> > > information transmitted herewith is intended only for use by
>>>>>>>> > > the individual or entity to which it is addressed. If the reader
>>>>>>>> > > of this message is not the intended recipient, you are hereby
>>>>>>>> > > notified that any review, retransmission, dissemination,
>>>>>>>> > > distribution, copying or other use of, or taking of any action
>>>>>>>> > > in reliance upon this information is strictly prohibited. If you
>>>>>>>> > > have received this communication in error, please contact the
>>>>>>>> > > sender and delete the material from your computer.
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> > > ---------------------------------------------------------------------
>>>>>>>> > > To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>> > > For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> >
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
